Mathematica has a cross platform data exchange format, WDX. Unfortunately importing or exporting large data from/to WDX is very slow. Using MX files is very fast, but they are not compatible across different computer architectures (32 or 64 bit).

Sometimes it is suggested to Compress the data and write it out or read it back in manually.
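For concreteness, here is a minimal sketch of that manual round trip. The sample data and the file name are made up for illustration; only Compress, Uncompress, Export and Import with the "String" format are used.

```mathematica
(* Compress turns an arbitrary expression into a plain ASCII string *)
data = {1, 2.5, "text", x^2 + Sin[y]};
str = Compress[data];

(* Write the string out as plain text and read it back;
   the file name is a hypothetical example *)
file = FileNameJoin[{$TemporaryDirectory, "manual-compress-test.txt"}];
Export[file, str, "String"];
restored = Uncompress[Import[file, "String"]];

(* Check that the round trip reproduced the original expression *)
restored === data
```

This works, but every caller has to remember the Compress/Uncompress step, which is what integrating it into Import and Export would avoid.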

Question: How can we extend Import and Export to allow convenient and fast importing/exporting of arbitrary Mathematica expressions using Compress?

The aim is to define a fast-to-load format, and make importing from it as easy as Import["data.mmaz"] or Import["http://server.com/myfile.mmaz"], by integrating it into the Import/Export framework. Ideally the format should be recognized based on a file extension.

If there is a better solution than using Compress, I'd like to hear it!

Szabolcs
  • I'm planning to do this today. I'll probably post an answer in a day or so if I manage. But I'd like to see some better solutions first :-) – Szabolcs Feb 03 '12 at 10:47
  • Not an answer for you but I moved all my data from WDX to a MySQL database a couple of years ago. One of the best things I did. – Mike Honeychurch Feb 03 '12 at 12:03
  • @Mike I didn't forget about your suggestion, but this question is mainly about how to develop an Import/Export converter. I thought it would be good to have a practical tutorial on this here at Mma.SE – Szabolcs Feb 03 '12 at 12:07
  • @Szabolcs, further down you mention something about tutorials, have looked at these tutorial/DevelopingAnExportConverter, tutorial/DevelopingAnImportConverter? –  Feb 03 '12 at 13:29
  • @ruebenko Yes, I am reading those now. This was more of an experiment about this: http://meta.mathematica.stackexchange.com/q/194/12 But now I think I asked the wrong question. – Szabolcs Feb 03 '12 at 13:35
  • I'm interested in a faster alternative to wdx (ideally as fast as .mx) that is portable. It doesn't have to be Compressed. Any hints? – masterxilo Sep 13 '16 at 16:40
  • @masterxilo Since v10, MX is portable between different OS for as long as the pointer size (64 or 32 bit) is the same. It is also backwards compatible: old MX files can be loaded in newer Mathematica. Support has confirmed this to me once, and I posted it here, but can't find it now. – Szabolcs Sep 13 '16 at 20:12
  • @masterxilo Otherwise the Compress trick is good: reading files is very fast, much faster than WDX. Writing can be a bit slow sometimes. I don't know of anything else except plain .m files. – Szabolcs Sep 13 '16 at 20:13

3 Answers

In this case, developing the converters is dead-easy (which is not a good thing IMO, since it means that we really don't utilize the power of Import/Export framework, but rather are adding syntactic sugar):

CompressedFormat`CompressedFormatImport[filename_String, options___] :=
    {"Data" -> Uncompress@Import[filename, "String"]};

CompressedFormat`CompressedFormatExport[filename_String, data_, opts___] :=
    Export[filename, Compress@data, "String"];

ImportExport`RegisterImport[
   "CompressedFormat",
   CompressedFormat`CompressedFormatImport
]

ImportExport`RegisterExport[
   "CompressedFormat", 
   CompressedFormat`CompressedFormatExport 
]

Example:

file = $TemporaryPrefix <> "test";
Export[file, Range[1000000], "CompressedFormat"];
Import[file, {"CompressedFormat", "Data"}] // Length

(* 
  ==>  1000000
*)

That said, I think using the Import/Export framework makes much more sense for specific formats where you can specify distinct elements, and where the framework makes it convenient to create importers for those elements (possibly avoiding full imports when unnecessary). So, for a meaningful exposition of the importer-writing procedure using the Import/Export framework, a particular graphics or numerical format would be a better choice IMO, because your stated goal is too general for that.

For that matter, I think that my large data framework (perhaps when extended and generalized) will make for a much better case for Import/Export framework use, as well as cover your use case and many more, because it:

  • Uses Compress under the hood
  • Uses lazy loading, which opens many possibilities to define certain elements for Import/Export, which are loaded individually / efficiently
  • Does not have a limitation that the file must fit in memory
  • Can be very fast for large files
  • In practice, we use large files much more frequently than carry them around from platform to platform. My framework can switch from extremely fast .mx files to Compress-ed non-.mx files very easily, and the details can be completely hidden from the user, who will just use Import in all cases, and have great performance.

In other words, I feel that the direction I outlined there contains your suggestion as a special case, and is much more fruitful both for further development of the large-data framework / file format and for utilizing the power of the Import/Export framework (and, sure enough, this is the direction in which I will be extending the large-data framework in the future).

Leonid Shifrin
  • @Szabolcs I did benchmark that before posting - Import was 3 times faster. As I said, you are mixing two questions: essential question on loading data fast and cross-platform, with a question of using Import - Export, which in this case is IMO more like syntactic sugar (because your format does not have any complex structure). I was mostly answering the Import/Export part, and did not intend to fully answer the loading / saving part (for reasons I outlined). Will look into automatic file extension resolution. Feel free to edit and add benchmarks and whatever else you'd like. – Leonid Shifrin Feb 03 '12 at 13:44
  • Yes, you are right, the question was not well formulated. Now it feels like I should not have asked, especially since the tutorials in the documentation are good enough. – Szabolcs Feb 03 '12 at 13:49
  • @Szabolcs I think the part regarding the fast cross-platform load/save is a good and valid question - I just think that we'd be better off addressing the more general one by extending the framework I posted. As to the Import/Export, some simple but composite format like some custom tabular data would probably be better, but then there is an example like that in the docs, as you said. – Leonid Shifrin Feb 03 '12 at 14:05
  • @Szabolcs I actually think this was a useful experiment, and now I think that the answer to that question of Mr.Wizard is: ask a question in two cases - either when you do know the answer really well, but think that the question and answers would benefit the community, or when you don't, and need it for yourself. Both cases are fine because the asker has put in a lot of thought in either case. Mixing them is not since then this may not be the case. I will post a version of this on meta as another answer, to add to yours. – Leonid Shifrin Feb 03 '12 at 14:10
  • I'll clean up the mess later, perhaps European-tonight ... got to finish some work today – Szabolcs Feb 03 '12 at 14:35
  • @Szabolcs I am not sure why you consider this a mess. I have lots of code which uses Import with .mx files at the moment, and I would rather not change it to something else in the middle of the project if I wish to transfer the data elsewhere. So I found this useful (and surprisingly easy), and am thankful for the question (and Leonid's answer). – acl Feb 03 '12 at 15:36
  • Regarding file extensions, I found that the mappings are in SystemFiles/Formats/ExtensionMappings.m. They are loaded into the variable System`ConvertersDump`ExtensionMappings, which seems to not be overwritten, only appended to (i.e. manual modifications stay). If I include an extension in one of these places, Export will work correctly without a need to explicitly specify the format. Import will interpret the data as string though. Do you think it's okay to ask a new question on this? – Szabolcs Feb 03 '12 at 16:40
  • @Szabolcs I went the other way and hacked the internal code a bit. I was basically appending to System`ConvertersDump`ExtensionMappings directly, but found that this was not enough. Your method is probably preferable. But, for this case, I would simply create a dynamic environment in which I would explicitly overload Import and Export, perhaps through the Gayley-Villegas trick. This has the advantage that whoever uses it does not have to change anything in the installation, but the possible disadvantage that this code has to be run every time. – Leonid Shifrin Feb 03 '12 at 16:52
  • @Szabolcs Well - I'd just repeat what I suggested: ask yourself - did you put in enough effort :)? Only you know the answer - if yes, then go ahead and ask. But you don't have to listen to me :) – Leonid Shifrin Feb 03 '12 at 16:56

One simple way to store data in compressed form could use the following:

ExportCompressed[filename_,data_]:=
    Export[filename,"Uncompress@"<>"\""<>Compress[data]<>"\"","String"]

This simply compresses and prepends the Uncompress statement to the resulting string. You can now simply use Get[] to import your data.
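As a usage sketch (the file name and the test data are hypothetical, and the definition is repeated so that the snippet runs on its own), the round trip through Get would look roughly like this:

```mathematica
(* Definition repeated from above so this snippet is self-contained *)
ExportCompressed[filename_, data_] :=
    Export[filename, "Uncompress@" <> "\"" <> Compress[data] <> "\"", "String"]

(* Hypothetical file name in the temporary directory *)
file = FileNameJoin[{$TemporaryDirectory, "compressed-test.m"}];
ExportCompressed[file, Range[10]];

(* The file contains the text Uncompress@"...", so Get evaluates it
   and returns the original expression *)
Get[file] === Range[10]
```

The string produced by Compress is expected to contain no quotes or backslashes, which is why it can be wrapped in string delimiters without escaping.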

I use this to store compressed graphics expressions. Compressing can take a long time (I'd like to see that sped up considerably, because several minutes for a few MB of graphics expressions is far too long), but you usually get very good compression.

On the other hand, import of these expressions is really fast. This seems kind of related to the WDX performance.

rcollyer
Yves Klett
  • Hi Yves, welcome to the party! –  Feb 03 '12 at 12:26
  • Finally, a question that was not way over my head and/or answered immediately - just what I was looking for to get my foot into the door. – Yves Klett Feb 03 '12 at 12:50
  • Hi Yves, welcome to Mathematica.SE! Could you complement this with some code that integrates this into the Import and Export functions? – Szabolcs Feb 03 '12 at 12:56
  • @Szabolcs: If you Import[] such a file, you get a string. You can then use ToExpression to evaluate this and get back your uncompressed data. Using Get[] does the evaluation on the spot. I am not sure how to properly splice this transparently into Import and Export. – Yves Klett Feb 03 '12 at 13:18
  • @Yves I know that, but directly integrating into the Import/Export framework has many advantages: I don't need to change my code when I change the file format: just pass it different file names. Also, web import will work out of the box. – Szabolcs Feb 03 '12 at 13:21
  • @Szabolcs: I certainly agree. My version is a quick 'n' dirty fix that somehow survived the ages in my packages. Apart from that, the question is why WDX and Compress are so awfully (sometimes unusably) slow in some cases. – Yves Klett Feb 03 '12 at 13:27

A bit late, but the extension can be registered using RegisterFormat from the Wolfram Function Repository:

https://resources.wolframcloud.com/FunctionRepository/resources/RegisterFormat

GenericAccountName