
I've got this CSV file I've imported that has tens of millions of lines in it. It takes around 20 minutes to import. I've been working with it for a while and have the processed data spread out in a bunch of variables.

Now Windows is bugging me that I need to restart the computer.

I thought about gathering all the data up into a table and then exporting and importing it, but that would be a lot of hassle and take ages. I also thought about just saving the notebook and re-evaluating it, but with this amount of data that would also take a long time.

I wonder what the best way is to save all the data so that I can get it back after restarting the computer. Something fast and with a minimum of hassle would be great.

PS. I have no idea how to tag this thing. There is apparently no big-data tag.

Sjoerd C. de Vries
Mr Alpha

2 Answers


Assuming you haven't placed your variables in a non-standard context, you can save them all at once using DumpSave's second syntax form, which saves everything in the indicated context.

Quit[] (* start a fresh kernel *)

x = 1; (* define some symbols *)
y = 2;
z[x_] := x^2

Names["Global`*"] (* Check they're there *)

(* ==> {"x", "y", "z"}  *)

(* Save everything in the context *)
DumpSave["C:\\Users\\Sjoerd\\Desktop\\dump.mx", "Global`"];    

Quit[] (* kill kernel to simulate a new start *)

Names["Global`*"] (* Are we clean? *)
(* ==> {} *)

(* Get the saved symbols *)
<< "C:\\Users\\Sjoerd\\Desktop\\dump.mx"

(* Are they there? *)
Names["Global`*"]    
(* ==> {"x", "y", "z"} *)

z[y]    
(* ==> 4 *)
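
As an aside, DumpSave also accepts an explicit symbol or a list of symbols, in case only a few variables need saving rather than the whole context. A minimal sketch using the definitions above (the file name partial.mx is just a placeholder):

(* save only x and z; placeholder path *)
DumpSave["C:\\Users\\Sjoerd\\Desktop\\partial.mx", {x, z}];

(* restore them later under their original names *)
<< "C:\\Users\\Sjoerd\\Desktop\\partial.mx"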
Sjoerd C. de Vries

You can export to the MX format:

Export["out.mx", data, "MX"]

Be aware that this format is not portable between different computer architectures (e.g. 32/64 bit).

Import it using

data = Import["out.mx"];

This is the fastest available format in Mathematica. Most likely you can't do better than this, unless you write an interface to a specialized external library.
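
If the data lives in several separate variables, a simple approach (a sketch; a, b and c are placeholder names) is to bundle them into one list for the round trip:

a = Range[10]; b = RandomReal[1, {100, 3}]; c = "some text"; (* placeholder data *)

Export["out.mx", {a, b, c}, "MX"];

(* after the restart *)
{a, b, c} = Import["out.mx"];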


Szabolcs
  • @Yves I feel we need a "canonical question" for this, but I wasn't able to find an existing question which would be suitable without a lot of editing. Maybe eventually we can repurpose this one. This question is quite straightforward, without many extra details, but it'll be necessary to write a good answer which explains the caveats of MX, the difference between DumpSave and Export and mentions Export["out.mz", Compress[expr], "String"] ... – Szabolcs May 11 '12 at 12:52
  • ... as a cross-platform alternative (something like Mr. Wizard's answer, made a bit more beginner friendly). – Szabolcs May 11 '12 at 12:53
  • This still means I have to do it separately for each variable or collect all the data up into one table? – Mr Alpha May 11 '12 at 13:23
  • A .mx file is relatively compact, but it isn't compressed. Export["file.mx.gz", expr] or Export["file.mx", Compress[expr]] are also allowable. I think the .mx.gz should be fastest to import, while the smallest file size seems to result from Export["file.mx", Compress[expr, Method -> {"Version" -> 2}]]. – Oleksandr R. May 11 '12 at 13:24
  • @MrAlpha If you use Export, you have to build a single expression that you can export. For example, if you have variables a, b, c, then Export["data.mx", {a,b,c}, "MX"]. If you use DumpSave, then this is not necessary (please check the DumpSave docs, however; it'll recreate the exact same variable names that you had before when you load the MX file, and there's no direct way to check what those names were in case you forgot them) – Szabolcs May 11 '12 at 13:26
  • @OleksandrR. I remember I did benchmark .mx.gz vs .mx for some large files, and the gzipped one wasn't faster (it was about the same). I don't remember the details though, and I don't remember the type of data. I know from experience that huge text files (plain ascii) load faster when gzipped, that's exactly why I tried the same with MX. Did you benchmark this? – Szabolcs May 11 '12 at 13:28
  • No, I didn't benchmark it (don't have time now). My comment was just motivated by the idea that the gzip step shouldn't incur significant CPU or memory overhead, and for big files it reduces the amount of data to be written. If you're disk-bound, that could be useful. Otherwise I'm sure you're right that it's no faster than uncompressed .mx. – Oleksandr R. May 11 '12 at 13:31
  • @OleksandrR. What if some special technique is used for reading the file like memory mapping? I don't have any experience with this. Once I played a little with it, and I didn't manage to beat MX's performance. – Szabolcs May 11 '12 at 13:32
  • I don't know whether memory mapping is used or not, but if it is I suppose it could speed up retrieval of specific expressions from the .mx file. If you want to implement memory mapped files yourself I think it would be necessary to know the in-memory layout of Mathematica expressions to get better performance than .mx can provide (by optimizing the serialization/deserialization). Personally I wouldn't be willing to attempt that unless I had access to the Mathematica kernel source code; the details you need are just too undocumented otherwise. – Oleksandr R. May 11 '12 at 13:54
  • @OleksandrR. I strongly doubt that anything we can write has a potential to beat .mx load time, for a simple reason that .mx files load directly into in-memory Mathematica structures, completely by-passing the main evaluator, while even the LibraryLink interface does not, AFAIK, provide all the interface necessary to efficiently create general Mathematica expressions, and assign them to variables in Mathematica workspace. – Leonid Shifrin May 11 '12 at 14:10
  • @LeonidShifrin yes, exactly. In the past I thought a little bit about how you could get a pointer to the memory referenced by packed arrays and dump this into a (memory-mapped) file to get disk-based arrays, but to make anything useful out of that you'd also need to know how Lists are represented, which is of course much more difficult. It was soon after I convinced myself that this couldn't be made to work that you made your nice post about doing it with .mx files instead. – Oleksandr R. May 11 '12 at 14:34
  • @OleksandrR. Of course, it would be very nice if we could have a supported interface (or, rather, a specification) for creation of .mx files (I mean, not from within Mathematica). This would allow one to write very fast converters from a given external format to .mx, which would load way faster (and save tons of memory as well) than whatever we can now write using the Import-Export framework. – Leonid Shifrin May 11 '12 at 14:41
  • @Leonid It's possible to get direct memory access to packed tensors with LibraryLink. If the data is in this format, the advantage of writing an import/exporter would be having a very fast file format that other software can read as well. PyTables is said to be very fast (in comparison Mathematica's HDF5 import/export is not that fast..). (I didn't need this so far though.) – Szabolcs May 11 '12 at 14:44
  • @Szabolcs I am aware of that, and I agree. But, this is not enough, and for general expression-building MathLink is used by LibraryLink, and things are both copied and sent over the link (albeit to the same process). I did not benchmark how much slower this is, but I would not expect miracles. And, with .mx files, these expressions are already assigned to variables / symbols - something which LibraryLink won't do for you. This is why I said "general Mathematica expressions". – Leonid Shifrin May 11 '12 at 14:50
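
A rough sketch for benchmarking the variants discussed in the comments above (plain MX, gzipped MX, and a Compressed string); the file names, variable names and test data are placeholders, and the results will depend on the data and the disk:

data = RandomReal[1, {10^6, 10}]; (* substitute your own data *)

AbsoluteTiming[Export["test.mx", data, "MX"];]                 (* plain MX *)
AbsoluteTiming[d1 = Import["test.mx"];]

AbsoluteTiming[Export["test.mx.gz", data];]                    (* gzipped MX, as mentioned above *)
AbsoluteTiming[d2 = Import["test.mx.gz"];]

AbsoluteTiming[Export["test.mz", Compress[data], "String"];]   (* cross-platform Compressed string *)
AbsoluteTiming[d3 = Uncompress[Import["test.mz", "String"]];]

FileByteCount /@ {"test.mx", "test.mx.gz", "test.mz"}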