
The Problem

I have a decently large amount of data that I'm performing a bunch of computation on. My typical M.O. in Mathematica is to use Table to generate a nested array to store my results, then refer back to specific columns for additional computation/plotting. Currently, generating the array takes ~40 min on a pretty beefy personal machine, using ParallelTable for the computation. While prototyping additional analysis I frequently encounter crashes or hangs that require me to restart the kernel and perform that 40 min of computation again (which is pretty frustrating).

Ideally, I would be able to export the output of that computation as some kind of file that I could then import into Mathematica instead of having to perform the computation repeatedly. I usually use .csv, but this data doesn't work well for that format because it has nested array elements which can vary in size (depending on the initial raw data).

The Question

Is there a file format or storage solution that would allow me to keep my computed data as a nested array so that it can be easily imported again? For example, MATLAB has the save command, which basically takes a snapshot of the runtime and makes recovering from a crash easy.

Alternatively, is there a better or more idiomatic solution for storing the results of my computation that I should be using?

EDIT:

As pointed out in the comments, there are at least three functions for saving Mathematica expressions: Save, DumpSave, and Put.
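A minimal sketch of the three, assuming the computed results live in a symbol `results` (the small Table here is a stand-in for the real computation, and the file names are placeholders):

```mathematica
results = Table[RandomReal[1, RandomInteger[{2, 5}]], {10}]; (* stand-in ragged data *)

(* Save: appends human-readable definitions of the symbol to a file; reload with Get *)
Save["results.m", results];

(* DumpSave: binary .mx format, fast and compact, but tied to the Mathematica version/platform *)
DumpSave["results.mx", results];

(* Put: writes the bare expression; Get then returns it as a value *)
Put[results, "results.wl"];
restored = Get["results.wl"];
```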

Timing the three options on my system indicates that DumpSave is the best option for me:

[Plot comparing the storage and retrieval times for Save, DumpSave, and Put]

Additionally, DumpSave results in a much smaller file size than Save (~700 MB vs. ~1800 MB).

It appears that the main bottleneck is retrieving the data, but that is still an order of magnitude faster than repeating the computation every time the kernel crashes.
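One subtlety worth noting (sketched here with a placeholder symbol `results`): Get applied to a DumpSave .mx file does not return the data as a value; it restores the saved symbols under their original names, which can make it look as though the import returned nothing.

```mathematica
DumpSave["results.mx", results];

(* in a fresh kernel: *)
Get["results.mx"];  (* may appear to return nothing useful... *)
results             (* ...but the symbol itself has been restored *)
```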

BesselFunct
    You can write the expression to a file using DumpSave, or Save. This question might help. – N.J.Evans Jul 13 '23 at 16:38
  • 1
    Wow, I feel silly for not even having looked in the Docs for Save. – BesselFunct Jul 13 '23 at 16:55
  • 1
    Try Put and Get as shown in the answer to How to export and import these data, a duplicate? – creidhne Jul 13 '23 at 17:01
  • Having tried it, Save is extremely slow for this application (317 s) and results in a 1.8 GB file with a 356 s Import time. DumpSave does a much better job, at 71 s, with a 638 MB file and a 5 s Import time. – BesselFunct Jul 13 '23 at 17:03
  • For this data, Put takes 193 s, and Get takes 350 s to import the object created by Put. So, for my application, DumpSave appears to be a clear winner. – BesselFunct Jul 13 '23 at 17:17
  • Actually, it appears that I can't Get or Import the output of DumpSave, which is very strange. It doesn't indicate that there's been an error; Get and Import just return empty arrays. – BesselFunct Jul 13 '23 at 17:35
  • I got Get to work with the DumpSave archive, and it takes ~354 s for a successful import. So DumpSave is still the winner, but the limit is still retrieving the data. – BesselFunct Jul 13 '23 at 17:45
  • Do you need to load the entire file at once? You could open a stream then just access the portions you need at any given time using SetStreamPosition. You might have to write the binary file yourself. – N.J.Evans Jul 14 '23 at 14:20
  • My data is a time series, and I frequently need to look at a specific metric that I've extracted (like peak position or FWHM) as a function of time, so I want to load the whole time series. I suppose if I transposed the data so that it was wide instead of tall, I could load only the metrics I was interested in (since they're currently the columns). I don't know if that would be more efficient though. – BesselFunct Jul 14 '23 at 14:53
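The wide-layout idea in the last comment could be sketched like this (the metric names are hypothetical; Export and Import support the binary "MX" format for arbitrary expressions, inferred from the .mx extension):

```mathematica
(* store each extracted metric (currently a column of results) in its own file... *)
peakPositions = results[[All, 1]];
fwhms = results[[All, 2]];
Export["peakPositions.mx", peakPositions];
Export["fwhms.mx", fwhms];

(* ...so a later session can load only the metric it needs *)
peaks = Import["peakPositions.mx"];
```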

0 Answers