I have a Dataset (non-numerical data) with a ByteCount of about 14 GB. When I export it to disk (SSD), the resulting ".m" or ".dat" file (I tried both) is about 3 GB, and the export process takes about seven minutes.

Is there a way to speed up the export process?

Alexey Popkov
mgamer
  • Without knowing more about the data you would like to export, it might be hard to tell... – Henrik Schumacher Sep 09 '17 at 10:58
  • @HenrikSchumacher: It is a LARGE Dataset. I can provide it via a link, but the data is also publicly available (JSON format) at: https://datenspende.algorithmwatch.org/data.html I'm doing research in this project. – mgamer Sep 09 '17 at 11:28
  • @mgamer what you should have done was to share the code you have used to import that public data. – rhermans Sep 09 '17 at 12:42
  • In my experience, a list of associations is a lot faster than a Dataset. And you can Query[ ] it just as well. – Gustavo Delfino Sep 13 '17 at 14:21

1 Answer

Based on my test below, DumpSave appears to be the fastest, with Export to an ".mx" file taking about the same time.

Albert Retey offers a very relevant comment to this answer, which I think deserves to be highlighted here:

There is an important difference between Export and DumpSave: while Export will write (just) the data to file so it can be re-imported with Import, DumpSave will store the definition for the symbol (here ds) and recreate that definition when the file is loaded with Get. It will overwrite previous definitions for that symbol and when loading data you will need to know which definition a file will restore. So I would suggest to use Export[_,_,"MX"] despite the fact that it is a bit slower due to the extra overhead of the export framework...
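To make the difference described in that comment concrete, here is a minimal sketch of both round trips. The symbol names data and restored and the file names are made up for illustration, and a small association stands in for the real Dataset.

```mathematica
(* Small stand-in for the real data; all names here are illustrative. *)
data = <|"a" -> Range[5], "b" -> "text"|>;

(* Export/Import: only the expression is written, and the caller
   decides which symbol receives it when the file is read back. *)
Export["roundtrip-export.mx", data, "MX"];
restored = Import["roundtrip-export.mx", "MX"];
restored === data

(* DumpSave/Get: the definition of the symbol data is written. *)
DumpSave["roundtrip-dumpsave.mx", data];
Clear[data];
Get["roundtrip-dumpsave.mx"];  (* re-creates the definition of data *)
data === restored
```

With the Get route, the file itself decides which symbol gets overwritten, so you have to remember what each file contains.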

Dummy data

ds = Dataset[
   Array[<|
      "Words" -> RandomWord["KnownWords", 100],
      "Country" -> RandomEntity["Country"],
      "Reals" -> RandomReal[{0, 1}, 10^6],
      "Integers" -> RandomInteger[{1, 10}, 10^6],
      "Image" -> RandomImage[1, {100, 100}, ColorSpace -> "RGB"]
      |> &, 20]];

This is only about 0.3 GB; I did not bother generating more.

UnitConvert[Quantity[N@ByteCount[ds], "Bytes"], "Gigabytes"]
(* Quantity[0.324982, "Gigabytes"] *)

Put performance

AbsoluteTiming[
 Put[ds, "ds.m"];
 UnitConvert[Quantity[N@FileByteCount["ds.m"], "Bytes"], "Gigabytes"]
 ]
(* {131.745, Quantity[0.5267, "Gigabytes"]} *)

BinaryWrite with BinarySerialize performance

AbsoluteTiming[
 file = CreateFile["PerformanceGoalSize.bin"];
 ow = OpenWrite[file, BinaryFormat -> True];
 BinaryWrite[ow, BinarySerialize[ds, PerformanceGoal -> "Size"]];
 UnitConvert[Quantity[N@FileByteCount[Close[ow]], "Bytes"], 
  "Gigabytes"]
 ]
(* {15.4984, Quantity[0.169483, "Gigabytes"]} *)

AbsoluteTiming[
 file = CreateFile["PerformanceGoalSpeed.bin"];
 ow = OpenWrite[file, BinaryFormat -> True];
 BinaryWrite[ow, BinarySerialize[ds, PerformanceGoal -> "Speed"]];
 UnitConvert[Quantity[N@FileByteCount[Close[ow]], "Bytes"], 
  "Gigabytes"]
 ]
(* {3.43482, Quantity[0.324826, "Gigabytes"]} *)
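For completeness, a file written this way is not read back with Import or Get. A minimal sketch of the reverse direction, assuming the file "PerformanceGoalSpeed.bin" written above still exists (the names bytes and ds2 are illustrative, and I have not timed this step):

```mathematica
(* Read the raw bytes back; BinaryReadList reads bytes by default. *)
bytes = ByteArray[BinaryReadList["PerformanceGoalSpeed.bin"]];

(* Rebuild the original expression from the serialized bytes. *)
ds2 = BinaryDeserialize[bytes];

(* Check that the round trip reproduced the Dataset. *)
ds2 === ds
```

BinaryReadList builds an intermediate list of integers, which costs memory for large files. In newer versions (11.3 and later), ReadByteArray should return the ByteArray directly.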

Export performance

AbsoluteTiming[
 Export["Export.mx", ds];
 UnitConvert[Quantity[N@FileByteCount["Export.mx"], "Bytes"], 
  "Gigabytes"]
 ]
(* {0.149372, Quantity[0.324832, "Gigabytes"]} *)

DumpSave performance

AbsoluteTiming[
 DumpSave[File["DumpSave.mx"], ds];
 UnitConvert[Quantity[N@FileByteCount["DumpSave.mx"], "Bytes"], 
  "Gigabytes"]
 ]
(* {0.142341, Quantity[0.324832, "Gigabytes"]} *)
rhermans
  • Thank you for this valuable hint. Up to now I haven't used BinaryWrite or DumpSave. Now I'm going to do so :-) – mgamer Sep 09 '17 at 12:24
  • Is Export["whatever.mx"] the same as DumpSave[File["whatever.mx"]]? – Michael Stern Sep 09 '17 at 12:29
  • @MichaelStern Export["Export.mx", ds] seems to take about the same time as DumpSave["DumpSave.mx", ds]. – rhermans Sep 09 '17 at 12:41
  • @MichaelStern: there is an important difference between Export and DumpSave: while Export will write (just) the data to file so it can be reimported with Import, DumpSave will store the definition for the symbol (here ds) and recreate that definition when the file is loaded with Get. It will overwrite previous definitions for that symbol and when loading data you will need to know which definition a file will restore. So I would suggest to use Export[_,_,"MX"] despite the fact that it is a bit slower due to the extra overhead of the export framework... – Albert Retey Sep 10 '17 at 08:16
  • @AlbertRetey A very good point! – Leonid Shifrin Sep 10 '17 at 14:34
  • Another important thing to consider (from the DumpSave documentation): "Files written by DumpSave can only be read on the same type of computer system on which they were written." I believe most people here know this, however. – sebhofer Sep 13 '17 at 13:06