
Mathematica has numerous built-in data for science and math. Every first call to a data function in a new Mathematica session will "re-initialize indices" (download the data). Is there a way to store this data locally to save time on the next data call after Mathematica is restarted?


======== Edit: further thoughts with Leonid's answer ========

After Leonid posted his answer, I decided to update this question, both the title and the content. I would have asked about this anyway, so I may as well do it here. The point is not whether the data were cached, but how to speed up the load process; I had confused caching with loading. I do not want to take "best answer" away from Spartacus, because he answered exactly what I originally asked, thanks. But thank you very much, Leonid, for posting your answer too. +1 to both of you.

Vitaliy Kaurov

2 Answers


Please note that parts of the explanations and the initialization code are shown together with the main function code, as a single large code block. I would appreciate any help with this; I am quite confused, and am perhaps overlooking something obvious.


Preamble

While the question has already been answered, the delays in loading built-in data are, to my mind, a pretty serious problem. This question prompted me to write a tiny framework which considerably speeds up the loading of pretty much arbitrary built-in data. The techniques involved are a mix of memoization, some meta-programming, the Block trick, and .mx files (DumpSave - Get).

The code

The first ingredient is the symbol-cloning functionality, which is described here (the function clone and related). I will reproduce this here to have a self-contained answer:

Clear[GlobalProperties];
GlobalProperties[] :=
  {OwnValues, DownValues, SubValues, UpValues, NValues, FormatValues, 
      Options, DefaultValues, Attributes};


Clear[unique];
unique[sym_] :=
 ToExpression[
    ToString[Unique[sym]] <> 
       StringReplace[StringJoin[ToString /@ Date[]], "." :> ""]];


Attributes[clone] = {HoldAll};
clone[s_Symbol, new_Symbol: Null] :=
  With[{clone = If[new === Null, unique[Unevaluated[s]], ClearAll[new]; new],
        sopts = Options[Unevaluated[s]]},
     With[{setProp = (#[clone] = (#[s] /. HoldPattern[s] :> clone)) &},
        Map[setProp, DeleteCases[GlobalProperties[], Options]];
        If[sopts =!= {}, Options[clone] = (sopts /. HoldPattern[s] :> clone)];
        HoldPattern[s] :> clone]]

Here are the functions of the "framework". The following is just a helper function to create a file name string:

ClearAll[makeFileName];
Options[makeFileName] = {
   TargetFileName :> Automatic,
   TargetDirectory :> $TemporaryDirectory
 };
makeFileName[dataFunction_Symbol, opts : OptionsPattern[]] :=
  With[{dir = OptionValue[TargetDirectory], fname = OptionValue[TargetFileName]},
      FileNameJoin[{
          dir, 
          If[fname =!= Automatic, 
             fname, 
             (* else *)
             "MemoizedData_" <> ToString[dataFunction] <> ".mx"
          ]}
      ]];

This is the main function. It creates a dynamic environment (it returns a closure), inside which the values of a given data-holding function are memoized. More details below.

ClearAll[generateMemoEnvironment];
SetAttributes[generateMemoEnvironment, HoldFirst];
Options[generateMemoEnvironment] = {
    StorageSymbol :> memoData,
    Sequence @@ Options[makeFileName]
 };
generateMemoEnvironment[env_Symbol, dataFunction_, opts : OptionsPattern[]] :=
   With[{memoSymbol = dataFunction /. clone[dataFunction],
      storageSymbol = OptionValue[StorageSymbol]
   },
   With[{fullname = makeFileName[ dataFunction, opts]},
       storageSymbol /: Save[storageSymbol[dataFunction]] :=
          DumpSave[fullname, {env, memoSymbol, storageSymbol}];
   ];
   env = 
     Function[
        code,
        Block[{dataFunction},
           dataFunction[args___] :=  storageSymbol[dataFunction][args];
           storageSymbol[dataFunction][args___] :=
               storageSymbol[dataFunction][args] = memoSymbol [args];
           code
        ],
        HoldAll]
];

How it works

What happens here is that, when we call generateMemoEnvironment, a clone of the given symbol (e.g. ChemicalData) is created first. A clone is a symbol with identical global properties: try f[x_]:=x;f[x_,y_]:=(x+y);clone[f,g] and look at the definitions for g to see what it does. The tricky point is that the line dataFunction /. clone[dataFunction] evaluates dataFunction, which lets it auto-load first; otherwise, the clone would be empty. The main idea is that, having cloned the symbol, we can use Block on the main (original) symbol and temporarily make it memoizing inside Block. Memoization, however, goes through an intermediate symbol: the values are computed by the clone, memoSymbol. The main symbol, which is a kind of "handle" for everything, is given by the StorageSymbol option (I made it default to memoData); it can be the same for all types of data.
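The idea can be seen in isolation with a minimal sketch that does not use the framework at all. The names slowData, slowDataCopy, cache and withMemoSlowData are hypothetical and exist only for this illustration; slowData merely imitates a slow data function with Pause, and the "clone" is made by hand for DownValues only.

```mathematica
(* Hypothetical stand-in for a slow built-in data function *)
ClearAll[slowData, slowDataCopy, cache, withMemoSlowData];
slowData[x_] := (Pause[0.2]; x^2);

(* Hand-made clone: copy the definitions onto another symbol *)
DownValues[slowDataCopy] =
  DownValues[slowData] /. HoldPattern[slowData] :> slowDataCopy;

(* Inside Block, slowData temporarily forwards to a memoizing cache;
   the cache computes missing values through the clone *)
SetAttributes[withMemoSlowData, HoldAll];
withMemoSlowData[code_] :=
  Block[{slowData},
    slowData[args___] := cache[args];
    cache[args___] := cache[args] = slowDataCopy[args];
    code];

(* The first pass pays for five Pause[0.2] calls, the second reads the cache *)
withMemoSlowData[slowData /@ Range[5]] // AbsoluteTiming
withMemoSlowData[slowData /@ Range[5]] // AbsoluteTiming

(* Outside the environment, slowData still has only its original definition *)
DownValues[slowData]
```

AbsoluteTiming is used here instead of Timing because Pause does not consume CPU time. The memoized values live on cache, not on slowData, which is why they survive after Block has restored the original symbol.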

Illustration and workflow

Let me now illustrate how to use this beast. Assume that you loaded the above code on a fresh kernel. Now, we create our dynamic environment:

generateMemoEnvironment[withMemoChemicalData, ChemicalData];

We can check that now the symbol withMemoChemicalData holds a pure function (closure):

 withMemoChemicalData

 (*
  ==>  Function[code$,Block[{ChemicalData},ChemicalData[args$___]:=
       memoData[ChemicalData][args$];memoData[ChemicalData][args$___]:=
      memoData[ChemicalData][args$]=ChemicalData$568201222222151718750[args$];
      code$],HoldAll]
 *)

Now, we execute some code within it (twice):

withMemoChemicalData[
   res1 = ChemicalData[#,"MolecularWeight"]&/@ChemicalData[]
];//Timing

(*
  ==> {20.375,Null}
*)

and again:

withMemoChemicalData[
    res2 = ChemicalData[#,"MolecularWeight"]&/@ChemicalData[]
];//Timing

(*
 ==> {0.125,Null}
*)

The timing difference reflects memoization at work. Note that you can execute arbitrary code involving ChemicalData inside the environment, and memoization will work!

res1===res2

(*
 ==> True
*)

At the same time, because we used Block, the function ChemicalData did not receive any global definitions (which is easy to check), which was one of the goals. This means that our local modifications inside withMemoChemicalData present no danger whatsoever to the rest of the system, or to other code which may be using ChemicalData outside our environment.

To save the memoized values, you just call Save (I may get flamed for overloading it to work with a single argument, but that can be easily avoided if so desired):

Save[memoData[ChemicalData]];//Timing

(*
  ==> {0.454,Null}
*)

Now, here comes the main point: once you have saved it, you no longer need generateMemoEnvironment; at most, you need makeFileName to construct the file name automatically. Let us now quit the kernel:

Quit

Now, we execute on a fresh kernel:

Get[makeFileName[ChemicalData]]

and we are ready to go:

withMemoChemicalData[
   res1 = ChemicalData[#,"MolecularWeight"]&/@ChemicalData[]
];//Timing


(*
 ==> {0.125,Null}
*)

Length[res1]

(*
 ==> 43987
*)

Moreover, you can now keep calling properties you did not call before, and those will also be memoized automatically; just wrap your code in withMemoChemicalData. All you have to do is call Save[memoData[ChemicalData]] periodically, to update the file with the newly memoized definitions.
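The save-and-restore step relies only on DumpSave and Get, and can be sketched on its own. The symbol toyCache and the file name toyCache.mx are hypothetical, and ClearAll stands in for restarting the kernel.

```mathematica
(* Hypothetical symbol holding a few memoized values *)
ClearAll[toyCache];
toyCache[1] = 1;
toyCache[2] = 4;

(* Write the definitions of toyCache to a binary .mx file *)
file = FileNameJoin[{$TemporaryDirectory, "toyCache.mx"}];
DumpSave[file, toyCache];

(* Simulate a fresh kernel, then restore the definitions *)
ClearAll[toyCache];
Get[file];
toyCache[2]
```

Keep in mind that .mx files are tied to the kind of system that wrote them, so a file saved this way should be treated as a local cache rather than something to share between machines.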

Summary

I presented a tiny framework which may allow hundred-fold speed-ups when working with built-in data. The main ideas involved are dynamic environments, metaprogramming, memoization, encapsulation, the Block trick, and the use of .mx files to back up memoized values.

Comments and suggestions welcome!

Leonid Shifrin
  • Bug reported here: http://meta.stackexchange.com/questions/121004/signs-prevent-code-blocks-from-ending-on-mathjax-enabled-se-sites – Szabolcs Feb 02 '12 at 00:31
  • @Szabolcs Thanks! It crossed my mind that the dollar sign might be a problem, but since there is no way for me to not use it, I did not test it. – Leonid Shifrin Feb 02 '12 at 00:37
  • Perhaps it's best to leave the post in this broken state for a while so the SE folks can track the problem down and fix it. – Szabolcs Feb 02 '12 at 00:39
  • @Szabolcs Yes, I also think so. This is actually the first time something like that happened in my practice. Did you encounter similar effects before? – Leonid Shifrin Feb 02 '12 at 00:40
  • No, I have never seen this ... but I don't use $-variables so often – Szabolcs Feb 02 '12 at 00:41
  • @Szabolcs I do sometimes, but I used them e.g. in the post on file-backed lists, and things were fine. Here, I had to use it because I wanted to use $TemporaryDirectory. – Leonid Shifrin Feb 02 '12 at 00:44
  • Great stuff @LeonidShifrin - I updated my question to match your answer ;-) – Vitaliy Kaurov Feb 02 '12 at 01:31
  • @Vitaliy Wow, when a question is updated to match the late-to-the-party answer, that's something! Actually, I think both our answers are quite complementary, and together address the problem comprehensively. I am wondering how useful the stuff I described can actually be. Right now it looks like quite a lot (because you also get huge speed-ups on subsequent calls, even when all data is loaded and indices initialized). But only many uses in practical applications will tell. – Leonid Shifrin Feb 02 '12 at 01:40
  • What is the point of unique? Isn't Unique already guaranteed to produce an unique identifier? And if really necessary, wouldn't it make more sense to add the date to the string first, and then pass that to Unique, in order to avoid creating a symbol that is never again referenced? (That is, unique[sym_Symbol]:=Unique[ToString[sym]<>yourDateCode<>"$"]) – celtschk Jun 01 '12 at 07:10
  • @celtschk Unique only guarantees the uniqueness of the generated symbol within a single Mathematica session. unique generates symbols which, for all practical purposes, will be unique across multiple Mathematica sessions. As for your last comment, I don't quite follow: the way unique is coded, it does not produce any intermediate symbols, and those which it produces are referenced in the code for clone, and then other functions which use clone. – Leonid Shifrin Jun 01 '12 at 08:42
  • @celtschk Ok, I see now - your second point is valid. I do indeed create a symbol which is not referenced before. Will change that soon. Thanks. – Leonid Shifrin Jun 01 '12 at 08:55
  • Thanks for the explanation of unique. – celtschk Jun 01 '12 at 09:21
  • Somehow I missed this post. Wow! I don't have time to try it but I trust you. Once again you take everything to another level, Mathematica Professor. :-) – Mr.Wizard Jun 12 '12 at 14:44
  • @Mr.Wizard Thanks :). I later found out that for FinancialData one needs some additional steps since it becomes Locked after first auto-load. Not sure if adding the work-around for it to the post is a good idea though - I guess it was made Locked for a reason... – Leonid Shifrin Jun 12 '12 at 14:53
  • @Leonid: impressive, as usual. I also think it could be really useful for practical use at least for rather 'static' data like ChemicalData. The main usage of FinancialData is to get data for time ranges and the caching of these might eat up a lot of memory and thus should probably be used with care. Might be true for other data functions like WeatherData as well. (I don't believe that FinancialData is locked because the developers did expect you to come along with this answer, though). It leaves a big question mark about why these built ins aren't more efficient in the first place... – Albert Retey Jun 12 '12 at 22:16
  • @Albert Thanks :). "It leaves a big question mark about why these builtins aren't more efficient in the first place" - totally agree. It looks like an oversight to me. Also, this is probably one of the cases where the main bottleneck is in the symbolic top-level Mathematica code. As for caching large data, I agree. – Leonid Shifrin Jun 12 '12 at 22:21

Initializing is not the same as downloading:

(image: Mathematica graphics)

I believe you are witnessing the data being unpacked for use.

Mr.Wizard