
I have a complicated function that I need multiple times, so I want to memoize it and have the first evaluation done in parallel. Unlike in my example below, it's not a continuous function, so interpolation is not an option. (In fact, its values are also functions.) The naive approach clearly does not work, because the memoized value is then known only on the kernel it was evaluated on:

LaunchKernels[2];
f[x_] := f[x] = (Pause[3]; N[Sin[x]]); (*Expensive calculation*)
ParallelDo[f[x], {x, 3}];

ParallelEvaluate[AbsoluteTiming[f[1]]]

(* ==> {{3.000632, 0.841471}, {0.000024, 0.841471}} *)

I believe I found a workaround by doing something like this:

f[x_] := (Pause[3]; N[Sin[x]]); (*Expensive calculation - NO memoization*)
t = ParallelTable[f[x], {x, 3}];
Do[f[x] = t[[x]], {x, 3}];

Using SetSharedFunction[f] before ParallelDo also yields a non-optimal result:

{{0.012051, 0.841471}, {0.012202, 0.841471}}

0.01 s is still a long time to look up a value (see above, it should be < 1 ms). Is there something more elegant or do I have to keep it like this?

Edit: Just to be clear, the workaround above without Shared Functions works, it runs in parallel and the main kernel knows the values afterwards, but it strikes me as an ugly hack. I was wondering if there was an "official" solution.

Volker
    The problem with SetSharedFunction is that it forces f to be evaluated on the main kernel: this means that you will lose parallelization and that there is extra communication overhead. Would a solution where long evaluations are done on the subkernels, but retrieval of memoized values is done on the main kernel be useful for you? If the function is expensive enough, this should be useful. If it is not extremely expensive to calculate, the communication overhead might hurt more (as you noticed above where you got ~12 ms timings) – Szabolcs Feb 03 '12 at 18:17
  • @Szabolcs Sounds like a very meaningful suggestion. Why don't you make an answer out of it? – Leonid Shifrin Feb 03 '12 at 18:25
  • @Szabolcs I for one could definitely use something like that. – David Z Feb 03 '12 at 18:30
  • @Leonid and David I wrote an answer now. I got confused by the fact that undefined shared functions return Null on parallel kernels, and couldn't figure out why my initial code did not work. – Szabolcs Feb 03 '12 at 18:49
  • I updated my answer a bit. – Szabolcs Feb 04 '12 at 11:56

4 Answers


The problem with SetSharedFunction is that it forces f to be evaluated on the main kernel: this means that if you simply do

SetSharedFunction[f]

then you will lose parallelization (a timing of ParallelTable[f[x], {x, 3}] will give about 9 seconds).

This property of SetSharedFunction is not, in my opinion, clear from the documentation. I learned about it from this answer. It is also not clear whether the behaviour is the same in version 7 (can someone test? I tested my answer on version 8 only).

We can, however, store the memoized values on the main kernel while evaluating the expensive computations on the parallel kernels. Here's one way to do it:

f[x_] :=
 With[{result = g[x]},
  If[result === Null,
   g[x] = (Pause[3]; N[Sin[x]]),
   result
  ]
 ]

SetSharedFunction[g]

Here I used the special property of shared functions that they return Null on the parallel kernels when they have no value for a given argument.
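A quick way to see this property in action (a sketch; h is a hypothetical fresh symbol with no definitions of its own):

SetSharedFunction[h]; (* share h without giving it any definitions *)

ParallelEvaluate[h[42]] (* on the subkernels the call returns Null *)

h[42] (* on the main kernel the same expression just stays unevaluated as h[42] *)

This asymmetry between subkernels and the main kernel is also what the Head[result] =!= g refinement in the comments below guards against.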

The first time we run this, we get a 6 s timing, as expected:

AbsoluteTiming@ParallelTable[f[x], {x, 3}]

(* ==> {6.0533462, {0.841471, 0.909297, 0.14112}} *)

The second time it will be very fast:

AbsoluteTiming@ParallelTable[f[x], {x, 3}]

(* ==> {0.0260015, {0.841471, 0.909297, 0.14112}} *)

However, as you noticed, evaluating f on the parallel kernels is a bit slow; on the main kernel it's much faster. This is due to communication overhead: every time f is evaluated or changed on a subkernel, it needs to communicate with the main kernel. The slowdown does not really matter if f is really expensive (like the 3 seconds in your toy example), but it can be significant if f is very fast to execute (comparable in time to the apparent ~10 ms communication overhead):

ParallelTable[AbsoluteTiming@f[x], {x, 3}]

(* ==> {{0.0100006, 0.841471}, {0.0110006, 0.909297}, {0.0110007, 0.14112}} *)

Table[AbsoluteTiming@f[x], {x, 3}]

(* ==> {{0., 0.841471}, {0., 0.909297}, {0., 0.14112}} *)


Finally a note about benchmarking: in general, measuring very short times like the 10 ms here should be done with care. On older versions of Windows, the timer resolution is only 15 ms. On Windows 7, the resolution is much better. These timings are from Windows 7.


Update:

Based on @Leonid's suggestion in the comments and @Volker's original solution, we can combine subkernel and main-kernel caching like this:

f[x_] := With[{result = g[x]}, (f[x] = result) /; result =!= Null];
f[x_] := g[x] = f[x] = (Pause[3]; N[Sin[x]])

Packaged up solution

We can bundle all these ideas into a single memoization function (see the code at the end of the post).

Here is an example use:

Clear[f]
f[x_] := (Pause[3]; Sin[x])

AbsoluteTiming@ParallelTable[AbsoluteTiming@pm[f][Mod[x, 3]], {x, 15}]

(* ==> {6.0683471, {{3.0181726, Sin[1]}, {0.0110007, Sin[2]}, {3.0181726, 0}, {0., Sin[1]}, {3.0191727, Sin[2]}, {3.0181726, 0}, {0.0110007, Sin[1]}, {0., Sin[2]}, {0., 0}, {0., Sin[1]}, {0., Sin[2]}, {0., 0}, {0., Sin[1]}, {0., Sin[2]}, {0., 0}}} *)

The function simply needs to be called using pm[f][x] instead of f[x]. Memoized values are associated with f as UpValues, so I thought automatic distribution of definitions should take care of both synchronizing memoized values and clearing them when necessary. Unfortunately this mechanism doesn't seem to be reliable (sometimes it works, sometimes it doesn't), so I provided a function clearParallelCache[f] that will clear all memoized values on all kernels.

Caching happens at two levels: on the main kernel level and on subkernels. Computed or main-kernel-cached values are copied to the subkernels as soon as possible. This is visible in the timings of the example above. Sometimes retrieving cached values takes 10 ms, but eventually it becomes very fast. Note that it might happen that the two kernels will each compute the same value (if they start computing it at the same time). This can sometimes be avoided by using a different setting for the Method option of Parallelize (depending on the structure of the input data).

For simplicity, I restricted pm to only work with functions that take a single numerical argument (and return anything). This is to avoid having to deal with more complex conditional definitions (especially cases when the function won't evaluate for certain argument types). It could safely be changed to accept e.g. a vector or matrix of values instead.


The code

pm[f_Symbol][x_?NumericQ] :=
 With[{result = memo[f, x]},
  If[result === Null,
   With[{value = valueOnMain[f, x]},
    If[value === Null,
     f /: memo[f, x] = setValueOnMain[f, x, f[x]],
     f /: memo[f, x] = value]
    ],
   result]
  ]

memo[___] := Null

DistributeDefinitions[memo];

valueOnMain[f_, x_] := memo[f, x]

setValueOnMain[f_, x_, fx_] := f /: memo[f, x] = fx

SetSharedFunction[valueOnMain, setValueOnMain]

clearParallelCache[f_Symbol] := (UpValues[f] = {}; ParallelEvaluate[UpValues[f] = {}];)

Szabolcs
  • Can't you then just distribute the definition of g instead of making it a shared function? But actually it might be sufficient if the main kernel knows the values after the hard part of the computation. – Volker Feb 03 '12 at 18:58
  • @Volker Distributing definitions is not enough (even if DistributeDefinitions were directly available on subkernels---it is not); we'd need to merge the results from all subkernels. Generally, we have a tradeoff here: we can either synchronize the memoized values often, like I did, and accept the communication overhead, or synchronize them only occasionally, and accept that sometimes two different kernels will compute the slow-to-evaluate function for the same argument. – Szabolcs Feb 03 '12 at 19:03
  • Maybe the best solution is to mix my approach and yours: after the ParallelTable has finished, distribute all memoized values to all kernels, and in subsequent computations only access the main kernel from subkernels when really needed. This is what you actually mean, right? – Szabolcs Feb 03 '12 at 19:05
  • @Volker I must leave for an hour, but I'll try to write you a generalized and convenient-to-use solution afterwards. – Szabolcs Feb 03 '12 at 19:16
  • @Szabolcs Here is a version which solves both problems: ClearAll[f, g]; f[x_] := With[{result = g[x]}, f[x] = result /; result =!= Null];f[x_] := g[x] = f[x] = (Pause[3]; N[Sin[x]]);SetSharedFunction[g]. The f on the sub-kernel will use its memoized value if that exists. If not, it will "ask" g (so only in this case will there be communication overhead). This has the advantage that there is no need to explicitly distribute definitions after the fact (at the cost of being a tiny bit slower until all kernels "learn" a given value of f, but this is really a small cost IMO). – Leonid Shifrin Feb 03 '12 at 19:43
  • Have you seen my suggestion? I think it is simpler (to read, write and understand) than what you finally eneded up with, while conceptually the same idea of using a two-level cache. I included it in comments rather than a separate answer, having in mind that you could use it in yours if you wish. – Leonid Shifrin Feb 04 '12 at 12:06
  • @Leonid I have seen it last night, and I concluded that it doesn't work in some situations. Then I thought it's better to sleep on it. I must check it again. I am looking at it right now and will update. – Szabolcs Feb 04 '12 at 12:08
  • @Leonid Your version, as you posted it here, simply does not work correctly on this machine I'm testing on. It does a lot of extra calculations and takes 9 seconds (on this dual-core). However, if I add a few debug Prints, then it does work. It's just driving me crazy. It's like quantum mechanics: it behaves differently when I look at it than when I don't. I need a break now, and if I still can't debug it this afternoon, I'll mail you. – Szabolcs Feb 04 '12 at 12:44
  • @Leonid You forgot some parens ... it should be (f[x] = result) /; result =!= Null ... – Szabolcs Feb 04 '12 at 12:47
  • Looks like you are right about the parens, sorry for this mess. The reason it looked like working was that only the last definition remained, since condition was not interpreted as a part of the pattern. So it did not produce nonsense, just not doing what I intended. But with the parens, it seems to work correctly. – Leonid Shifrin Feb 04 '12 at 12:59
  • @Szabolcs Wow, thanks for the elaborate answer! Great work everybody – Volker Feb 04 '12 at 16:37
  • This variant ClearAll[f, g]; f[x_] := With[{result = g[x]}, (f[x] = result) /; result =!= Null && Head[result] =!= g]; f[x_] := g[x] = f[x] = (Pause[3]; N[Sin[x]]); SetSharedFunction[g]; works both for ParallelMap[f, Range[10]] and Map[f, Range[10]]. The code without && Head[result] =!= g does not work if you evaluate f[1] in the main kernel first. The cause lies in a feature of WM: consider an irreducible expression foo[1,2,3]; evaluating foo[1,2,3] in a parallel kernel results in Null, while evaluating it in the main kernel results in foo[1,2,3]. – Artem Vorozhtsov May 06 '23 at 11:18

The problem with a normal shared memoized function definition

f[x_] := f[x] = (Pause[3]; N[Sin[x]])

is indeed that any evaluation of f[n] on a parallel subkernel causes the rhs f[x] = (Pause[3]; N[Sin[x]]) to be sent back and evaluated on the master kernel.

There is an undocumented feature that if such a definition is made on a subkernel, the rhs will also be flagged for evaluation on the subkernel. So you could do instead

SetSharedFunction[f]; (* no definitions on master *)
ParallelEvaluate[f[x_] := f[x] = (Pause[3]; N[Sin[x]]), First[Kernels[]]]

If you look at the result, you see

?f
f[x_]:=Parallel`Developer`SendBack[f[x]=(Pause[3];N[Sin[x]])]

Knowing this you can get the desired effect with a definition on the master kernel:

g[x_] := Parallel`Developer`SendBack[g[x] = (Pause[3]; N[Sin[x]])]
SetSharedFunction[g]

and then

In[9]:= AbsoluteTiming[ParallelTable[g[Mod[i, 4]], {i, 40}]]

Out[9]= {3.13635, {0.841471, 0.909297, 0.14112, 0., 0.841471, 
  0.909297, 0.14112, 0., 0.841471, 0.909297, 0.14112, 0., 0.841471, 
  0.909297, 0.14112, 0., 0.841471, 0.909297, 0.14112, 0., 0.841471, 
  0.909297, 0.14112, 0., 0.841471, 0.909297, 0.14112, 0., 0.841471, 
  0.909297, 0.14112, 0., 0.841471, 0.909297, 0.14112, 0., 0.841471, 
  0.909297, 0.14112, 0.}}

?g
g[0]=0.

g[1]=0.841471

g[2]=0.909297

g[3]=0.14112

g[x_]:=Parallel`Developer`SendBack[g[x]=(Pause[3];N[Sin[x]])]

Each evaluation of g[n] on a subkernel first sends g[n] to the master. If there is already a specific value for that n, it is sent back; otherwise the expression g[x] = (Pause[3]; N[Sin[x]]) is sent back. The subkernel then evaluates its rhs and makes the definition g[n]=val, which definition is in turn sent back to the master and performed there, for all kernels to see.

Roman

I realize this is an old question, but I recently had the same issue and came across (link to google groups question) what I think is a cleaner solution. I don't want to take credit for coming up with it, but I thought it would be helpful to add it to this site.

I'll use a simple example function to demonstrate.

 f[x_] := f[x] = x
 ParallelDo[f[i], {i, 1, 3}]

The problem is that, while f[1], f[2], and f[3] have now been computed, they are stored on the subkernels and not on the main one. But we can access the values using ParallelEvaluate:

 DownValues[f] = DeleteDuplicates[Flatten[ParallelEvaluate[DownValues[f]]]]

DeleteDuplicates is needed because each subkernel includes the original function definition as a down-value.
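If the subkernels should also see the merged values in later parallel computations, the collected definitions can be pushed back out afterwards. A sketch (the merge line is as above; the redistribution step is my addition):

DownValues[f] = DeleteDuplicates[Flatten[ParallelEvaluate[DownValues[f]]]];
DistributeDefinitions[f]; (* push the merged definitions back to all subkernels *)

Note that this synchronizes only at the moment it is run; values memoized on subkernels after this point again stay local until the next merge.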

Aaron
  • Welcome to Mathematica.SE! I suggest the following: 1) As you receive help, try to give it too, by answering questions in your area of expertise. 2) Take the tour! 3) When you see good questions and answers, vote them up by clicking the gray triangles, because the credibility of the system is based on the reputation gained by users sharing their knowledge. Also, please remember to accept the answer, if any, that solves your problem, by clicking the checkmark sign! – Michael E2 Jan 13 '16 at 23:17

Here is an awkward but potentially useful method:

SetAttributes[newSet, HoldFirst]
newSet[lhs_, rhs : Except[_EvaluationObject]] := Set[lhs, rhs]

LaunchKernels[3];

f[x_] := f[x] ~newSet~ ParallelSubmit@Sum[N@Sin[i], {i, 1000000 x}]

WaitAll@Table[f[x], {x, {1, 2, 3}}] // AbsoluteTiming
WaitAll@Table[f[x], {x, {3, 2, 1}}] // AbsoluteTiming

{2.7351565, {-0.11711, -0.103631, 0.0387313}}

{0., {0.0387313, -0.103631, -0.11711}}

Depending on your application this may be completely useless, since nothing not already memoized will evaluate until you use WaitAll on explicit output. On the other hand it performs well, and you can do without the Parallel* version of functions, which at times may be useful.
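To illustrate the caveat (a sketch, assuming an argument such as 4 that has not been memoized yet):

f[4] (* not yet a number: the result still contains an EvaluationObject *)

WaitAll[f[4]] (* forces the computation and performs the Set; later calls hit the memoized value *)

So this approach is lazy: the parallel work is only collected, and memoization only happens, at the point where WaitAll is applied to the output.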

Mr.Wizard
  • See my comment above on what's wrong with this. In version 8 you lose all parallelization (just time the ParallelDo --- you'll get 9 seconds). EDIT I just realized you have v7, is it different there? – Szabolcs Feb 03 '12 at 18:19
  • @Szabolcs please tell me if you think this revised answer is of value. (I already voted for your answer.) – Mr.Wizard Feb 03 '12 at 19:55
  • I was lucky enough to have witnessed the moment when newSet was not yet infixed. It was, however, hopelessly brief. +1. – Leonid Shifrin Feb 03 '12 at 19:58
  • @Leonid LOL -- I actually started with it infixed, then thought it would just stir up trouble, then rethought that it looked more like Set if it was infix. Maybe I should change it back again... – Mr.Wizard Feb 03 '12 at 20:04