
I'm trying to reduce the computation time of an MCMC simulation. Essentially I have a set of particles performing a random walk in a periodic random potential. The particles are independent, and every 100 time steps I perform some checks and modify the state of some of the particles.

This seemed to me a perfect case for parallelization, but whatever I try only makes things slightly worse.

Here, in brief, is what I'm doing.

First I define my random potential function U[x] as an interpolation over a simple discrete-time random walk:

rw[L_] := Accumulate[RandomVariate[NormalDistribution[0, 1], L]]
myrw = rw[100];
myrw = Rescale[myrw];
ifun = Interpolation[
   Transpose[{Range[1, 200], Join[Reverse@myrw, myrw]}], 
   PeriodicInterpolation -> True];
U[x_] := ifun[x];
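
(Not part of the timing tests, just to illustrate the setup: the resulting potential can be plotted over one period.)

Plot[U[x], {x, 1, 200}]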

Here are the two functions needed to perform a single time step of evolution of a particle on my potential (Metropolis–Hastings algorithm):

p[x_, prop_, T_] := Exp[-(U[prop] - U[x])/T];
MCMCEvo[x_, T_] :=
 Module[{prop},
  prop = RandomVariate[NormalDistribution[x, 0.1]];
  Return[If[RandomReal[] < p[x, prop, T], prop, x]]
  ]
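
For reference (not part of the benchmarks), a single particle's 100-step trajectory is generated with the same NestList call that appears inside PopulationEvolve below:

(* evolve one particle for 100 Metropolis steps at T = 0.1 *)
traj = NestList[MCMCEvo[#, 0.1] &, RandomReal[{1, 100}], 100];
ListLinePlot[traj]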

Then I have a function that checks the state of each particle and, with some probability, modifies it, and then performs 100 time steps of evolution for the whole population. It returns the state of the particles after this 100-step evolution.

PopulationEvolve[pop_, pdiv_, plife_] := Module[{newpop},
  (* I leave this SomeFunctionOfTheStates here for the sake of completeness,
     but commenting it out and setting newpop = pop doesn't change the timings *)
  newpop = SomeFunctionOfTheStates[#, pdiv, plife] & /@ pop;
  Return[
   (Join[{#[[1]], #[[2]]},
      {Mean[U /@ #], Last[#]} &[
       NestList[MCMCEvo[#, 0.1] &, If[#[[2]] == 0, RandomReal[{0, 100}], #[[4]]], 100]]]) & /@ newpop]
  ]
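
Note that SomeFunctionOfTheStates is not defined in this post; to evaluate PopulationEvolve as written you need some stand-in for it. A trivial hypothetical placeholder (which, as noted in the comment above, doesn't change the timings) is

(* hypothetical placeholder: leave the particle state unchanged *)
SomeFunctionOfTheStates[state_, pdiv_, plife_] := state;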

Then I generate two populations, one with 100 elements and one with 400, to perform some tests:

pop100 = Table[{1, 0, 0, RandomReal[{1, 100}]}, {i, 100}];
pop400 = Table[{1, 0, 0, RandomReal[{1, 100}]}, {i, 400}];

and compare parallelized and unparallelized calculations (4-core i5, 2.6 GHz):

AbsoluteTiming[PopulationEvolve[pop400, 0.8, 1]][[1]]
AbsoluteTiming[Table[PopulationEvolve[pop100, 0.8, 1], {i, 4}]][[1]]
AbsoluteTiming[ParallelTable[PopulationEvolve[pop100, 0.8, 1], {i, 4}]][[1]]
0.952207
0.960080
1.088009

and these proportions don't change even if I make PopulationEvolve perform 1000 time steps instead of 100. In principle I would expect the ratios to shift in favor of the parallelized version, since the amount of data exchanged between the kernels stays the same while the length of the computation grows.

(*With 1000 time steps*)
AbsoluteTiming[PopulationEvolve[pop400, 0.8, 1]][[1]]
AbsoluteTiming[Table[PopulationEvolve[pop100, 0.8, 1], {i, 4}]][[1]]
AbsoluteTiming[ParallelTable[PopulationEvolve[pop100, 0.8, 1], {i, 4}]][[1]]
9.723611
9.816607
10.676737

I'm not very experienced with parallel calculations, but until now, once I was sure there were no interactions among the parallel computations, the only bottlenecks I've ever found came from passing too much data back and forth between the kernels, which doesn't seem to be the case in this example.

What am I missing here and what would be a good way to parallelize this?

CupiDio
  • I assume you've checked kernel count, and used DistributeDefinitions if needed so all kernels "know" about the functions you've defined? – ciao Feb 27 '14 at 12:02
  • @rasher I assumed ParallelTable did that automatically, but I tried distributing them explicitly and nothing changes. I'm monitoring kernel activity with Parallel Kernel Status and it is even showing a factor-3.5 speedup, which is not at all reflected by the timings... – CupiDio Feb 27 '14 at 12:07
  • If I have time when at real machines, I'll give your code a try, but as a rule I don't "play" on them. However, be sure that you'll get answers soon, there are a few heavies here with parallel/HPC experience that post often. – ciao Feb 27 '14 at 12:25
  • I get timings 1, 1, 0.68 on a 4-core i7 with 4 subkernels (not 8 that would be launched by default), I'll take a detailed look a few hours later. – Szabolcs Feb 27 '14 at 15:19
  • I've tried with 2 subkernels (I have 2 physical cores) and timings are exactly the same as with 4. Also I get analogous timings (a bit lower but with the same proportions) on a different 4-core Linux machine (tests from the question are performed on a Mac). I can give you some SystemInformation[] outputs if those can help. – CupiDio Feb 27 '14 at 15:47
  • @CupiDio No need. I can reproduce the problem, and it looks very weird indeed. It turns out that evaluating your function on a single subkernel and measuring the time on the subkernel takes almost twice as long as evaluating it on the main kernel. I see no reason why this should be so. Example: ParallelEvaluate[ AbsoluteTiming[PopulationEvolve[pop400, 0.8, 1]][[1]], First@Kernels[]] takes 1.95 s while AbsoluteTiming[PopulationEvolve[pop400, 0.8, 1]][[1]] takes 1.06 s. Note to anyone trying this: ParallelEvaluate does not automatically DistributeDefinitions! – Szabolcs Feb 27 '14 at 18:23
  • This reminds me of this weirdness: http://mathematica.stackexchange.com/a/25268/12 In some cases evaluating things on a subkernel might be much slower than evaluating the same on the main kernel, for reasons that are a complete mystery to me. – Szabolcs Feb 27 '14 at 18:24
  • @CupiDio What I can say is that there's nothing you are doing wrong here, at least I cannot see anything. My suspicion is that there's some part of your code that just evaluates slower on subkernels (even when only one subkernel is used), just like in the post about Rule I linked to above. I find this a bit disturbing ... You'd need to take your code apart and figure out which part is slow (or just figure out if my suspicion is true). You'd try subexpressions like RandomVariate, InterpolatingFunction, etc. and try each on its own. If you identify the offending part you may be ... – Szabolcs Feb 27 '14 at 18:41
  • ... able to replace it with something more efficient. I don't want to do all this because it's a lot of time and a lot of work. You might not want to do it either, depending on how useful this would be to you ... it's up to you. – Szabolcs Feb 27 '14 at 18:42
  • Okay, I believe I identified the culprit: it's calling interpolating functions. It's just slow in subkernels, very similar to the problem with Rule. – Szabolcs Feb 27 '14 at 18:47
  • Please check my updated answer. – Szabolcs Mar 04 '14 at 23:13

1 Answer


Update on 2014-03-04:

This is the same problem that's described here. It is also mentioned in the documentation: the third "possible issue" here. For more details, check those two links.

In short, the cause of the problem is that ifun will have bad performance on the subkernels, and the workaround is this:

DistributeDefinitions[ifun];
ParallelEvaluate[ifun = ifun;]
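
Applied to the benchmark from the question, this means evaluating those two lines once after launching the subkernels and before timing, something like the following (a sketch, not re-timed here; ParallelTable distributes the remaining definitions such as PopulationEvolve and MCMCEvo on its own):

DistributeDefinitions[ifun];
(* re-assign ifun on each subkernel so the interpolating function no longer evaluates slowly there *)
ParallelEvaluate[ifun = ifun;];
AbsoluteTiming[ParallelTable[PopulationEvolve[pop100, 0.8, 1], {i, 4}]][[1]]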

I'll leave the old answer below.


This is not a solution to the problem, nor a full answer; it just tracks down the cause of the slowdown. Personally I believe it to be due to a bug.


This reminded me of a problem where Rule (which is supposed to be an inert head) seems to be evaluated much slower on subkernels than on the main kernel. I don't know why this happens; it's all very mysterious and slightly disturbing.

Generally, I would expect

AbsoluteTiming[expr]

and

ParallelEvaluate[AbsoluteTiming[expr], First@Kernels[]]

to take the same time to evaluate. In both cases we're measuring only the evaluation time of expr, not any time spent communicating between kernels, and in both cases we're evaluating the same expression on a single kernel. Yet if expr contains lots of Rules, the subkernel evaluation takes much longer.

If expr is the invocation of an interpolating function, it also takes much longer on a subkernel. Here's a test case:

First, make an interpolating function:

ifun = Interpolation[Accumulate@RandomVariate[NormalDistribution[], 100000]];

Launch a single subkernel and DistributeDefinitions. (Note that ParallelEvaluate doesn't automatically DistributeDefinitions.)

LaunchKernels[1]
DistributeDefinitions[ifun]

Main kernel evaluation:

AbsoluteTiming@Do[ifun[RandomReal[{1, 100000}]], {10000}]

(* ==> {0.038929, Null} *)

Subkernel evaluation:

ParallelEvaluate@AbsoluteTiming@Do[ifun[RandomReal[{1, 100000}]], {10000}]

(* ==> {{4.127681, Null}} *)

The latter one is much slower!

Can anyone explain this behaviour?

Szabolcs
  • A wild guess: the interpolation function is not properly distributed. – Ajasja Feb 27 '14 at 20:56
  • @Ajasja If instead of using RandomReal I evaluate something deterministic and compare the results from the main kernel and the subkernel, they're the same. So at least enough is distributed that the interpolating function is working. To make sure that the result computed on the subkernel is correct (and it doesn't just return something that will evaluate further on the main kernel), I checked using something like ParallelEvaluate[Print[expr]]. – Szabolcs Feb 27 '14 at 21:06
  • I encountered similar behaviour in late 2010 using version 7. I asked WRI tech support about this (and other things). Just reading the emails, it appears that the reason for this slowness was not answered directly -- and I know I eventually abandoned parallelization due to a work time constraint and stuck with a single kernel. I also note that WRI advised at the time that "Also, please note that in version 8, we don't need to perform distribute definitions" – Mike Honeychurch Feb 27 '14 at 21:07
  • @Mike Thanks for looking it up! Regarding DistributeDefinitions, functions like ParallelDo and ParallelTable do it automatically (in versions >= 8), but ParallelEvaluate is special in that it doesn't do this. – Szabolcs Feb 27 '14 at 21:10
  • @Szabolcs I was using ParallelMap and ParallelTable so I guess that is the context in which the DistributeDefinitions statement was made to me. – Mike Honeychurch Feb 27 '14 at 21:30
  • @MikeHoneychurch personally I would not say the automatic definition distribution is reliable enough to bother with. It is true that these functions at least attempt to distribute definitions automatically, but it's easy to confuse them into distributing too much or too little. In my opinion it was a mistake to try to automate this as parallelization really does need to be under the full control of the user. – Oleksandr R. Feb 28 '14 at 00:40
  • @OleksandrR. It's part of the "one function does all" philosophy. I think the goal was to let Parallelize just do everything. I also don't think that this is the right way to go. While simple in theory, I mostly abandoned Parallelize in favour of more specific functions, and we often get questions here about why Parallelize doesn't "just work" on some arbitrary piece of code. – Szabolcs Feb 28 '14 at 00:43
  • Per your latest revision: I thought that was the reason (I don't know why I didn't mention it earlier; sorry). CompiledFunctions that call a LibraryFunction are also subject to this problem, as the LibraryFunction isn't LibraryFunctionLoaded on each subkernel as a result of the DistributeDefinitions. Another example, I suppose, of why trying to distribute definitions automatically is often going to end in miserable failure. – Oleksandr R. Mar 04 '14 at 23:55
  • I'll try this as soon as I have time, thanks! – CupiDio Mar 06 '14 at 22:19