
I would like to do Monte Carlo simulations of various regressions: basically, running OLS under various conditions to test approaches.

On my Mac Pro (2× 2.8 GHz quad-core Xeon, 32 GB DDR2 RAM @ 667 MHz), the baseline is 56 seconds for 10'000 lines and 6'000 iterations.

ols = {};
beta = {1, 2, 3};  (* true coefficients *)
size = 10000;      (* observations per regression *)
iterations = 6000; (* Monte Carlo replications *)
out = AbsoluteTiming[
   Table[
    (* design matrix: three regressors per observation *)
    x1 = Table[{RandomVariate[NormalDistribution[0, 1]],
       RandomVariate[NormalDistribution[0, 1]]^2,
       RandomVariate[NormalDistribution[0, 1]]}, {size}];
    (* error term *)
    u1 = Table[RandomVariate[NormalDistribution[0, 1]], {size}];
    y1 = x1.beta + u1;     (* dependent variable *)
    LeastSquares[x1, y1],  (* OLS estimate *)
    {iterations}]];
out[[1]]
out[[2]] // Dimensions

51.895992

{6000, 3}

Simply by switching to ParallelTable, I get down to about 12.5 seconds.
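For reference, that change is roughly just wrapping the same loop body in ParallelTable (ParallelTable distributes the definitions of beta and size to the subkernels automatically):

out = AbsoluteTiming[
   ParallelTable[
    x1 = Table[{RandomVariate[NormalDistribution[0, 1]],
       RandomVariate[NormalDistribution[0, 1]]^2,
       RandomVariate[NormalDistribution[0, 1]]}, {size}];
    u1 = Table[RandomVariate[NormalDistribution[0, 1]], {size}];
    y1 = x1.beta + u1;
    LeastSquares[x1, y1], {iterations}]];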

After reading "Transferring a large amount of data in parallel calculations", I updated my code to use the suggested modifications:

  • a modified MemberQ that doesn't do the unpacking (goes down to 12.38 seconds)
  • Method -> "CoarsestGrained" (goes down to 11.9 seconds)
  • both (stays at 11.9 seconds)

$IterationLimit = 100000;
out = AbsoluteTiming@withModifiedMemberQ@ParallelTable[
(...)
     , {iterations}, Method -> "CoarsestGrained"];
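
The elided body is just the loop body from the serial version above (and withModifiedMemberQ is defined as in the linked question), so in full the call reads roughly:

$IterationLimit = 100000;
out = AbsoluteTiming@withModifiedMemberQ@ParallelTable[
     x1 = Table[{RandomVariate[NormalDistribution[0, 1]],
        RandomVariate[NormalDistribution[0, 1]]^2,
        RandomVariate[NormalDistribution[0, 1]]}, {size}];
     u1 = Table[RandomVariate[NormalDistribution[0, 1]], {size}];
     y1 = x1.beta + u1;
     LeastSquares[x1, y1]
     , {iterations}, Method -> "CoarsestGrained"];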

I am wondering what I should do for further performance gains. I'd like to do 10'000 lines and 10'000 iterations in less than 10 seconds if possible.

guylhem
  • Performance tip: Table[RandomVariate[NormalDistribution[0, 1]], {size}] should be RandomVariate[NormalDistribution[0, 1], size]. There is significant overhead in each call to RandomVariate, so try to generate as many numbers with one call as possible (see the sketch after these comments). – Szabolcs Mar 28 '14 at 19:02
  • +1 for clear, minimal example and for doing prior research. I wish every question were written like this one. – Szabolcs Mar 28 '14 at 19:04
  • I concur with @Szabolcs. When doing MC you should always try to think (re-think) of your problem in a way that allows you to generate all random numbers in one list and then think about how to test or scan the list. Performance improvements can be 1, 2 or 2+ orders of magnitude. – Mike Honeychurch Mar 28 '14 at 23:27
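
A quick benchmark sketch illustrating the tip above (absolute timings are machine-dependent, but the single vectorized call is typically one to two orders of magnitude faster):

size = 10000;
(* 100 runs of size separate RandomVariate calls *)
First@AbsoluteTiming[Do[Table[RandomVariate[NormalDistribution[0, 1]], {size}], {100}]]
(* 100 runs of one vectorized call generating size draws at once *)
First@AbsoluteTiming[Do[RandomVariate[NormalDistribution[0, 1], size], {100}]]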

1 Answer


The key to a speedup here is generating the whole set of random numbers in one go instead of calling RandomVariate repeatedly.

Generally, instead of

Table[RandomVariate[...], {size}]

use

RandomVariate[..., size]

RandomVariate can also generate a multidimensional array of random values in one go.
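
For example, a dimension specification like {3, size} returns the whole array in a single call, and the result comes back as a packed array:

RandomVariate[NormalDistribution[0, 1], {3, 5}]  (* a 3 x 5 array of standard-normal draws *)
Developer`PackedArrayQ[%]                        (* True: packed arrays keep the later arithmetic fast *)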

I rewrote your code to do this:

out = AbsoluteTiming[Table[
    Module[{x1, u1, y1},
     (* draw all 3*size regressor values at once; square the second row before transposing *)
     x1 = Transpose[{#1, #2^2, #3} & @@ RandomVariate[NormalDistribution[0, 1], {3, size}]];
     u1 = RandomVariate[NormalDistribution[0, 1], size];
     y1 = x1.beta + u1;
     LeastSquares[x1, y1]
    ], {iterations}]];

The original code also runs in ~56 seconds on my machine (4-core i7 laptop CPU). The modified one runs in ~17-18 seconds.

By changing Table to ParallelTable the timing drops to about 4 seconds when using 4 kernels, and to about 3 seconds when using 8 kernels (the default due to hyperthreading).

I'm quite confused by this result, and I have no idea why running on 4 kernels gives a speedup of more than 4x (!!), from 17-18 seconds down to 4. The results seem to be correct.

Can anyone explain the mystery?
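
One way to investigate is to pin the number of subkernels explicitly and re-time. In this sketch, mcStep is just a helper wrapping the loop body so it isn't repeated:

mcStep[] := Module[{x1, u1, y1},
   x1 = Transpose[{#1, #2^2, #3} & @@
      RandomVariate[NormalDistribution[0, 1], {3, size}]];
   u1 = RandomVariate[NormalDistribution[0, 1], size];
   y1 = x1.beta + u1;
   LeastSquares[x1, y1]];

CloseKernels[];                             (* shut down all running subkernels *)
LaunchKernels[4];                           (* relaunch exactly four *)
DistributeDefinitions[mcStep, beta, size];  (* push the helper and parameters to the subkernels *)
First@AbsoluteTiming[ParallelTable[mcStep[], {iterations}]]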


Example of the speedup (note the leading factor of 4 in the first timing: it scales the 4-kernel parallel timing up for direct comparison with the serial run):

In[24]:= Table[
 4 First@AbsoluteTiming[
    ParallelTable[
     Module[{x1, u1, y1}, 
      x1 = Transpose[{#1, #2^2, #3} & @@ 
         RandomVariate[NormalDistribution[0, 1], {3, size}]];
      u1 = RandomVariate[NormalDistribution[0, 1], size];
      y1 = x1.beta + u1;
      LeastSquares[x1, y1]], {iterations}]], {10}]

Out[24]= {17.97667, 16.76192, 16.86759, 16.63448, 16.93627, 16.67996, 16.63543, 16.73172, 16.73115, 16.81402}

In[25]:= Table[
 First@AbsoluteTiming[
   Table[Module[{x1, u1, y1}, 
     x1 = Transpose[{#1, #2^2, #3} & @@ 
        RandomVariate[NormalDistribution[0, 1], {3, size}]];
     u1 = RandomVariate[NormalDistribution[0, 1], size];
     y1 = x1.beta + u1;
     LeastSquares[x1, y1]], {iterations}]], {10}]

Out[25]= {18.058032, 18.030199, 18.076218, 18.108567, 18.018368, 18.017873, 18.043030, 18.001049, 18.065033, 18.276421}

The parallel version seems to be consistently faster by slightly more than a factor of 4x. I'm not sure what's happening, but others seem unable to reproduce this.

Szabolcs