
I'm trying to parallelize a function with ParallelTable, but the speedup is not very good.

I have 16 parallel kernels on my compute node:

$Version
(*"9.0 for Linux x86 (64-bit) (November 20, 2012)"*)

LaunchKernels[]
(*
{KernelObject[1, "local"], KernelObject[2, "local"], KernelObject[3, "local"],KernelObject[4, "local"], KernelObject[5, "local"], KernelObject[6, "local"], KernelObject[7, "local"], KernelObject[8, "local"], KernelObject[9, "local"], KernelObject[10, "local"], KernelObject[11, "local"], KernelObject[12, "local"], KernelObject[13, "local"], KernelObject[14, "local"], KernelObject[15, "local"], KernelObject[16, "local"]}
*)

Using an example from the documentation, I do get about a 16X speedup:

Table[Pause[0.5];f[i],{i,16}]//AbsoluteTiming
(*{8.003956, Null}*)

ParallelTable[Pause[0.5];f[i],{i,16}]//AbsoluteTiming
(*{0.508174, Null}*)

However, my code only gets about a 2X speedup:

hz = 2*0.375*^-9; c = 2.997924580*^8; ht = hz/(1.01*c)*1.*^15; N0 = 60000; hbar = 1.0545716*^-34; ωx = 2.96265*^1; nx = 5.; Ex = 2.74*^8; Tx = 2 π/ωx; β2 = 226.161; d13 = 8.35066*^-31; Ti = 2.64516;

Ω13[t_, β1_] := Piecewise[{{((Ex*d13)/(hbar*1.*^15))*Cos[(ωx*(t - β1))/(2*nx)]^2* Sin[ωx*(t - β1)], β1 - nx*(Tx/2.) <= t <= β1 + nx*(Tx/2.)}, {0., True}}]

Ω13cf = 
  Compile[{{t, _Real}, {τ1, _Real}}, Evaluate@Ω13[t, τ1], RuntimeAttributes -> {Listable}];

Table[Ω13cf[t, τ1], Evaluate@{τ1, β2 - 80 Ti, β2 + 10 Ti, 0.05 Ti}, Evaluate@{t, ht, ht*N0, ht}] // Developer`PackedArrayQ // AbsoluteTiming
(*{14.805294, True}*)

Table[Ω13cf[t, τ1], Evaluate@{τ1, β2 - 80 Ti, β2 + 10 Ti, 0.05 Ti}, Evaluate@{t, ht, ht*N0, ht}] // Developer`PackedArrayQ // AbsoluteTiming
(*{14.799250, True}*)

ParallelTable[Ω13cf[t, τ1], Evaluate@{τ1, β2 - 80 Ti, β2 + 10 Ti, 0.05 Ti}, Evaluate@{t, ht, ht*N0, ht}] // Developer`PackedArrayQ // AbsoluteTiming
(*{7.714471, True}*)

ParallelTable[Ω13cf[t, τ1], Evaluate@{τ1, β2 - 80 Ti, β2 + 10 Ti, 0.05 Ti}, Evaluate@{t, ht, ht*N0, ht}] // Developer`PackedArrayQ // AbsoluteTiming
(*{3.088646, True}*)

Questions:

  1. Is it possible to get about a 16X speedup using ParallelTable, and if so, how?
  2. Why is the second ParallelTable run about 2X faster than the first?

Update

Mr.Wizard suggested a very useful discussion on how to parallelize this kind of problem efficiently. As I understand it, the basic idea is to use ParallelMap instead of ParallelTable. So I tried this approach, but it turns out the result is not a packed array, and ParallelMap is more than 10X slower than the un-parallelized Table:

ls = Tuples[{Range[ht, ht*N0, ht], Range[β2 - 80 Ti, β2 + 10 Ti, 0.05 Ti]}];
Developer`PackedArrayQ[ls]
(*True*)

Map[Ω13cf @@ # &, ls] // Developer`PackedArrayQ // AbsoluteTiming
(*{181.030360,False}*)

ParallelMap[Ω13cf @@ # &, ls] // Developer`PackedArrayQ // AbsoluteTiming
(*no return before I kill it after running for 5 minutes*)
xslittlegrass
  • Out of curiosity, how many CPU cores does your computer have? – Jonie Sep 02 '13 at 01:15
  • @Jonie I was using Mathematica on an HPC; it has two 8-core Sandy Bridge Xeon 64-bit processors on one node, so that's 16 cores in total. – xslittlegrass Sep 02 '13 at 01:19
  • I've tried running it on my work machine, which gives some out of memory issues so I've lowered N0 to 10000. Run times are 8, 8, 35, 30 seconds. So 4x slower on the parallel table (4 cores). I'll try again once I'm home. – Jonie Sep 02 '13 at 04:49
  • Closely related: (20713) – Mr.Wizard Sep 02 '13 at 07:13
  • @Mr.Wizard thanks for the link, it's very useful. I tried the ParallelMap approach, but it turns out much slower than the un-parallelized Table version. Please see my update. Could you give me some suggestions or consider reopening it? – xslittlegrass Sep 02 '13 at 15:48
  • First, you can avoid unpacking with Ω13cf[#[[1]], #[[2]]] &. Second, you can take advantage of your Listable (and auto-parallelized) function with something like Ω13cf[#[[1]], #[[2]]] &@ Transpose @ ls, although this seems a tad better: ParallelMap[Ω13cf[Range[ht, ht*N0, ht], #] &, Range[β2 - 80 Ti, β2 + 10 Ti, 0.05 Ti]]. – Michael E2 Sep 02 '13 at 16:29
  • @MichaelE2 I really like the second point, except it still unpacks the results: ParallelMap[Ω13cf[Range[ht, ht*N0, ht], #] &, Range[β2 - 80 Ti, β2 + 10 Ti, 0.05 Ti]] // Developer`PackedArrayQ gives me False. Is there a way to avoid unpacking? For the first point, if I use Ω13cf[#[[1]], #[[2]]] &, then the definition of the function would be something like Ω13cf = Compile[{{arg, _Real, 1}}, Evaluate@Ω13[arg]], but sometimes the function takes different types of arguments, for example Ω13[x1_Integer, x2_Real]. In this mixed-type case, how do I define the compiled function Ω13cf? Thanks. – xslittlegrass Sep 02 '13 at 16:44
  • I think ParallelMap & ParallelTable yield unpacked arrays, by their nature. You can use Developer`ToPackedArray, but then the time it takes is greater than the Ω13cf[#[[1]], #[[2]]] & method. (This last one works as presented, because of RuntimeAttributes -> Listable; you don't need to redefine it.) For the last question, use a function and patterns to call an appropriate compiled function (see the sketch after these comments). – Michael E2 Sep 02 '13 at 17:49
  • The function withModifiedMemberQ in this answer can be used to speed up the ParallelMap method in my comment. – Michael E2 Sep 02 '13 at 17:52
  • Reopened on request. – Mr.Wizard Sep 02 '13 at 23:47
  • Just out of curiosity: Are you guys sure that Mathematica uses all available kernels, or is it in any way restricted to a certain number of kernels depending on the license you own (home, enterprise, etc.)? – Wizard Sep 07 '13 at 13:04
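
A hypothetical sketch of the pattern-dispatch idea from Michael E2's comment above (the mixed-type variant Ω13cfIR and the dispatcher Ω13dispatch are illustrative names, not from the discussion): define one compiled function per argument-type signature, and an ordinary function that dispatches on patterns to the matching one.

Ω13cfIR = Compile[{{x1, _Integer}, {x2, _Real}}, Evaluate@Ω13[x1, x2], RuntimeAttributes -> {Listable}]; (* hypothetical mixed-type variant *)

Ω13dispatch[x1_Integer, x2_Real] := Ω13cfIR[x1, x2] (* integer first argument *)
Ω13dispatch[x1_Real, x2_Real] := Ω13cf[x1, x2]      (* all-real arguments *)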

1 Answer


I think part of the reason you are not seeing much of a speed-up is that your function is computationally very light, so you spend more time moving data between kernels than you spend actually computing.

My suggestion, therefore, is not to use multiple kernels but to use multiple threads via Compile.

Here's your original function running on my quad-core laptop

In[1]:= hz = 2*0.375*^-9; c = 2.997924580*^8; ht = hz/(1.01*c)*1.*^15; N0 = 60000; hbar = 1.0545716*^-34; ωx = 2.96265*^1; nx = 5.; Ex = 2.74*^8; Tx = 2 π/ωx; β2 = 226.161; d13 = 8.35066*^-31; Ti = 2.64516;

Ω13[t_, β1_] := Piecewise[{{((Ex*d13)/(hbar*1.*^15))*Cos[(ωx*(t - β1))/(2*nx)]^2*Sin[ωx*(t - β1)], β1 - nx*(Tx/2.) <= t <= β1 + nx*(Tx/2.)}, {0., True}}]

Ω13cf = Compile[{{t, _Real}, {τ1, _Real}}, Evaluate@Ω13[t, τ1], RuntimeAttributes -> {Listable}];

original = Table[Ω13cf[t, τ1], Evaluate@{τ1, β2 - 80 Ti, β2 + 10 Ti, 0.05 Ti}, Evaluate@{t, ht, ht*N0, ht}]; // AbsoluteTiming

Out[4]= {24.171383, Null}

Change the compilation of the function to

In[5]:= Ω13cf = Compile[{{t, _Real}, {τ1, _Real}}, Evaluate@Ω13[t, τ1], RuntimeAttributes -> {Listable}, Parallelization -> True, CompilationTarget -> "C"];

Perform the calculation like this:

AbsoluteTiming[
 T1 = Table[Table[τ1, {t, ht, ht*N0, ht}], {τ1, β2 - 80 Ti, β2 + 10 Ti, 0.05 Ti}];
 T = Table[Table[t, {t, ht, ht*N0, ht}], {τ1, β2 - 80 Ti, β2 + 10 Ti, 0.05 Ti}];
 faster = Ω13cf[T, T1];
 ]

That is, I construct Tables of all of the input arguments ahead of time and pass them to the compiled function at once. On my quad-core it's almost twice as fast, so hopefully you'll see better scaling on your 16 cores.

Out[6]= {13.280760, Null}

Quick sanity check to make sure that I haven't broken the results:

In[7]:= faster == original

Out[7]= True

This is not going to scale well over more cores. It takes 8 seconds just to construct the lists that form the input arguments:

In[5]:= AbsoluteTiming[
 T1 = Table[Table[τ1, {t, ht, ht*N0, ht}], {τ1, β2 - 80 Ti, β2 + 10 Ti, 0.05 Ti}];
 T = Table[Table[t, {t, ht, ht*N0, ht}], {τ1, β2 - 80 Ti, β2 + 10 Ti, 0.05 Ti}];
 ]

Out[5]= {8.006458, Null}
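
As an aside, here is a sketch of one way to cut that setup cost (this alternative is not from the original answer): build the same packed argument arrays with Range and ConstantArray instead of nested Table.

tvals = Range[ht, ht*N0, ht];                        (* the t grid *)
τvals = Range[β2 - 80 Ti, β2 + 10 Ti, 0.05 Ti];      (* the τ1 grid *)
T = ConstantArray[tvals, Length[τvals]];             (* each row is the full t grid *)
T1 = Transpose[ConstantArray[τvals, Length[tvals]]]; (* each row holds one constant τ1 *)

These should match the Table-built arrays element for element, so Ω13cf[T, T1] can be called on them unchanged.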

You may do better moving the whole thing into Compile, but I haven't gone that far.
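
A minimal sketch of that idea (the name Ω13row and the tmin/dt/nt parametrization are mine, not from the answer): compile a Listable function that generates one whole row of the result for a given τ1, so no large argument arrays are built at the top level. The With wrapper injects the numeric constants into the compiled body, so nothing falls back to MainEvaluate, which would disable Parallelization.

Ω13row = With[{a = (Ex*d13)/(hbar*1.*^15), w = ωx, n = nx, half = nx*(Tx/2.)},
   Compile[{{τ1, _Real}, {tmin, _Real}, {dt, _Real}, {nt, _Integer}},
    Table[
     Module[{t = tmin + (i - 1)*dt}, (* rebuild the t grid in place *)
      If[τ1 - half <= t && t <= τ1 + half,
       a*Cos[(w*(t - τ1))/(2*n)]^2*Sin[w*(t - τ1)],
       0.]],
     {i, nt}],
    RuntimeAttributes -> {Listable}, Parallelization -> True,
    CompilationTarget -> "C"]];

Listability then threads over the τ1 grid in parallel threads and returns the whole packed 2D array in one call:

fastest = Ω13row[Range[β2 - 80 Ti, β2 + 10 Ti, 0.05 Ti], ht, ht, N0];

Comparing fastest == faster, as in the sanity check above, should confirm the values are unchanged.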

WalkingRandomly