Recently I increased the problem size in my code, and it scales very badly with ParallelTable. With a lot of help from this community, it seems that the problem may be caused by the transfer of large amounts of data between the kernels. Is there a way to monitor the communication between these kernels, for example to see how much data is copied from one kernel to another?
Edit
As requested, here is an example that demonstrates the problem:
Clear["`*"]
LaunchKernels[];
I have 16 kernels on my system:
$KernelCount
(* ==> 16 *)
Define two matrices:
m = 30000; n = 640;
a = RandomComplex[{0., 1. + I}, {n, m}];
b = RandomComplex[{0., 1. + I}, {n, m}];
Define a function that does some simple algebraic calculation; the details can be ignored.
SelectbyWRange[A_, {WMin_, WMax_}, {TakeWMin_, TakeWMax_}] :=
 Module[{lthA, nMax, nMin},
  lthA = Length[A];
  nMin = Round[-((-WMax + lthA WMin)/(WMax - WMin)) - ((1 - lthA) TakeWMin)/(WMax - WMin)];
  nMax = Round[-((-WMax + lthA WMin)/(WMax - WMin)) - ((1 - lthA) TakeWMax)/(WMax - WMin)];
  Transpose[{
    Table[TakeWMin + n*(TakeWMax - TakeWMin)/(nMax - nMin), {n, 0, nMax - nMin}],
    Take[A, {nMin, nMax}]}]]
g[{x_, y_}] :=
SelectbyWRange[-Im[x*Conjugate[y]], {-834., 834.}, {19.5, 20.5}]
Timing Table, ParallelTable, Map, and ParallelMap:
Table[g[{a[[n]], b[[n]]}], {n, 1, Length[a]}]; // AbsoluteTiming
(* ==> {0.390135, Null} *)
ParallelTable[g[{a[[n]], b[[n]]}], {n, 1, Length[a]}]; // AbsoluteTiming
(* ==> {14.352067, Null} *)
Map[g, Transpose[{a, b}]]; // AbsoluteTiming
(* ==> {1.010789, Null} *)
ParallelMap[g, Transpose[{a, b}]]; // AbsoluteTiming
(* ==> {8.101203, Null} *)
ParallelTable[
g[{RandomComplex[{0., 1. + I}, {m}],
RandomComplex[{0., 1. + I}, {m}]}], {n}]; // AbsoluteTiming
(* ==> {0.128660, Null} *)
We can see that ParallelTable[g[{a[[n]], b[[n]]}], {n, 1, Length[a]}] has the worst timing. This may be because the whole of a and b has to be copied to each subkernel, which takes a long time. In the last ParallelTable, by contrast, no data is copied from the master kernel to the subkernels, and it has the best performance. I have no clue why Table is so fast and ParallelMap is so slow. I thought that monitoring the communication between the kernels might help in understanding this behavior.
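One rough way to probe the copying hypothesis, as a sketch rather than a real communication monitor: measure how large a and b are with ByteCount, then time an explicit DistributeDefinitions separately from the parallel evaluation. This assumes that ByteCount is a reasonable proxy for the amount of data sent to each subkernel and that the DistributeDefinitions timing is dominated by the transfer; neither assumption is verified here. If a and b are packed arrays of machine complex numbers, each should take about 16 bytes per element, so roughly 640*30000*16 bytes each.

```wolfram
(* Approximate size, in bytes, of the data each subkernel would need.
   Assumption: ByteCount roughly tracks what is sent over the link. *)
ByteCount[a]
ByteCount[b]

(* Time the explicit transfer of a and b (plus g and its helper)
   to all subkernels, separately from the computation itself. *)
AbsoluteTiming[DistributeDefinitions[a, b, g, SelectbyWRange];]

(* Repeat the parallel run now that the definitions are already
   on the subkernels. Assumption: unchanged definitions are not
   sent again, so this timing mostly reflects computation and
   the return of results. *)
AbsoluteTiming[ParallelTable[g[{a[[n]], b[[n]]}], {n, 1, Length[a]}];]
```

If the second timing is much smaller than the 14 seconds measured above, that would be consistent with the transfer of a and b being the bottleneck, though it would not show how the data is split between kernels.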