Original Example
Consider function f, a parallelized version fPar, and a coarsest-grained parallelized version fParCG below.
f[l_] := Map[Function[x, x[[#]] & /@ ConstantArray[Range[l], l]],
  Permutations[Range[l]]]
fPar[l_] := ParallelMap[Function[x, x[[#]] & /@ ConstantArray[Range[l], l]],
  Permutations[Range[l]]]
fParCG[l_] := ParallelMap[Function[x, x[[#]] & /@ ConstantArray[Range[l], l]],
  Permutations[Range[l]], Method -> "CoarsestGrained"]
The functions have the same output: for each permutation of Range[l], the result contains a sublist of l identical copies of that permutation.
f[3] // Column
(*
{{1,2,3},{1,2,3},{1,2,3}}
{{1,3,2},{1,3,2},{1,3,2}}
{{2,1,3},{2,1,3},{2,1,3}}
{{2,3,1},{2,3,1},{2,3,1}}
{{3,1,2},{3,1,2},{3,1,2}}
{{3,2,1},{3,2,1},{3,2,1}}
*)
I was surprised to see that both parallelized versions are slower.
f[9] // MaxMemoryUsed // AbsoluteTiming
(* {1.38304, 496422488} *)
fPar[9] // MaxMemoryUsed // AbsoluteTiming
(* {2.81347, 504604072} *)
fParCG[9] // MaxMemoryUsed // AbsoluteTiming
(* {2.46533, 561971768} *)
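To rule out subkernel launch overhead as the explanation, the kernels can be launched and the function warmed up before timing. A minimal sketch, assuming the default parallel configuration:
LaunchKernels[];  (* start the subkernels up front *)
fPar[9];          (* warm-up run so one-time launch costs are excluded *)
fPar[9] // MaxMemoryUsed // AbsoluteTiming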
What in particular makes f not well-parallelizable?
There seems to be little overhead, and the computations are independent. Function f has the form Map[A, B], where each application of A to an element of B takes the same amount of time, so the work can be split equally, easily, and independently across the subkernels. This is why I expected at least the coarsest-grained version to perform better.
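As a sanity check on the uniform-cost claim, the mapped function can be timed on a single permutation (a sketch; inner and perm are ad hoc names, and RepeatedTiming averages over many runs):
inner = Function[x, x[[#]] & /@ ConstantArray[Range[9], 9]];
perm = RandomSample[Range[9]];  (* one representative permutation *)
RepeatedTiming[inner[perm]]     (* per-element cost; should be nearly identical across permutations *)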
Notes
- Yes, I have read "Why won't Parallelize speed up my code?". I am wondering what principle from the answer to that question my function f violates such that it is not apt for parallelization.
- Secondly, I am not looking for a more efficient form of f. Function f is an inane way of generating its output. I am wondering what makes f, as it is, not well-parallelizable.
Another Example
Courtesy of Michael E2 in the comments...
Table[p, {p, Permutations[Range[9]]}]; // AbsoluteTiming
(*{0.056542, Null}*)
ParallelTable[p, {p, Permutations[Range[9]]}]; // AbsoluteTiming
(* {4.74558, Null} *)
This disparity in speed is troubling to me. (As noted in the accepted answer, ParallelTable[] unpacks here, whereas Table[] does not. This still troubles me.)
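The unpacking can be observed directly with the standard packing diagnostics. A minimal sketch (the exact messages emitted may vary by version):
Developer`PackedArrayQ[Permutations[Range[9]]]  (* True: the input is a packed array *)
On["Packing"]   (* report whenever a packed array gets unpacked *)
ParallelTable[p, {p, Permutations[Range[9]]}];  (* watch for unpacking messages *)
Off["Packing"]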
Comments
- Map inside Map. It is bad practice. – Rom38 Jul 17 '20 at 06:59
- […] MaxMemoryUsed[]). Nevertheless, are you saying the problem with f is that Map[] is called inside the first argument? – Just Some Old Man Jul 17 '20 at 07:04
- […] Parallelize and friends as so very inefficient at times. Here I think the issue is that you generate (and destroy and copy) bazillions of copies of ConstantArray[Range[l], l], which, as Rom38 said, is a memory-bound operation. Please note that MaxMemoryUsed does in general not show the full amount of memory that is used during computations, in particular if some intermediate operations are delegated to compiled libraries. And we have to assume that such a delegation is done by built-in functions. – Henrik Schumacher Jul 17 '20 at 09:15
- […] Map[A, B]. It seems to me the parts of B being operated on can be split cleanly, equally, and independently between n kernels with virtually no overhead. – Just Some Old Man Jul 17 '20 at 18:42
- […] On["Packing"], is that ParallelMap unpacks the array generated by Permutations, so that definitely counts against efficiency. This probably happens as part of the data transfer process. – Sjoerd Smit Jul 17 '20 at 21:18
- […] f does not unpack, but the parallelized versions do. That is a surprise to me, and I did not consider ParallelMap[] unpacking something that Map[] does not. I wish this was mentioned in "Why won't Parallelize speed up my code?". To be honest, I think that is important to note. If you want to put what you said in an answer, I would be glad to accept it. – Just Some Old Man Jul 17 '20 at 21:29
- […] f and fPar. – Michael E2 Jul 17 '20 at 22:09
- […] ParallelMap and have the subarrays be sent to the subkernels as packed arrays? Yet fPar[9] returns a packed array. I wonder if WRI would really design it this way. – Michael E2 Jul 17 '20 at 22:27
- […] ParallelTable[p, {p, Permutations[Range[9]]}]; // AbsoluteTiming. – Michael E2 Jul 17 '20 at 22:36
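Following up on Michael E2's question about sending packed subarrays to the subkernels: a hedged sketch of a manual coarse split (fParManual is a hypothetical name; Partition keeps each chunk packed on the main kernel, though whether packing survives the transfer is precisely what is in question):
fParManual[l_] := Join @@ ParallelMap[
  Function[chunk, Map[Function[x, x[[#]] & /@ ConstantArray[Range[l], l]], chunk]],
  Partition[Permutations[Range[l]], UpTo[Ceiling[l!/$KernelCount]]]]
fParManual[9] === f[9]  (* should be True; whether it is faster is left as an experiment *)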