
Original Example

Consider function f, a parallelized version fPar, and a coarsest-grained parallelized version fParCG below.

f[l_] := Map[Function[x, x[[#]] & /@ ConstantArray[Range[l], l]],
  Permutations[Range[l]]]

fPar[l_] := ParallelMap[Function[x, x[[#]] & /@ ConstantArray[Range[l], l]], Permutations[Range[l]]]

fParCG[l_] := ParallelMap[Function[x, x[[#]] & /@ ConstantArray[Range[l], l]], Permutations[Range[l]], Method -> "CoarsestGrained"]

All three functions produce the same output: for each permutation of Range[l], a list of l copies of that permutation.

f[3] // Column

(*
  {{1,2,3},{1,2,3},{1,2,3}}
  {{1,3,2},{1,3,2},{1,3,2}}
  {{2,1,3},{2,1,3},{2,1,3}}
  {{2,3,1},{2,3,1},{2,3,1}}
  {{3,1,2},{3,1,2},{3,1,2}}
  {{3,2,1},{3,2,1},{3,2,1}}
*)

I was surprised to see the parallelized versions are both slower.

f[9] // MaxMemoryUsed // AbsoluteTiming
(* {1.38304, 496422488} *)

fPar[9] // MaxMemoryUsed // AbsoluteTiming
(* {2.81347, 504604072} *)

fParCG[9] // MaxMemoryUsed // AbsoluteTiming
(* {2.46533, 561971768} *)

What in particular makes f not well-parallelizable?

There seems to be little overhead, and the computations are independent. Function f is of the form Map[A, B], where each application of A to an element of B takes the same amount of time, so the work can be split equally, easily, and independently across different kernels. This is why I was expecting at least the coarsest-grained version to perform better.
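To make the split I have in mind concrete, here is a rough sketch (fManual is just an illustrative name, not part of the code above). It assumes subkernels have already been launched with LaunchKernels[] and simply cuts the permutation list into one contiguous chunk per kernel, which is roughly the division of labor I expected Method -> "CoarsestGrained" to arrange:

fManual[l_] := Module[{perms, chunks},
  perms = Permutations[Range[l]];
  (* one contiguous chunk per subkernel; Max[..., 1] guards against $KernelCount == 0 *)
  chunks = Partition[perms, UpTo[Ceiling[Length[perms]/Max[$KernelCount, 1]]]];
  (* each subkernel receives a whole chunk and Maps over it locally *)
  Join @@ ParallelMap[
    Function[chunk,
     Map[Function[x, x[[#]] & /@ ConstantArray[Range[l], l]], chunk]],
    chunks]]

fManual[9] === f[9] should give True; whether it is any faster is beside the point here, it just spells out the kind of split I mean.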


Notes

  • Yes, I have read "Why won't Parallelize speed up my code?". I am wondering what principle from the answer to that question my function f violates such that it is not apt for parallelization.
  • Secondly, I am not looking for a more efficient form of f. Function f is an inane way of generating its output. I am wondering what makes f, as it is, not well-parallelizable.

Another Example

Courtesy of Michael E2 in the comments...

Table[p, {p, Permutations[Range[9]]}]; // AbsoluteTiming
(*{0.056542, Null}*)

ParallelTable[p, {p, Permutations[Range[9]]}]; // AbsoluteTiming
(* {4.74558, Null} *)

This disparity in speed is troubling to me. (As noted in the accepted answer, ParallelTable[] unpacks here, whereas Table[] does not. This still troubles me.)
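To see the unpacking directly, here is a quick check using the packing diagnostics (my addition for illustration; it is not part of Michael E2's comment, and the exact messages may vary by version):

On["Packing"];
Table[p, {p, Permutations[Range[9]]}];          (* stays quiet: no unpacking messages *)
ParallelTable[p, {p, Permutations[Range[9]]}];  (* expected to emit Developer`FromPackedArray unpacking messages, per the accepted answer *)
Off["Packing"];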

Just Some Old Man
  • Parallelization always requires more memory because each variable used will be duplicated for the parallel kernels. So you pay memory consumption for an almost proportional speedup. You have used Map inside Map; it is bad practice. – Rom38 Jul 17 '20 at 06:59
  • @Rom38 If my memory serves me right, that has not been my experience. Though I have encountered many cases where parallelization consumes more memory, I have also encountered many cases where parallelization consumes less memory (as measured by MaxMemoryUsed[]). Nevertheless, are you saying the problem with f is that Map[] is called inside the first argument? – Just Some Old Man Jul 17 '20 at 07:04
  • To be honest, it is a mystery to me too why Parallelize and friends are so very inefficient at times. Here I think the issue is that you generate (and destroy and copy) bazillions of copies of ConstantArray[Range[l], l], which, as Rom38 said, is a memory-bound operation. Please note that MaxMemoryUsed does not, in general, show the full amount of memory that is used during computations, in particular if some intermediate operations are delegated to compiled libraries. And we have to assume that such delegation is done by built-in functions. – Henrik Schumacher Jul 17 '20 at 09:15
  • The operation you're parallelizing is simply not expensive enough to warrant the expense of shuffling all this data around between the master and slave kernels. I don't think there's more to it than that. Parallelizing to slave kernels is worthwhile mostly for expensive functions that take small inputs and generate small outputs. – Sjoerd Smit Jul 17 '20 at 09:20
  • @SjoerdSmit I don't see the expense you are seeing. See where I mention the function is of the form Map[A, B]. It seems to me the parts of B being operated on can be split cleanly, equally, and independently between n kernels with virtually no overhead. – Just Some Old Man Jul 17 '20 at 18:42
  • @JustSomeOldMan Yes, but the inputs/outputs still need to be communicated between the master and slave kernels. There is still overhead even for simple maps like these. Another thing you might notice if you use On["Packing"] is that ParallelMap unpacks the array generated by Permutations, so that definitely counts against efficiency. This probably happens as part of the data transfer process. – Sjoerd Smit Jul 17 '20 at 21:18
  • @SjoerdSmit Thank you, that seems to be the big cause in my mind. You're right, the unparallelized version f does not unpack, but the parallelized versions do. That is a surprise to me; I did not expect ParallelMap[] to unpack something that Map[] does not. I wish this were mentioned in "Why won't Parallelize speed up my code?". To be honest, I think that is important to note. If you want to put what you said in an answer, I would be glad to accept it. – Just Some Old Man Jul 17 '20 at 21:29
  • @SjoerdSmit The unpacking happens only once and costs 0.05 sec? On my machine it does, and it does not account for the difference in speed between f and fPar. – Michael E2 Jul 17 '20 at 22:09
  • @MichaelE2 If I understood correctly, sending large amounts of unpacked data is very inefficient (as noted in https://mathematica.stackexchange.com/questions/48295/why-wont-parallelize-speed-up-my-code). So the unpacking itself may not be the worst, but it does slow down everything that comes after. – Sjoerd Smit Jul 17 '20 at 22:12
  • @SjoerdSmit It's conceivable. You're suggesting there's no way to use a packed array as the second argument of ParallelMap and have the subarrays be sent to the subkernels as packed arrays? Yet fPar[9] returns a packed array. I wonder if WRI would really design it this way. – Michael E2 Jul 17 '20 at 22:27
  • Maybe you're right. This seems pretty sorry: ParallelTable[p, {p, Permutations[Range[9]]}]; // AbsoluteTiming. – Michael E2 Jul 17 '20 at 22:36

1 Answer


As I noted in a comment, it seems that ParallelMap unpacks packed arrays when sending the data to the slave kernels:

data = Permutations[Range[9]];
Developer`PackedArrayQ[data]

(* True *)

This simple Map will not generate any messages about packing:

On["Packing"];
Map[Total, data];

(* No messages *)

ParallelMap[Total, data]

(* generates Developer`FromPackedArray::unpack message *)

Unpacking of arrays is most likely a significant source of slowdown in parallel maps like this one, since sending unpacked data is much slower, according to this answer.

Edit

Actually, item 3.4 in this answer does mention this problem to some degree and also links to a solution for the reverse problem when the values returned by parallel operations are packed arrays. At any rate, it's good advice to track the packing behavior of your computation when using parallel operations.
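As a minimal sketch of a possible workaround (my own illustration, not taken from the linked answers), one can distribute the large array to the subkernels once with DistributeDefinitions and then parallelize over plain integer indices, so that only small expressions travel between kernels. Whether the data actually stays packed on the subkernels should be verified with On["Packing"] there, and timings will vary by machine:

data = Permutations[Range[9]];
DistributeDefinitions[data];  (* ship the value of data to every subkernel once *)
ParallelTable[Total[data[[i]]], {i, Length[data]}]; // AbsoluteTiming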

Sjoerd Smit