
Consider the following toy problem:

Q = 10^9;
A = Table[RandomInteger[10], {Q}];
Developer`PackedArrayQ@A
B = Map[N[Sin[#]] &, A]; // AbsoluteTiming
Developer`PackedArrayQ@B
MemoryInUse[]
MaxMemoryUsed[]

True
{105.901936, Null}
True
16022521160
24022519456

But using ParallelMap, even with the custom MemberQ workaround, gives the following:

Q = 10^9;
A = Table[RandomInteger[10], {Q}];
Developer`PackedArrayQ@A
(* Workaround: temporarily redefine MemberQ so that it never sees a packed
   array, to keep the Parallel` tools from unpacking A. *)
withModifiedMemberQ[expr_] :=
  Module[{doneQ, unmatchable},
   Internal`InheritedBlock[{MemberQ},
    Unprotect[MemberQ];
    (* Uncomment to print out the MemberQ calls:
       mq : MemberQ[args___] /; (Print@HoldForm[mq]; True) := mq; *)
    MemberQ[list_, patt_Symbol, args___] /; ! TrueQ[doneQ] :=
     Block[{doneQ = True},
      MemberQ[
       Unevaluated[list] /. _List?Developer`PackedArrayQ -> {unmatchable},
       Unevaluated[patt], args]];
    Protect[MemberQ];
    expr]];
SetAttributes[withModifiedMemberQ, HoldAllComplete];
B = withModifiedMemberQ@
    ParallelMap[N[Sin[#]] &, A]; // AbsoluteTiming
Developer`PackedArrayQ@B
MemoryInUse[]
MaxMemoryUsed[]

True
{533.782398, Null}
True
24027869336
48030873944

We see a 5x drop in performance and a 2x increase in maximum memory usage. Why is this happening? How can it be avoided while keeping the computation parallelized?
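One direction that might help, given here only as an untested sketch (it assumes ParallelCombine splits the list into a few large pieces, applies the function to each piece on a subkernel, and Joins the partial results, and that the pieces stay packed in transit): let the listable Sin act on whole packed blocks instead of shipping one element at a time.

(* Sketch: Sin[N[piece]] gives the same values as mapping N[Sin[#]] & element
   by element, up to machine rounding, but acts on a whole packed block. *)
B = ParallelCombine[Sin[N[#]] &, A]; // AbsoluteTiming
Developer`PackedArrayQ@B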

Edit: the code of the real-life example can be found here. The same problems are observed.

Yasha Gindikin
  • Take a look at this link. A short workaround is at the end of the question under "Solution" and a detailed explanation is in the answer. Let me know if this is not what's causing the problem in your case. – Szabolcs Jun 17 '14 at 02:12
  • @Szabolcs I've spent the night checking that withModifiedMemberQ@ParallelMap does not help, nor does fix@ParallelMap. – Yasha Gindikin Jun 17 '14 at 07:41
  • @Szabolcs I have edited the post to make clear that the custom MemberQ is not of any help here. – Yasha Gindikin Jun 17 '14 at 08:26
  • Having checked a few things, my suspicion is that this is simply one of those cases in which the distribution overhead dominates the calculation time, and so for which one cannot achieve any meaningful performance improvement by running in parallel. Remember that MathLink isn't exactly fast for transferring large amounts of numerical data. But this isn't necessarily a robust conclusion; I didn't find the definitive cause, but only eliminated a few likely ones. – Oleksandr R. Jun 17 '14 at 11:31
  • @OleksandrR. I was thinking about the parallelization overhead, hoping that a real-life example with time-consuming functions instead of Sin[] would reveal the advantages of ParallelMap. However, all I observed was an approximate 4x drop in performance and increase in memory consumption as compared to Map[]. The code of the real-life example I am talking about is here: http://goo.gl/XlheF9 – Yasha Gindikin Jun 17 '14 at 12:37
  • I see... well, (a) parallelized compiled functions do not have to distribute anything and yet they are still parallelized, so ParallelMap is not likely to beat them, and (b) there are certain problems with distributing compiled-from-C CompiledFunctions to parallel kernels anyway; namely, you need to LibraryFunctionLoad the associated LibraryFunction, otherwise it just executes in the VM (sketches of both points appear after this comment thread). In summary, your real scenario has quite significant differences from your test case. Also it seems that HamEEOffDiag makes external calls, which is extremely undesirable, ... – Oleksandr R. Jun 17 '14 at 18:24
  • ... especially since, in the parallel kernels, it will result in distribution of unnecessary extra data, or failure to execute correctly at all. – Oleksandr R. Jun 17 '14 at 18:26
  • @OleksandrR. Thank you for your valuable comments! a) Still ParallelTable, if used instead of ParallelMap, gives a huge performance advantage over the sequential evaluation that relies only on the built-in parallelization of compiled functions. Unfortunately, the ParallelTable memory consumption becomes simply unacceptable as the matrix size grows, probably because of the necessity to distribute all the data across the kernels. So, it is not a real alternative, alas. – Yasha Gindikin Jun 17 '14 at 19:41
  • @OleksandrR. b) Should I perform LibraryFunctionLoad in every kernel when using the compiled-to-C function? I had never heard about that before... And c) I purposely retain the external calls to uncompiled functions because they are capable of memoization, while compiled ones are not. I wish I could do without them but, regrettably, I have no idea how to do that. – Yasha Gindikin Jun 17 '14 at 19:42
  • Sorry, I hadn't noticed that you were memoizing, or that the external calls make use of NIntegrate (I only briefly looked at the bytecode). Now I think it's always going to be rather slow, but that ParallelMap is much slower than ParallelTable is quite interesting. I would have to look at this more closely to find out the reasons for it. You might try explicitly distributing EE and ee, plus the parameters L2 and L3. I suspect that the kernels are only partly evaluating the expression due to undefined symbols, and the (large) result is being finished on the main kernel. – Oleksandr R. Jun 17 '14 at 22:08
  • @OleksandrR. At Lambda=40, ParallelMap takes 456s, ParallelTable — 132s (but look at the memory usage in the system monitor!). Funny, but sequential Map takes 133s, without eating all the memory. Seems to be an ideal solution in my case. Screenshots: http://goo.gl/rjdnp3 http://goo.gl/wFQd6e http://goo.gl/YWuEe2 – Yasha Gindikin Jun 18 '14 at 06:07
  • @OleksandrR. Also, DistributeDefinitions does not seem to have any effect. I assume it might be applied automatically since v8. – Yasha Gindikin Jun 18 '14 at 06:09
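A minimal sketch of point (a) in Oleksandr R.'s comment above (my own illustration, not code from the thread; it assumes machine-precision input): a Listable CompiledFunction with Parallelization -> True threads over the packed array inside a single kernel, so nothing has to be distributed to subkernels.

cSin = Compile[{{x, _Real}}, Sin[x],
   RuntimeAttributes -> {Listable}, Parallelization -> True];
B = cSin[N[A]]; // AbsoluteTiming   (* threads over the packed array in parallel *)
Developer`PackedArrayQ@B            (* the result comes back packed *)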
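And a hedged sketch of point (b) together with the DistributeDefinitions suggestion (again my own illustration: cf is a placeholder, not the real HamEEOffDiag, a working C compiler is assumed, and recompiling on each subkernel is used here as a stand-in for LibraryFunctionLoad-ing the same library): build the compiled-to-C function on every subkernel so its LibraryFunction is valid in that kernel's own process, and distribute any needed parameters explicitly.

If[$KernelCount == 0, LaunchKernels[]];

(* Build the C-compiled function on each subkernel; a copy sent over from the
   main kernel would only run in the virtual machine. *)
ParallelEvaluate[
  cf = Compile[{{x, _Real}}, Sin[x]^2, CompilationTarget -> "C"]];

(* Parameters from the linked real-life code would be distributed explicitly,
   e.g. DistributeDefinitions[EE, ee, L2, L3]. *)

(* DistributedContexts -> None stops ParallelMap from distributing main-kernel
   definitions automatically, so each subkernel uses its own cf. *)
B = ParallelMap[cf, N[A], DistributedContexts -> None]; // AbsoluteTiming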

0 Answers