Consider the following toy problem:
Q = 10^9;
A = Table[RandomInteger[10], {Q}];
Developer`PackedArrayQ@A
B = Map[N[Sin[#]] &, A]; // AbsoluteTiming
Developer`PackedArrayQ@B
MemoryInUse[]
MaxMemoryUsed[]
True
{105.901936, Null}
True
16022521160
24022519456
However, using ParallelMap, even with a modified MemberQ, gives the following:
Q = 10^9;
A = Table[RandomInteger[10], {Q}];
Developer`PackedArrayQ@A
withModifiedMemberQ[expr_] :=
  Module[{doneQ, unmatchable},
   Internal`InheritedBlock[{MemberQ},
    Unprotect[MemberQ];
    (* Can uncomment this if we want to print out the MemberQ calls:
       mq : MemberQ[args___] /; (Print@HoldForm[mq]; True) := mq; *)
    MemberQ[list_, patt_Symbol, args___] /; ! TrueQ[doneQ] :=
     Block[{doneQ = True},
      MemberQ[
       Unevaluated[list] /. _List?Developer`PackedArrayQ -> {unmatchable},
       Unevaluated[patt], args]];
    Protect[MemberQ];
    expr]];
SetAttributes[withModifiedMemberQ, HoldAllComplete];
B = withModifiedMemberQ@
ParallelMap[N[Sin[#]] &, A]; // AbsoluteTiming
Developer`PackedArrayQ@B
MemoryInUse[]
MaxMemoryUsed[]
True
{533.782398, Null}
True
24027869336
48030873944
We see a roughly 5x slowdown (about 534 s versus 106 s) and a 2x increase in maximum memory usage (about 48 GB versus 24 GB). Why is this happening? How can it be avoided while keeping the computation parallelized?
Edit: the code of the real-life example can be found here. The same problems are observed there.
ParallelMap will not likely beat them, and (b) there are certain problems with distributing compiled-from-C CompiledFunctions to parallel kernels anyway--namely that you need to LibraryFunctionLoad the associated LibraryFunction, otherwise it just executes in the VM. In summary your real scenario has quite significant differences from your test case. Also it seems that HamEEOffDiag makes external calls, which is extremely undesirable, ... – Oleksandr R. Jun 17 '14 at 18:24
NIntegrate (I only briefly looked at the bytecode). Now I think it's always going to be rather slow, but that ParallelMap is much slower than ParallelTable is quite interesting. I would have to look at this more closely to find out the reasons for it. You might try explicitly distributing EE and ee, plus the parameters L2 and L3. I suspect that the kernels are only partly evaluating the expression due to undefined symbols, and the (large) result is being finished on the main kernel. – Oleksandr R. Jun 17 '14 at 22:08
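The suggestion in the last comment, explicitly distributing EE, ee, L2 and L3 before the parallel call, might look roughly like the sketch below. The real definitions live in the linked code and are not reproduced here, so the bodies of EE and ee, the values of L2 and L3, and the test list data are hypothetical stand-ins chosen only to make the example self-contained. Whether this changes the timing in the real case is not established here; recent versions usually distribute Global` definitions automatically, so it may make no difference.

```mathematica
(* Hypothetical stand-ins: only the names EE, ee, L2, L3 come from the comment;
   the definitions and values below are invented for illustration. *)
L2 = 2.;
L3 = 3.;
ee[x_] := Sin[L2 x];
EE[x_] := ee[x] + L3;

(* Start subkernels only if none are running yet. *)
If[$KernelCount == 0, LaunchKernels[]];

(* Explicitly send the definitions to every subkernel, so that each kernel
   can evaluate EE fully instead of returning a partly evaluated expression. *)
DistributeDefinitions[EE, ee, L2, L3];

(* Hypothetical small test input. *)
data = RandomInteger[10, 10^4];

result = ParallelMap[EE, data];

(* Sanity checks: compare with the serial result and check packing. *)
result === Map[EE, data]
Developer`PackedArrayQ[result]
```

If the two checks at the end behave as hoped on the real functions, the subkernels are doing the full evaluation and the main kernel is not finishing the work afterwards, which is the situation the comment suspects.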