Starting with a vector of matrices, tVec, I would like to parallelize the multiplication of each element by the same matrix tMat. However, Parallelize turns out to be far slower than the plain calculation on a single kernel. A minimal working example is below.
Does anybody have hints on how to resolve this and efficiently use all kernels for the evaluation?
dim = 24;
tVec = Table[RandomReal[{0, 1}, {dim, dim}], {i, 1, 100000}];
tMat = RandomReal[{0, 1}, {dim, dim}];
DistributeDefinitions[tMat, tVec];
Timing[(tMat . #) & /@ tVec;]
(*
>>> {1.08807,Null}
*)
Timing[Parallelize[(tMat . #) & /@ tVec];]
(*
>>> {11.6207,Null}
*)
Comments:

Don't use Timing here: it's not accurate for parallel calculations. Use AbsoluteTiming. – Szabolcs May 06 '13 at 13:21

Try Transpose[tMat . Transpose[tVec]] instead. – Simon Woods May 06 '13 at 15:25

Compare AbsoluteTiming[Parallelize[(#) & /@ tVec;]], which is a little worse. On another note, DistributeDefinitions is unnecessary here, as Parallelize already does it. For reasons I don't understand, NOT distributing tVec gives a 15-20% speedup on the second timing on my machine. – Tobias Hagge May 06 '13 at 15:26

Parallelize can't be very intelligent at making decisions about how to distribute work in general. The situation is strongly amplified by the fact that modern CPUs have relatively high interprocessor communication latencies, and those show up if you naively parallelize tasks involving relatively short computations, like the multiplication described above. The longer the individual task and the smaller the data involved, the better, or just stick with the libraries. – kirma May 06 '13 at 18:22
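Following up on Simon Woods' comment: Dot works on arrays of arbitrary rank, so all 100000 multiplications can be folded into a single call to the (already multithreaded) linear-algebra library. A minimal sketch, assuming tVec and tMat as defined above:

(* Transpose[tVec] has dimensions {dim, 100000, dim}; the single Dot
   contracts tMat's second index against its first level, and the outer
   Transpose restores the original slice ordering *)
tFast = Transpose[tMat . Transpose[tVec]];

(* each slice tFast[[i]] equals tMat . tVec[[i]],
   possibly up to floating-point rounding *)
tFast == ((tMat . #) & /@ tVec)

If you do want to keep the parallel kernels, kirma's comment suggests cutting communication costs by handing each kernel one large chunk of tVec rather than many small tasks. A sketch of that idea, untested here, using Parallelize's Method option and AbsoluteTiming as Szabolcs recommends:

AbsoluteTiming[Parallelize[(tMat . #) & /@ tVec, Method -> "CoarsestGrained"];]

Even then, shipping tVec to the kernels and collecting 100000 result matrices back is likely to dominate a computation this cheap, so the single Dot above should remain the faster option.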