
Starting from a vector of matrices, tVec, I would like to parallelize the multiplication of each element by the same matrix tMat. However, Parallelize is far slower than the plain calculation on a single kernel. A minimal working example is below.

Does anybody have hints on why this happens and how to use all kernels efficiently for this evaluation?

dim = 24;
tVec = Table[RandomReal[{0, 1}, {dim, dim}], {i, 1, 100000}]; (* 100000 random dim x dim matrices *)
tMat = RandomReal[{0, 1}, {dim, dim}];

DistributeDefinitions[tMat, tVec];

Timing[(tMat . #) & /@ tVec;]
(*
>>> {1.08807,Null}
*)

Timing[Parallelize[(tMat . #) & /@ tVec;]]
(*
>>> {11.6207,Null}
*)
– jbroedel
  • Related: http://mathematica.stackexchange.com/q/2886/12 The fixes there give a 2x speedup here, but the computation is still slower than the non-parallel one. Also, do not use Timing here: it's not accurate for parallel calculations. Use AbsoluteTiming. – Szabolcs May 06 '13 at 13:21
  • As I understand it, machine precision linear algebra is handled by a highly optimised multi-threaded library which will automatically distribute the work across multiple CPU cores if appropriate. I don't think there's anything to gain by using parallel Mathematica kernels. I get a slight speed-up (20% ish) by writing the code as Transpose[tMat.Transpose[tVec]] – Simon Woods May 06 '13 at 15:25
  • Compare performance of AbsoluteTiming[Parallelize[(#) & /@ tVec;]], which is little worse. On another note, DistributeDefinitions is unnecessary here, as Parallelize already does it. For reasons I don't understand, NOT distributing tVec gives a 15-20% speedup on the second timing on my machine. – Tobias Hagge May 06 '13 at 15:26
  • Sorry, I meant "little better", not "little worse". – Tobias Hagge May 06 '13 at 18:14
  • As Simon Woods stated above, highly optimized multi-threaded libraries are included for many tasks in Mathematica. Mma is a complex system, and Parallelize can't be very intelligent about deciding how to distribute work in general. The situation is strongly amplified by the fact that modern CPUs have relatively high interprocessor communication latencies, and those show up if you naively parallelize tasks involving relatively short computations, like the multiplication described above. The longer the individual task and the smaller the data involved, the better; or just stick with the libraries. – kirma May 06 '13 at 18:22
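
As Szabolcs points out above, Timing reports the CPU time of the controlling kernel only, so it misstates work done in subkernels. A minimal sketch of the corrected measurement, reusing tVec and tMat from the question (the names serial and parallel are just for illustration):

(* AbsoluteTiming measures wall-clock time, which is what matters for parallel runs *)
serial = First@AbsoluteTiming[(tMat . #) & /@ tVec;];
parallel = First@AbsoluteTiming[Parallelize[(tMat . #) & /@ tVec;]];
{serial, parallel}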
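
Simon Woods's reshaping trick replaces the 100000 small Dot calls with a single large one, letting the optimized multi-threaded linear algebra library handle the whole batch. A sketch, with a rough consistency check against the mapped version:

(* tVec is a packed rank-3 array, so one big Dot covers every matrix at once *)
batched = Transpose[tMat . Transpose[tVec]];
(* the difference should be zero, or at worst a few machine epsilons *)
Max[Abs[batched - ((tMat . #) & /@ tVec)]]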
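
Following kirma's remark, each per-element task here is tiny relative to the cost of shipping data between kernels, so fine-grained scheduling only adds overhead. One standard mitigation is to hand each subkernel a single large batch via the Method option of Parallelize; a sketch:

(* "CoarsestGrained" sends one chunk per subkernel instead of many tiny tasks *)
AbsoluteTiming[
 Parallelize[(tMat . #) & /@ tVec, Method -> "CoarsestGrained"];
]

Note that, as Tobias Hagge observes, the explicit DistributeDefinitions call in the question is redundant, since Parallelize distributes the definitions it needs on its own.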
