
I have to do a lot of calculations that take a lot of time, and using a Do loop simply takes too long. It is the first time I am using ParallelDo, and it is not working as fast as it should. I am sure it is an easy fix, likely due to how I define the variables (shared, not shared... I cannot understand how to do it right).

My code is the following:

(*Parameter to set the size of the computation; in the final version it will be 15*)
Dim = 8;
(*Matrices considered*)
eHe = RandomReal[{-1, 1}, {2^Dim, 2^Dim}];
eHe = eHe + ConjugateTranspose[eHe];
(*Some parameters*)
Nst = Log2[Length[eHe]]
SysDim = N[2^Nst];

(*Matrices for the scalar product*)
PauliString = {"Id", "X", "Y", "Z"};
S0 = SparseArray[{{1, 1} -> N[1], {2, 2} -> N[1]}];
S1 = SparseArray[{{1, 2} -> N[1], {2, 1} -> N[1]}];
S2 = SparseArray[{{1, 2} -> N[-I], {2, 1} -> N[I]}];
S3 = SparseArray[{{1, 1} -> N[1], {2, 2} -> N[-1]}];
SVec = {S0, S1, S2, S3};

(*Parameters for the ParallelDo*)
Soglia = 10^-10;
Hper = eHe;
resHper = {};
SetSharedVariable[resHper];

(*ParallelDo*)
ParallelDo[
 tmp = Tr[ConjugateTranspose[
      Apply[KroneckerProduct, Table[SVec[[el[[j]]]], {j, 1, Nst}]]].Hper]/SysDim;
 If[Abs[tmp] > Soglia,
  AppendTo[resHper,
   Flatten[Append[{tmp}, Table[PauliString[[el[[j]]]], {j, 1, Nst}]]]]],
 {el, Tuples[{1, 2, 3, 4}, Nst]}]

Here, Hper and Apply[KroneckerProduct, Table[SVec[[el[[j]]]], {j, 1, Nst}]] are matrices, and Table[PauliString[[el[[j]]]], {j, 1, Nst}] is a list of strings. The ParallelDo loop actually works, but it is much slower than a normal Do loop...

I have spent a really long time on this and have not understood where the problem is. Any help would be really appreciated. Thanks!

UPDATE: The code above now seems to yield the correct result, but is much slower than expected. I have benchmarked it by substituting Do for ParallelDo, and ParallelDo takes up to 18 times longer than Do. I have checked the CPU, and it is mostly idle when using ParallelDo. I guess it is because Mathematica has trouble with the tmp variable inside the ParallelDo?
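A minimal toy sketch (not the actual computation above) of why the shared variable hurts: every AppendTo on a variable registered with SetSharedVariable is a synchronous round-trip to the main kernel, so the subkernels spend most of their time waiting, while ParallelTable returns the results in bulk.

```mathematica
(* Toy illustration: one main-kernel round-trip per iteration vs. bulk return *)
shared = {};
SetSharedVariable[shared];
AbsoluteTiming[ParallelDo[AppendTo[shared, i^2], {i, 5000}];]  (* slow *)
AbsoluteTiming[res = ParallelTable[i^2, {i, 5000}];]           (* fast *)
Sort[shared] === res  (* True: same results, different collection strategy *)
```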

Eurabio
    tmp is a shared variable but you’re setting it in every kernel, so there’s a big problem. – Roman Dec 06 '22 at 18:22
  • Yes I fully agree with you... The point is that I do not know how to solve this issue, and wasn't able to find any tutorial. Any guess? – Eurabio Dec 06 '22 at 18:26
  • "Help me speed up my code" will probably get you much better answers than "help me parallelize my code". Parallelization of bad code is usually a much less effective strategy than finding a better algorithm. – Roman Dec 06 '22 at 18:27
  • I mean, I have to do a lot of matrix multiplications, and everything is sparse and I guess pretty efficient. I do not think (may be clearly wrong) that, within each loop, I can do anything much faster. But the loop is definitely parallelizable, and as such that is seemingly the best route to take. But if you have any other suggestion to make the code faster, I am clearly happy to hear! – Eurabio Dec 06 '22 at 18:30
    Your code does not run. Please make it self-contained, so that people can experiment and help. – Roman Dec 06 '22 at 18:52
  • Sorry for that @Roman, I have inserted all the code. As a brief update, it now seems to be working but, as you said, it does not yield any advantage compared to the simple Do... Comparing the run times, they are pretty much the same... – Eurabio Dec 06 '22 at 19:16
    Might be a question of coarse vs fine graining. If ParallelDo is using a bad choice by default, possibly an explicit setting will help. – Daniel Lichtblau Dec 06 '22 at 21:24
  • Thanks for the help @DanielLichtblau, but this is seemingly not the problem. Specifically, it takes MUCH longer to run the ParallelDo than it does with the normal Do. Any guess why that is? I have tried changing the graining, not much changed... – Eurabio Dec 06 '22 at 22:16
    See this: https://mathematica.stackexchange.com/a/138911/12 and also this: https://mathematica.stackexchange.com/q/48295/12 The general advice is to use data parallelism. If you cannot formulate your problem in terms of ParalellTable or similar (more generally: ParallelCombine), it is likely just not a good fit for parallelization. – Szabolcs Dec 07 '22 at 09:49
  • Ok this makes total sense... It seems that parallelization is not a good idea in this case, if I want to use Mathematica. Thanks! – Eurabio Dec 08 '22 at 14:28

1 Answer


Part 1. Compare

Tr1[A_,B_]:=Tr[ConjugateTranspose[A].B];
Tr2[A_,B_]:=With[{X=Most[ArrayRules[A]]},Conjugate[X[[;;,2]]].Extract[B,X[[;;,1]]]];

They give the same results, but in OP's situation, at least with Dim=12, the second is much faster:

Dim=12;
SeedRandom[1];
A=SparseArray[MapThread[({#1,#2}->RandomComplex[{-1-I,1+I}])&,
     {Range[2^Dim],Permute[Range[2^Dim],RandomPermutation[2^Dim]]}]];
B=RandomReal[{-1,1},{2^Dim,2^Dim}];

AbsoluteTiming[Tr1[A,B]] (* {0.110279, -4.45864+3.28819 I} *)

AbsoluteTiming[Tr2[A,B]] (* {0.002091, -4.45864+3.28819 I} *)

For Dim=8 there is little difference, but OP mentions that they are interested in larger Dim.
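The reason Tr2 is fast: Tr[ConjugateTranspose[A].B] equals the sum of Conjugate[a_ij] b_ij over all entries, so only the nonzero entries of the sparse matrix A contribute, and ArrayRules lists exactly those. A small self-contained sanity check of this identity (the matrices here are illustrative, not OP's):

```mathematica
(* Check: Tr[A^†.B] = Σ_{i,j} Conjugate[a_ij] b_ij over the nonzeros of A *)
Tr1[A_, B_] := Tr[ConjugateTranspose[A].B];
Tr2[A_, B_] := With[{X = Most[ArrayRules[A]]},
   Conjugate[X[[;; , 2]]].Extract[B, X[[;; , 1]]]];
SeedRandom[2];
A = SparseArray[
   Thread[Transpose[{RandomSample[Range[16], 8], RandomSample[Range[16], 8]}] ->
      RandomComplex[{-1 - I, 1 + I}, 8]], {16, 16}];
B = RandomComplex[{-1 - I, 1 + I}, {16, 16}];
Chop[Tr1[A, B] - Tr2[A, B]]  (* 0 *)
```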


Part 2. Here is a modification of OP's code.

  • Instead of shared variables and ParallelDo, it uses ParallelTable.
  • To collect results locally, it uses Reap-Sow and avoids AppendTo.
  • Coarse graining.
  • Some other changes for convenience.

Please restart the kernel before trying this:

(*Parameter to set the size of the computation; in the final version it will be 15*)
Dim=8;

(*Matrices considered*)
eHe = RandomReal[{-1, 1}, {2^Dim, 2^Dim}];
eHe = eHe + ConjugateTranspose[eHe];

(*Some parameters*)
Nst = Log2[Length[eHe]];
SysDim = N[2^Nst];

(*Matrices for the scalar product*)
PS[1] = "Id"; PS[2] = "X"; PS[3] = "Y"; PS[4] = "Z";
S[1] = SparseArray[{{1, 1} -> N[1], {2, 2} -> N[1]}];
S[2] = SparseArray[{{1, 2} -> N[1], {2, 1} -> N[1]}];
S[3] = SparseArray[{{1, 2} -> N[-I], {2, 1} -> N[I]}];
S[4] = SparseArray[{{1, 1} -> N[1], {2, 2} -> N[-1]}];

(*Parameters for the ParallelDo*)
Soglia = 10^-10;
Hper = eHe;

Tr2[A_,B_]:=With[{X=Most[ArrayRules[A]]},Conjugate[X[[;;,2]]].Extract[B,X[[;;,1]]]];

NstX=3; (* should be ok if num of kernels is a divisor of 4^NstX *)

(*ParallelTable*)
resHper = Join @@ ParallelTable[
    Reap[
      Do[
       With[{el = Join[elX, elY]},
        With[{tmp = Tr2[Apply[KroneckerProduct, Map[S, el]], Hper]/SysDim},
         If[Abs[tmp] > Soglia, Sow[Prepend[Map[PS, el], tmp]]]]],
       {elY, Tuples[{1, 2, 3, 4}, Nst - NstX]}]][[2, 1]],
    {elX, Tuples[{1, 2, 3, 4}, NstX]}];

The result in resHper should agree with OP's result up to reordering. It does give a considerable parallel speed-up when compared to Table.
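The coarse-graining trick relies on every length-Nst tuple splitting uniquely into an elX prefix and an elY suffix, so the outer parallel loop and the inner Do together enumerate the full index set. A small self-contained check of that covering property (with illustrative Nst=4, NstX=2):

```mathematica
(* Each length-4 tuple is Join[elX, elY] for exactly one (elX, elY) pair *)
With[{Nst = 4, NstX = 2},
 Sort[Join @@ Table[Join[elX, elY],
     {elX, Tuples[{1, 2, 3, 4}, NstX]},
     {elY, Tuples[{1, 2, 3, 4}, Nst - NstX]}]] ===
  Sort[Tuples[{1, 2, 3, 4}, Nst]]]  (* True *)
```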

(Final comment: I do not think this is optimal, because doing the full contraction separately for every el duplicates work. In my code, for example, one could contract Hper with just Map[S, elX] first, and only contract with Map[S, elY] in an inner loop, or something like this. I will stop here.)

user293787
  • This is a massive improvement! As a small comment, it seems it is actually advantageous if matrix A is much sparser than matrix B... Works for me, so thanks a lot! – Eurabio Dec 06 '22 at 21:07
  • Fantastic, this works perfectly! I was ready to give up, due to a comment above, but your code works fantastically :)! – Eurabio Dec 08 '22 at 14:30