Why ParallelTable is even slower than Table for large tensor construction?

Question

I would like to create a large rank-4 tensor by using both Table and ParallelTable.

What is actually constructed is given below: $\mathcal{L}_{m_1,m_2,V_1,V_2}=i\left(\xi^*_2(m_1,v_1)-\xi^*_2(m_2,v_2)\right)V_d(m_1,v_1)V_d(v_2,m_2)$

where the function $\xi_2(m,v;V)=i\frac{\beta_2^2}{2\beta_1^2}\left[1-\left(e^{\frac{E_{v}-E_m-V/2}{kT}}+1\right)^{-1}\right]\sqrt{4\beta^2_1-(E_{v}-E_m-V/2)^2}$

and matrix elements $V_d(m,v)=\langle m|v\rangle=e^{-\lambda^2/2\Omega^2}\sum_{i=0}^{\nu}\sum_{j=0}^{m}\delta_{m-j,\nu-i}(-1)^j\left(\frac{\lambda}{\Omega}\right)^{i+j}\frac{1}{i!j!}\sqrt{\frac{m!\nu!}{(m-j)!(v-i)!}}$

But I don't know why ParallelTable didn't give any improvement on the performance.

See the code below

Clear["Global`*"]
Ns = 41; (*number of basis*)
Nb = 7;
V = 1.84;
\[Lambda] = 0.3;
\[CapitalOmega] = 0.5;
\[Epsilon]0 = 0.5;
\[Beta]1 = 1; \[Beta]2 = 0.05; kT = 0.0259/300*10; (*10K*)
(*Franck-Condon factors*)
FK = E^(-[Lambda]^2/(2 [CapitalOmega]^2)) Table[!(
*UnderoverscriptBox[([Sum]), (i = 0), ([Nu])](
*UnderoverscriptBox[([Sum]), (j = 
        0), (m)]((KroneckerDelta[m - j, [Nu] - i])) 
*SuperscriptBox[(((-1))), (j)] 
*SuperscriptBox[((
*FractionBox[([Lambda]), ([CapitalOmega])])), (i + j)] 
*FractionBox[(1.), ((i!) (j!))] 
*SqrtBox[
FractionBox[((m!) ([Nu]!)), ((((m - j))!) ((([Nu] - 
             i))!))]])), {m, 0, Ns - 1}, {[Nu], 0, Ns - 1}];
(Eigenenergies of oscillator)
Em = Table[[CapitalOmega] (m + 1/2), {m, 0, Ns - 1}];
(Eigenenergies of shifted oscillator)
E[Nu] = Table[[CapitalOmega] ([Nu] + 1/
       2) + [Epsilon]0 - [Lambda]^2/[CapitalOmega], {[Nu], 0, 
    Ns - 1}];
(Define elementary functions)
cf = Compile[{{x, Real}}, 
   If[x < -300., 1., If[x > 300., 0., 1./(1. + Exp[x])]], 
   CompilationTarget -> "C", RuntimeAttributes -> {Listable}, 
   Parallelization -> True, RuntimeOptions -> "Speed"];
c[CapitalGamma]0 = 
  Compile[{{E, _Real}, {qV, _Real}, {[Beta]1, _Real}, {[Beta]2, 

_Real}}, If[
    Abs[E - qV/2.] <= 
     2 [Beta]1, [Beta]2^2/[Beta]1^2 Sqrt[
      4 [Beta]1^2 - (E - qV/2.)^2], 0.], CompilationTarget -> "C", 
   RuntimeAttributes -> {Listable}, Parallelization -> True, 
   RuntimeOptions -> "Speed"];
[Xi]2L[E, V_] := 
  I/2 (1 - cf[(E - V/2)/kT]) c[CapitalGamma]0[E, 
    V, [Beta]1, [Beta]2];
[Xi]2Lmv = 
Table[[Xi]2L[E[Nu][[v + 1]] - Em[[m + 1]], V], {m, 0, Ns - 1}, {v,
 0, Ns - 1}];
Table[ (Conjugate[[Xi]2Lmv[[m1, 
       V1]]] - [Xi]2Lmv[[m2, V2]]) (FK[[m1, V1]] FK[[m2, V2]])
   , {m1, 1, Ns}, {V1, 1, Ns}, {m2, 1, Ns}, {V2, 1, 
    Ns}]; // AbsoluteTiming
ParallelTable[ (Conjugate[[Xi]2Lmv[[m1, 
       V1]]] - [Xi]2Lmv[[m2, V2]]) (FK[[m1, V1]] FK[[m2, V2]])
   , {m1, 1, Ns}, {V1, 1, Ns}, {m2, 1, Ns}, {V2, 1, 
    Ns}]; // AbsoluteTiming

Have you read this post?: https://mathematica.stackexchange.com/q/48295/1871 — xzczd, Aug 23 '21 at 11:54
Thanks for the useful link. In my case, I assume the ParallelTable can split my data in a correct manner automatically. For a given (m1,m2,v1,v2) the result is unique, shouldn't ParallelTable work effectively in this case? — Bob Lin, Aug 23 '21 at 12:19

Henrik Schumacher · Accepted Answer · 2022-07-07T18:04:46.527

I don't know why ParallelTable is so slow. But I think that an answer to this would not really help you (things can be like they are for quite stupid reasons...).

What will help you more in the future is this: It is generally not a good idea to use Table or ParallelTable for linear algebra or big data processing. Remember: Mathematica is an interpreted language. That generally means that loop constructs are quite slow. (Imagine the interpreter as a control freak that wants to check each single loop iteration.) Well, there is JIT compilation, but for some reason, it does not seem to be active here.

So it is often better to use built-in commands that utilize compiled back ends (that are vectorized and parallelized).

As an example, here is another way to compute your array that is almost two orders of magnitude faster:

a0 = Table[
     (Conjugate[\[Xi]2Lmv[[m1, V1]]] - \[Xi]2Lmv[[m2, V2]]) (FK[[m1, V1]] FK[[m2, V2]]), 
     {m1, 1, Ns}, {V1, 1, Ns}, {m2, 1, Ns}, {V2, 1, Ns}
     ]; // AbsoluteTiming // First
a1 = Transpose[
     ArrayReshape[
      Times[
       Plus[
        KroneckerProduct[Conjugate[[Xi]2Lmv], ConstantArray[1., {Ns, Ns}]],
        KroneckerProduct[ConstantArray[-1., {Ns, Ns}], [Xi]2Lmv]
        ],
       KroneckerProduct[FK, FK]
       ],
      {Ns, Ns, Ns, Ns}
      ],
     {1, 3, 2, 4}]; // AbsoluteTiming // First
Max[Abs[a0 - a1]]

6.67575

0.08598

0.

Edit 1

Of course, compiled code is still faster since it requires fewer copy operations:

cg = Compile[{{\[Xi]2Lmv, _Complex, 2}, {FK, _Real, 2}},
   Table[ 
    (Conjugate[Compile`GetElement[\[Xi]2Lmv, m1, V1]] - 
       Compile`GetElement[\[Xi]2Lmv, m2, V2]) (Compile`GetElement[FK, 
        m1, V1] Compile`GetElement[FK, m2, V2]),
    {m1, 1, Compile`GetElement[Dimensions[\[Xi]2Lmv], 1]},
    {V1, 1, Compile`GetElement[Dimensions[\[Xi]2Lmv], 2]},
    {m2, 1, Compile`GetElement[Dimensions[\[Xi]2Lmv], 1]},
    {V2, 1, Compile`GetElement[Dimensions[\[Xi]2Lmv], 2]}
    ],
   CompilationTarget -> "C",
   RuntimeOptions -> "Speed"
   ];
a2 = cf[[Xi]2Lmv, FK]; // AbsoluteTiming // First
Max[Abs[a0 - a2]]

0.035156

But if you do not have to run this several time, the compilation time won't be amortized.

Edit 2

One of your former (now rolled back) edits showed me that one can exploit the distributive law to save a couple of flops:

a2 = Transpose[
     ArrayReshape[
      Subtract[
       KroneckerProduct[Conjugate[\[Xi]2Lmv] FK, FK],
       KroneckerProduct[FK, FK \[Xi]2Lmv]
       ],
      {Ns, Ns, Ns, Ns}
      ],
     {1, 3, 2, 4}
     ]; // AbsoluteTiming // First
Max[Abs[a0 - a2]]/Max[Abs[a0]]

0.054181

1.86147*10^-16

The timing is already pretty close to the compiled code. Notice that some (negligible) rounding error is introduced here because the distributive law actually does not hold exactly for floating point numbers.

No, but oftentimes easier to read. Subtract is somtimes faster on tensors than - because #1-#2& behaves like Plus[#1,Minus[#2]]&, thus there is one more floating point operation. — Henrik Schumacher, Aug 23 '21 at 17:04
How do you figure out the order ({1,3,2,4} ) in the Transpose? — Bob Lin, Aug 23 '21 at 20:34
That's because of the identity KroneckerProduct[A, B] == Flatten[ Transpose[TensorProduct[A, B], {1, 3, 2, 4}], {{1, 2}, {3, 4}}]for matrices A and B. — Henrik Schumacher, Aug 23 '21 at 21:31
Thanks, I realized actually KroneckerProduct[Conjugate[[Xi]2Lmv] FK, FK]+ KroneckerProduct[FK, FK [Xi]2Lmv] is enough to generate the correct result. — Bob Lin, Aug 24 '21 at 13:38
If you can avoid Transpose and Reshape by an alternative data layout, then the better! — Henrik Schumacher, Aug 24 '21 at 13:40

Why ParallelTable is even slower than Table for large tensor construction?

1 Answers1

Edit 1

Edit 2