3

I would like to create a large rank-4 tensor by using both Table and ParallelTable.

What is actually constructed is given below: $\mathcal{L}_{m_1,m_2,V_1,V_2}=i\left(\xi^*_2(m_1,v_1)-\xi^*_2(m_2,v_2)\right)V_d(m_1,v_1)V_d(v_2,m_2)$

where the function $\xi_2(m,v;V)=i\frac{\beta_2^2}{2\beta_1^2}\left[1-\left(e^{\frac{E_{v}-E_m-V/2}{kT}}+1\right)^{-1}\right]\sqrt{4\beta^2_1-(E_{v}-E_m-V/2)^2}$

and matrix elements $V_d(m,v)=\langle m|v\rangle=e^{-\lambda^2/2\Omega^2}\sum_{i=0}^{\nu}\sum_{j=0}^{m}\delta_{m-j,\nu-i}(-1)^j\left(\frac{\lambda}{\Omega}\right)^{i+j}\frac{1}{i!j!}\sqrt{\frac{m!\nu!}{(m-j)!(v-i)!}}$

But I don't know why ParallelTable didn't give any improvement on the performance.

enter image description here See the code below

Clear["Global`*"]
Ns = 41; (*number of basis*)
Nb = 7;
V = 1.84;
\[Lambda] = 0.3;
\[CapitalOmega] = 0.5;
\[Epsilon]0 = 0.5;
\[Beta]1 = 1; \[Beta]2 = 0.05; kT = 0.0259/300*10; (*10K*)
(*Franck-Condon factors*)

FK = E^(-[Lambda]^2/(2 [CapitalOmega]^2)) Table[!( *UnderoverscriptBox[([Sum]), (i = 0), ([Nu])]( *UnderoverscriptBox[([Sum]), (j = 0), (m)]((KroneckerDelta[m - j, [Nu] - i])) *SuperscriptBox[(((-1))), (j)] *SuperscriptBox[(( *FractionBox[([Lambda]), ([CapitalOmega])])), (i + j)] *FractionBox[(1.), ((i!) (j!))] *SqrtBox[ FractionBox[((m!) ([Nu]!)), ((((m - j))!) ((([Nu] - i))!))]])), {m, 0, Ns - 1}, {[Nu], 0, Ns - 1}]; (Eigenenergies of oscillator) Em = Table[[CapitalOmega] (m + 1/2), {m, 0, Ns - 1}]; (Eigenenergies of shifted oscillator) E[Nu] = Table[[CapitalOmega] ([Nu] + 1/ 2) + [Epsilon]0 - [Lambda]^2/[CapitalOmega], {[Nu], 0, Ns - 1}]; (Define elementary functions) cf = Compile[{{x, Real}}, If[x < -300., 1., If[x > 300., 0., 1./(1. + Exp[x])]], CompilationTarget -> "C", RuntimeAttributes -> {Listable}, Parallelization -> True, RuntimeOptions -> "Speed"]; c[CapitalGamma]0 = Compile[{{E, _Real}, {qV, _Real}, {[Beta]1, _Real}, {[Beta]2,
_Real}}, If[ Abs[E - qV/2.] <= 2 [Beta]1, [Beta]2^2/[Beta]1^2 Sqrt[ 4 [Beta]1^2 - (E - qV/2.)^2], 0.], CompilationTarget -> "C", RuntimeAttributes -> {Listable}, Parallelization -> True, RuntimeOptions -> "Speed"]; [Xi]2L[E
, V_] := I/2 (1 - cf[(E - V/2)/kT]) c[CapitalGamma]0[E, V, [Beta]1, [Beta]2]; [Xi]2Lmv = Table[[Xi]2L[E[Nu][[v + 1]] - Em[[m + 1]], V], {m, 0, Ns - 1}, {v, 0, Ns - 1}]; Table[ (Conjugate[[Xi]2Lmv[[m1, V1]]] - [Xi]2Lmv[[m2, V2]]) (FK[[m1, V1]] FK[[m2, V2]]) , {m1, 1, Ns}, {V1, 1, Ns}, {m2, 1, Ns}, {V2, 1, Ns}]; // AbsoluteTiming ParallelTable[ (Conjugate[[Xi]2Lmv[[m1, V1]]] - [Xi]2Lmv[[m2, V2]]) (FK[[m1, V1]] FK[[m2, V2]]) , {m1, 1, Ns}, {V1, 1, Ns}, {m2, 1, Ns}, {V2, 1, Ns}]; // AbsoluteTiming

Alexey Popkov
  • 61,809
  • 7
  • 149
  • 368
Bob Lin
  • 445
  • 2
  • 8
  • 2
    Have you read this post?: https://mathematica.stackexchange.com/q/48295/1871 – xzczd Aug 23 '21 at 11:54
  • Thanks for the useful link. In my case, I assume the ParallelTable can split my data in a correct manner automatically. For a given (m1,m2,v1,v2) the result is unique, shouldn't ParallelTable work effectively in this case? – Bob Lin Aug 23 '21 at 12:19

1 Answers1

13

I don't know why ParallelTable is so slow. But I think that an answer to this would not really help you (things can be like they are for quite stupid reasons...).

What will help you more in the future is this: It is generally not a good idea to use Table or ParallelTable for linear algebra or big data processing. Remember: Mathematica is an interpreted language. That generally means that loop constructs are quite slow. (Imagine the interpreter as a control freak that wants to check each single loop iteration.) Well, there is JIT compilation, but for some reason, it does not seem to be active here.

So it is often better to use built-in commands that utilize compiled back ends (that are vectorized and parallelized).

As an example, here is another way to compute your array that is almost two orders of magnitude faster:

a0 = Table[
     (Conjugate[\[Xi]2Lmv[[m1, V1]]] - \[Xi]2Lmv[[m2, V2]]) (FK[[m1, V1]] FK[[m2, V2]]), 
     {m1, 1, Ns}, {V1, 1, Ns}, {m2, 1, Ns}, {V2, 1, Ns}
     ]; // AbsoluteTiming // First

a1 = Transpose[ ArrayReshape[ Times[ Plus[ KroneckerProduct[Conjugate[[Xi]2Lmv], ConstantArray[1., {Ns, Ns}]], KroneckerProduct[ConstantArray[-1., {Ns, Ns}], [Xi]2Lmv] ], KroneckerProduct[FK, FK] ], {Ns, Ns, Ns, Ns} ], {1, 3, 2, 4}]; // AbsoluteTiming // First

Max[Abs[a0 - a1]]

6.67575

0.08598

0.

Edit 1

Of course, compiled code is still faster since it requires fewer copy operations:

cg = Compile[{{\[Xi]2Lmv, _Complex, 2}, {FK, _Real, 2}},
   Table[ 
    (Conjugate[Compile`GetElement[\[Xi]2Lmv, m1, V1]] - 
       Compile`GetElement[\[Xi]2Lmv, m2, V2]) (Compile`GetElement[FK, 
        m1, V1] Compile`GetElement[FK, m2, V2]),
    {m1, 1, Compile`GetElement[Dimensions[\[Xi]2Lmv], 1]},
    {V1, 1, Compile`GetElement[Dimensions[\[Xi]2Lmv], 2]},
    {m2, 1, Compile`GetElement[Dimensions[\[Xi]2Lmv], 1]},
    {V2, 1, Compile`GetElement[Dimensions[\[Xi]2Lmv], 2]}
    ],
   CompilationTarget -> "C",
   RuntimeOptions -> "Speed"
   ];

a2 = cf[[Xi]2Lmv, FK]; // AbsoluteTiming // First Max[Abs[a0 - a2]]

0.035156

But if you do not have to run this several time, the compilation time won't be amortized.

Edit 2

One of your former (now rolled back) edits showed me that one can exploit the distributive law to save a couple of flops:

a2 = Transpose[
     ArrayReshape[
      Subtract[
       KroneckerProduct[Conjugate[\[Xi]2Lmv] FK, FK],
       KroneckerProduct[FK, FK \[Xi]2Lmv]
       ],
      {Ns, Ns, Ns, Ns}
      ],
     {1, 3, 2, 4}
     ]; // AbsoluteTiming // First
Max[Abs[a0 - a2]]/Max[Abs[a0]]

0.054181

1.86147*10^-16

The timing is already pretty close to the compiled code. Notice that some (negligible) rounding error is introduced here because the distributive law actually does not hold exactly for floating point numbers.

Henrik Schumacher
  • 106,770
  • 7
  • 179
  • 309