Performance tuning with memoisation, compilation and vectorization - further possibilities?

Question

In order to tune the performance of noticeably slow code of mine, I had a look at many different excellent posts on this topic here on M.SE, e.g. here to name the most interesting one.

Now I ask myself if I understood everything correctly or (perhaps) if there is even more tuning potential (for now and for future use cases). As the Latin proverb says "Verba docent, exempla trahunt" (= words teach but only examples really help) I am going to both describe what I learned and did according to that and to copy the optimized code here. Note, that sol is the solution of NDSolve with a vector of interpolation functions, W is a list of indices of the form i or {i,j,k} that are paired to be used inside ExpVal (for mathematical reasons {i, {k,l,m}} = {{k,l,m}, i} holds inside ExpVal so we can calculate only one half of the pairings).

I greatly acknowledge hints regarding further optimization potential or criticism regarding the code.

Optimization ideas

Vectorization: Try to use machine numbers as early as possible - I tried to do so by creating a vector of pure numbers inside n0[t] and work with it afterwards. Moreover, I have to pair each entry of h with each entry of h's conjugate so I used Abs for pairings of the same indices in h as well as hAdj.

Memoisation: Trade memory for performance - I tried to do so by including memoisation into the calculation of ExpVal.

Compilation: As the calculation of ExpVal happens extremely often, this part of the code is a major bottleneck. I compiled it to gain further performance.

Scoping constructs: With is faster as it doesn't create new local variables but only replaces given occurences so I used it instead of Module.

Detailed code description

On request, I will describe the code in more detail here. What the code does is to numerically simulate nonequilibrium physics via iterated equations of motion. Assume we have an operator $A(t) = \sum_i h_i(t) W_i$ whose time-dependence is completely mapped to the prefactors $h_i(t)$. Those prefactors are calculated via NDSolve whose solution is saved to sol. The $W_i$ are (quantum mechanical) basis operators consisting of one (i) or three elementary fermion operators ({k,l,m}). The typical BasisSize equals about 5000 different operators $W_i$.

The code has to calculate each possibility to pair two of the $W$ (say $W_i$ and $W_j$) and calculate their expectation value ExpVal. Those results are multiplied by the individual prefactors ($h_i$ and $h_j^*$) and summed up for the final result.

ev = Compile[{{i, _Integer}, {j, _Integer}, {kF, _Real}}, 2*If[i == j, kF/\[Pi], 1/\[Pi]*Sin[kF (i - j)]/(i - j)], RuntimeAttributes -> {Listable}, Parallelization -> True, RuntimeOptions -> "Speed", CompilationTarget -> "C"];

ExpVal[i_, j_, kF_] := ExpVal[i, j, kF] = ev[i, j, kF];

ExpVal[i_, b_List, kF_] := With[{m = b[[1]], n = b[[2]], o = b[[3]]}, 2 ev[i, m, kF] (2 ev[o, n, kF] - If[o == n, 1, 0])];

ExpVal[a_List, b_List, kF_] := With[{i = a[[1]], j = a[[2]], k = a[[3]], m = b[[1]], n = b[[2]], o = b[[3]]}, 8 ev[i, m, kF] (ev[j, k, kF] ev[o, n, kF] + ev[j, n, kF] (If[o == k, 1, 0] - ev[o, k, kF]))- If[j == k, 4 ev[i, m, kF] ev[o, n, kF], 0] - If[n == o, 4 ev[i, m, kF] ev[j, k, kF], 0] + If[j == k && n == o, 2 ev[i, m, kF], 0]];

hi[x_] = h[x] /. sol;

expAbs = Table[ExpVal[W[[i]], W[[i]], kF], {i, BasisSize}];
expPairs = Table[ExpVal[W[[i]], W[[j]], kF], {i, BasisSize - 1}, {j, i + 1, BasisSize}];

n0[t_] := Module[{x = t, h, hAdj, hAbs, hPairs, \[Mu], \[Xi]},
 h = Flatten[hi[x]];
 hAdj = Conjugate[h];
 hAbs = Abs[h]^2;
 hPairs = Table[h[[i]]*hAdj[[j]], {i, BasisSize - 1}, {j, i + 1, BasisSize}];

 \[Mu] = Plus @@ Flatten[hAbs*expAbs];
 \[Xi] = Plus @@ Flatten[hPairs*expPairs];
 Re[.5 (\[Mu] + \[Xi] + Conjugate[\[Xi]])]];

It would help quite a bit if you gave us a tractable example of sol, W and some values for kF and BasisSize. But first question: can you explain why W[[i]] is sometimes one and sometimes three indices? This will prevent W from being a packed array. If this can be modified, ExpVal could probably be made completely vectorized, i.e. we could do ExpVal[W,W,kF]. Finally, can you explain in words what the code is doing? — Marius Ladegård Meyer, Dec 14 '16 at 19:23
@MariusLadegårdMeyer Good remarks. You can assume kF to be $\pi/2$ but it doesn't matter, in principle. The rest of your questions is answered in the initial question under Detailed code description now. The fact that sometimes one and sometimes three indices participate is due to the commutator in the Heisenberg equations of motion and can't be change physically seen (but maybe in the code there is a possibility I don't see). — pbx, Dec 14 '16 at 21:29
You almost never want to use := in anything time critical. Mathematica code can be a rabbit hole of beautiful structure and short operations but nested := will generally be your downfall with anything large. — Ramble, Dec 14 '16 at 22:09
You also seem to still have a scattering of Integers instead of Reals. Its not always the case but 1 instead of 1.0 or Pi/2 instead of Pi/2.0 can be a real problem. — Ramble, Dec 14 '16 at 22:11
Sometimes its better to run the grunt work (things only requiring normal math operations) in something fast like C and manipulate the output using Mathematica. — Ramble, Dec 14 '16 at 22:13
@Ramble I feared somebody would say that ;-) Considering the fact that developing with MMA is much more fun (and lengths faster) than using plain C or C++, I tried to stick to MMA as long as possible. I could make i also a list {i} and check in ev for Lengths in order to vectorize everything. But I am not sure if this would help gaining performance. — pbx, Dec 15 '16 at 08:51
Agree:) Unfortunately I have been caught out too many times going down the rabbit hole of making my code more succinct or beautiful only to find its takes days to evaluate. Now If I have a large dataset where possible I try to prototype in MAA and move quickly to number crunching in something else. — Ramble, Dec 15 '16 at 15:52
The trouble is that even with vectoring you only need one or two code elements that work inefficiently within MMA and most your effort is hosed. You usually don't discover these until late in the process and by that time your optimized MMA code starts looking a lot like C (i.e. simple loops and iteration albeit vectorised). Things like Map and Parallel are black boxes and don't behave predictably when it comes to speed and compile doesn't work with most builtin functions). — Ramble, Dec 15 '16 at 16:00
The key is identifying how big the datasets your problem might be tackling early; as everything is always wonderful with the dummy test set. You also start hitting walls in trying to manipulate or copy results. Then the only answer is cutting everything up into smaller chunks..... You start off with something beautiful and end up slicing and contorting it into something like.... a conventional programming language only slower:) — Ramble, Dec 15 '16 at 16:04
@Ramble +1 Indeed, my code starts more and more to look like rubbish, unfortunately. You convinced me to have a look at Intel MKL and leave MMA behind for this project part. — pbx, Dec 15 '16 at 19:05

Performance tuning with memoisation, compilation and vectorization - further possibilities?

0 Answers0