
UPDATE

As suggested by @Roman, I've included all my code here.

I'm using just built-in functions and Compile to speed up my code, but I think it can be better. My code looks like this:

    Needs["NumericalDifferentialEquationAnalysis`"];
    nof = 30;
    a = 1.; b = 1.; (* plate dimensions; example values -- not specified in the original post *)
    gqx = GaussianQuadratureWeights[nof, 0, a];
    gqy = GaussianQuadratureWeights[nof, 0, b];
    xi = gqx[[All, 1]]; yi = gqy[[All, 1]];
    wix = gqx[[All, 2]]; wiy = gqy[[All, 2]];
    nM = 10; nN = 10;
    dim = nM*nN;
    mVec = Range[1, nM];
    nVec = Range[1, nN];
    weigth = Flatten@KroneckerProduct[{wix}, {wiy}];
    D11[x_, y_] = 115.2 - 1.39201 Cos[1.37428 x] - 30.1568 Cos[2.19884 x] - 
       0.0166422 Cos[2.74855 x] + 13.0219 Cos[3.57312 x] - 
       9.85381 Cos[4.39768 x] - 6.94062 Cos[7.14623 x] - 
       3.20871 Cos[8.79536 x] - 1.44146 Sin[1.37428 x] + 
       67.7332 Sin[2.19884 x] + 0.476569 Sin[2.74855 x] - 
       35.7775 Sin[3.57312 x] - 27.0025 Sin[4.39768 x] - 
       5.82387 Sin[7.14623 x] - 0.920082 Sin[8.79536 x];

    mat1 = Flatten@
        Table[(2 π^4)/a^4 D11[x, y], {x, xi}, {y, yi}]; // RepeatedTiming

    mat2 = Compile[{{x1, _Real, 1}, {y1, _Real, 1}, {m1, _Real, 1},
          {n1, _Real, 1}, {p1, _Real, 1}, {q1, _Real, 1},
          {a, _Real}, {b, _Real}, {nof, _Integer}},
         Partition[
          Flatten@Table[
            m^2 p^2 Sin[(m π x)/a] Sin[(p π x)/a] Sin[(n π y)/b] Sin[(q π y)/b],
            {m, m1}, {n, n1}, {p, p1}, {q, q1}, {x, x1}, {y, y1}], nof^2],
         Parallelization -> True,
         RuntimeAttributes -> {Listable}][xi, yi, mVec, nVec, mVec, nVec,
        a, b, nof]; // RepeatedTiming

    mat3 = Compile[{{u, _Real, 1}, {v, _Real, 1}}, u v,
         RuntimeAttributes -> {Listable},
         Parallelization -> True][mat2, mat1]; // RepeatedTiming

    D11Mat = Compile[{{mat1, _Real, 2}, {mat2, _Real, 1}, {dim, _Integer}},
         Partition[mat1.mat2, dim],
         Parallelization -> True,
         RuntimeAttributes -> {Listable}][mat3, weigth, dim]; // RepeatedTiming

    D11Mat = Partition[mat3.weigth, dim]; // RepeatedTiming

Running it, I get the following timings (for mat1, mat2, mat3, the compiled D11Mat, and the plain D11Mat, in that order):

{0.035, Null}

{1.80, Null}

{0.028, Null}

{0.0032, Null}

{0.0027, Null}

It can be seen that mat2 is the bottleneck of my code. As I need to perform this computation 600-1000 times, any small time saving on it will be great.

P.S.: D11[x,y] varies in each loop, so I cannot solve it analytically.

  • @andre314, sorry I'll update it. – Riobaldo Tatarana May 23 '20 at 16:41
  • Can you provide examples for xi and yi? – MassDefect May 23 '20 at 17:07
  • 2
    Generally, if you can, you should try to put all the definitions inside the code block so that people can simply copy the entire code block, paste it into Mathematica, and run it. I can't seem to get your code to run in 1.80 seconds even when I define xi and yi. – MassDefect May 23 '20 at 17:20
  • It was a mistake, @MassDefect. I'm going to update. Can you try to run now? – Riobaldo Tatarana May 23 '20 at 17:48
  • Are you going to be summing over the resulting matrix elements somehow? If yes, then a more direct approach via discrete sine transform may be appropriate. – Roman May 23 '20 at 18:10
  • I am going to multiply by a vector and then take a dot product, @Roman. What direct approach would that be? – Riobaldo Tatarana May 23 '20 at 18:32
  • It works now, thanks. – MassDefect May 23 '20 at 18:41
  • Maybe include the multiplication by a vector and the dot product in your code, so that people can try to optimize the whole process. – Roman May 23 '20 at 18:46
  • I'm going to do that. Thank you for that advice, @Roman. – Riobaldo Tatarana May 23 '20 at 18:49
  • 2
    How come that after this long discussion this code is still not self-contained? What are GaussianQuadratureWeights, a, and b? Maybe you meant two define D11[x_, y_] = instead of D11[x, y] =... – Henrik Schumacher May 23 '20 at 19:38
  • @HenrikSchumacher, GaussianQuadratureWeights are Gaussian points to perform a numerical integration. a and b are dimensions of my problem. Yes, it is D11[x_,y_]. – Riobaldo Tatarana May 23 '20 at 19:51
  • 4
    Once again: Why can't we just copy the code from here and run it in Mathematica? You are seeking for help, right? So it is in your own interest to make it as easy to help you as possible. One cannot speed up code in a reliable way without running it. And concrete dimensions of the problem are important as well as definitions for each other symbol that appears in the code. – Henrik Schumacher May 23 '20 at 19:58
  • A thousand apologies, @HenrikSchumacher. I forgot the most important thing: loading the package with << NumericalDifferentialEquationAnalysis`. So sorry again. I'd really appreciate it if you can still help me. I've updated the Mathematica block. – Riobaldo Tatarana May 23 '20 at 20:04
  • 3
    A stupid and perhaps irrelevant comment: why don't you try to compile with C target? – Dario Rosa May 23 '20 at 20:08
  • 1
    @DarioRosa, I got the half time. Good one! – Riobaldo Tatarana May 23 '20 at 20:44
  • Is D11 always independent of y? – Henrik Schumacher May 23 '20 at 20:53
  • @HenrikSchumacher, no, unfortunately. – Riobaldo Tatarana May 23 '20 at 21:07
  • Using RuntimeOptions -> "Speed", CompilationTarget -> "C" shaves quite some time off mat2, if you compile it first and then test mat2[xi, yi, mVec, nVec, mVec, nVec, a, b, nof]; // RepeatedTiming – Andrew May 29 '20 at 10:34
  • @Andrew, does the ordering of these options matter? When I try what you recommend, my laptop gets stuck. – Riobaldo Tatarana May 29 '20 at 12:28
  • Idk, for CompilationTarget -> "C" to work some C compiler (which WM should recognize) must be installed in the system. – Andrew May 29 '20 at 12:32
  • @Andrew, I have gcc installed in my laptop. This is weird because I've used CompilationTarget -> "C" before and it works. Did you try to run on your machine? – Riobaldo Tatarana May 29 '20 at 12:38
  • Yes, Win 7, Mma 11.3. – Andrew May 29 '20 at 12:43
  • @Andrew, I've restarted my laptop and it worked. Restarting is always the best first attempt to troubleshoot the machine =). Regarding computing, on my laptop, I got 0.11 s to compile + 0.48 s to execute. It seems to be better to compile and run at the same time. – Riobaldo Tatarana May 29 '20 at 13:11

2 Answers


Exploitation of low rank structure

The ordering of summation/dot products is crucial here. As aooiiii pointed out, mat2 has a low-rank tensor product structure. So by changing the order of the summation/dotting operations, we can make sure that this beast is never assembled explicitly. A good rule of thumb is to sum intermediate results as early as possible. This reduces the number of flops and, often more importantly, the amount of memory that has to be shoved around by the machine.

As a simple example, consider the sum over all entries of the outer product of two vectors x = {x1,x2,x3} and y = {y1,y2,y3}: forming the outer product first requires $9 = 3 \times 3$ multiplications, and summing all its entries requires $8 = 3 \times 3 - 1$ additions.

 Total[KroneckerProduct[x, y], 2]

x1 y1 + x2 y1 + x3 y1 + x1 y2 + x2 y2 + x3 y2 + x1 y3 + x2 y3 + x3 y3

However, summing the vectors first and then multiplying requires only $4 = 2 \times (3-1)$ additions and a single multiplication:

 Total[x] Total[y]

(x1 + x2 + x3) (y1 + y2 + y3)

For vectors of length $n$, this would be $2 n^2 -1$ floating point operations in the first case vs. $2 (n -1) +1$ in the second case. Moreover, the intermediate matrix requires $n^2$ additional units of memory while storing $x$ and $y$ can be done with only $2 n$ units of memory.
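To see this effect in practice, one can time both variants on random vectors. The names nn, xx, yy below are only for this illustration and do not appear elsewhere in the code:

(* arbitrary example size for the illustration *)
nn = 2000;
xx = RandomReal[{-1, 1}, nn];
yy = RandomReal[{-1, 1}, nn];

(* outer product first, then sum all nn^2 entries *)
First@RepeatedTiming[Total[KroneckerProduct[xx, yy], 2]]

(* sum each vector first, then a single multiplication *)
First@RepeatedTiming[Total[xx] Total[yy]]

On typical hardware the second variant is faster by a few orders of magnitude and allocates no intermediate matrix.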

Side note: In the "old days" before FMA (fused multiply-add) instructions took over, CPUs had separate circuits for addition and multiplication. On such machines, multiplication was more expensive than addition, which makes this optimization particularly striking. (My current computer, a Haswell from 2014, still has a pure addition circuit, so those days are not that long ago...)

Code

Further speed-up can be obtained by using packed arrays throughout and by replacing all occurrences of Table in high-level code either by vectorized operations or compiled code.

This part of the code needs to be executed only once:

Needs["NumericalDifferentialEquationAnalysis`"];
nof = 30;
a = 1.;
b = 1.;
{xi, wix} = Transpose[Developer`ToPackedArray[GaussianQuadratureWeights[nof, 0, a]]];
{yi, wiy} = Transpose[Developer`ToPackedArray[GaussianQuadratureWeights[nof, 0, b]]];
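As a quick sanity check for the packed-array remark above, one can verify that the four quadrature arrays really end up packed:

(* optional sanity check; each entry should return True *)
Developer`PackedArrayQ /@ {xi, wix, yi, wiy}

If any of these came back False, the vectorized operations below would silently fall back to slower, unpacked code paths.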

First@RepeatedTiming[
  Module[{m = N[mVec], n = N[nVec], u, v},
    u = Sin[KroneckerProduct[xi, m (N[Pi]/a)]].DiagonalMatrix[SparseArray[m^2]];
    v = Sin[KroneckerProduct[yi, n (N[Pi]/b)]];
    U = Transpose[MapThread[KroneckerProduct, {u, wix u}], {3, 1, 2}];
    V = MapThread[KroneckerProduct, {wiy v, v}];
    ];
  ]

0.000164

This part of the code has to be evaluated whenever D11 changes:

First@RepeatedTiming[

  cf = Block[{i},
    With[{code = D11[x,y] /. y -> Compile`GetElement[Y, i]},
     Compile[{{x, _Real}, {Y, _Real, 1}},
      Table[code, {i, 1, Length[Y]}],
      RuntimeAttributes -> {Listable},
      Parallelization -> True,
      RuntimeOptions -> "Speed"
      ]
     ]
    ];

  result = ArrayReshape[
    Transpose[
     Dot[U, (2. π^4/a^4 ) cf[xi, yi], V],
     {1, 3, 2, 4}
     ],
    {dim, dim}
    ];

  ]

0.00065

On my system, roughly 40% of this timing is due to the compilation of cf. Notice that the first argument of cf is a scalar, so inserting a vector (or any other rectangular array) as in cf[xi, yi] will call cf in a threadable way (using OpenMP parallelization, IIRC). This is the sole purpose of the option Parallelization -> True; it does nothing without RuntimeAttributes -> {Listable}, or if cf is not called in such a threadable way. From what the OP told me, it became clear that the function D11 changes frequently, so cf has to be compiled quite often. This is why compiling to C is not a good idea here (the C compiler needs much more time).
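To illustrate what "threadable" means here: feeding the vector xi into the scalar slot of cf maps the call over xi (in parallel), so it agrees with an explicit sequential Table. This is just a small consistency check, not part of the timed code; xval is a throwaway iterator name:

(* cf[xi, yi] threads over xi via the Listable attribute; the Table does the same sequentially *)
Max[Abs[cf[xi, yi] - Table[cf[xval, yi], {xval, xi}]]]

This should return 0.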

Finally, checking the relative error of result:

Max[Abs[D11Mat - result]]/Max[Abs[D11Mat]]

4.95633*10^-16

Explanation attempt

Well, the code might look mysterious, so I'll try to explain how I wrote it. Maybe that will help the OP or others the next time they stumble into a similar problem.

The main problem here is this beast, which is the Flattening of a tensor of rank $6$:

W = Flatten@ Table[
 m^2 p^2 Sin[(m π x)/a] Sin[(p π x)/ a] Sin[(n π y)/b] Sin[(q π y)/b],
 {m, mVec}, {n, nVec}, {p, mVec}, {q, nVec}, {x, xi}, {y, yi}
 ];

The first step is to observe that the indices m, p, and x "belong together"; likewise we put n, q and y into a group. Now we can write W as an outer product of the following two arrays:

W1 = Table[ 
  m^2 p^2 Sin[(m π x)/a] Sin[(p π x)/a], 
  {m, mVec}, {p, mVec}, {x, xi}
  ];
W2 = Table[
  Sin[(n π y)/b] Sin[(q π y)/b], 
  {n, nVec}, {q, nVec}, {y, yi}
  ];

Check:

Max[Abs[W - Flatten[KroneckerProduct[W1, W2]]]]

2.84217*10^-14

Next observation: Up to transposition, W1 and W2 can also be obtained as lists of outer products (of things that can themselves be constructed by outer products and the Listable attribute of Sin):

u = Sin[KroneckerProduct[xi, m (N[Pi]/a)]].DiagonalMatrix[ SparseArray[m^2]];
v = Sin[KroneckerProduct[yi, n (N[Pi]/b)]];

Max[Abs[Transpose[MapThread[KroneckerProduct, {u, u}], {3, 1, 2}] - W1]]
Max[Abs[Transpose[MapThread[KroneckerProduct, {v, v}], {3, 1, 2}] - W2]]

7.10543*10^-14

8.88178*10^-16

From reverse engineering the OP's code (easier said than done), I knew that the result is a linear combination of W1, W2, wix, wiy, and the following matrix:

A = (2 π^4)/a^4 Outer[D11, xi, yi];

The latter is basically the array mat1, but not flattened out. It was clear that the function D11 was inefficient, so I compiled it (in a threadable way) into the function cf, so that we can also obtain A this way:

A = (2 π^4)/a^4 cf[xi, yi];

Next, I looked at the dimensions of these arrays:

Dimensions[A]
Dimensions[W1]
Dimensions[W2]
Dimensions[wix]
Dimensions[wiy]

{30, 30}

{10, 10, 30}

{10, 10, 30}

{30}

{30}

There were only a few possibilities left to Dot these things together. Bearing in mind that u and wix belong to xi and that v and wiy belong to yi, I guessed this one:

intermediateresult = Dot[
   Transpose[MapThread[KroneckerProduct, {u, u}], {3, 1, 2}],
   DiagonalMatrix[wix],
   A,
   DiagonalMatrix[wiy],
   MapThread[KroneckerProduct, {v, v}]
   ];

I was pretty sure that all the right numbers were already contained in intermediateresult, but probably in the wrong ordering (which can be fixed with Transpose later). To check my guess, I computed the relative error of the flattened and sorted arrays:

(Max[Abs[Sort[Flatten[D11Mat]] - Sort[Flatten[intermediateresult]]]])/Max[Abs[D11Mat]]

3.71724*10^-16

Bingo. Then I checked the dimensions:

Dimensions[intermediateresult]
Dimensions[D11Mat]

{10, 10, 10, 10}

{100, 100}

From the way D11Mat was constructed, I was convinced that up to a transposition, intermediateresult is just an ArrayReshaped version of D11Mat. Being lazy, I just let Mathematica try all permutations:

Table[
  perm -> 
   Max[Abs[ArrayReshape[
       Transpose[intermediateresult, perm], {dim, dim}] - D11Mat]],
  {perm, Permutations[Range[4]]}
  ]

{{1, 2, 3, 4} -> 6.01299*10^7, {1, 2, 4, 3} -> 6.01299*10^7, {1, 3, 2, 4} -> 2.23517*10^-8, ...}
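(If one prefers to pick the best permutation programmatically instead of reading it off the list, the same Table can be fed through MinimalBy; the name permErrors is just for this illustration.)

permErrors = Table[
   perm -> Max[Abs[ArrayReshape[Transpose[intermediateresult, perm], {dim, dim}] - D11Mat]],
   {perm, Permutations[Range[4]]}];
(* the rule with the smallest error *)
First@MinimalBy[permErrors, Last]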

Then I just picked the one with the smallest error (which was {1,3,2,4}). So our result can be constructed like this:

result = ArrayReshape[
   Transpose[
    Dot[
     Transpose[MapThread[KroneckerProduct, {u, u}], {3, 1, 2}],
     DiagonalMatrix[wix],
     A,
     DiagonalMatrix[wiy],
     MapThread[KroneckerProduct, {v, v}]
     ],
    {1, 3, 2, 4}
    ],
   {dim, dim}];

Of course, one should confirm this by a couple of randomized tests before one proceeds.
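Such a randomized test could look like the following sketch. D11test, cfTest, and the other *Test names are hypothetical stand-ins introduced only here; mat2, weigth, dim, a, xi, yi, U, and V are assumed to be defined as above:

(* Draw a random D11-like coefficient function (stand-in for a new D11). *)
D11test[x_, y_] = 100. + RandomReal[{-1, 1}] Cos[RandomReal[{1, 5}] x] +
   RandomReal[{-1, 1}] Sin[RandomReal[{1, 5}] y];

(* Slow path: the original code from the question with D11 replaced by D11test.
   Multiplying weigth into mat1test first is equivalent to the mat3 step. *)
mat1test = Flatten@Table[(2 π^4)/a^4 D11test[x, y], {x, xi}, {y, yi}];
D11MatTest = Partition[mat2.(mat1test weigth), dim];

(* Fast path: the code from above with D11 replaced by D11test. *)
cfTest = Block[{i},
   With[{code = D11test[x, y] /. y -> Compile`GetElement[Y, i]},
    Compile[{{x, _Real}, {Y, _Real, 1}}, Table[code, {i, 1, Length[Y]}],
     RuntimeAttributes -> {Listable}, Parallelization -> True,
     RuntimeOptions -> "Speed"]]];
resultTest = ArrayReshape[
   Transpose[Dot[U, (2. π^4/a^4) cfTest[xi, yi], V], {1, 3, 2, 4}], {dim, dim}];

(* Relative error between the two paths; should be at machine-precision level. *)
Max[Abs[D11MatTest - resultTest]]/Max[Abs[D11MatTest]]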

The rest is only a couple of local optimizations. Multiplication with a DiagonalMatrix can usually be replaced by threaded multiplication. Knowing that, I searched for places to stuff the weights wix and wiy and found this possibility:

result = ArrayReshape[
   Transpose[
    Dot[
     Transpose[MapThread[KroneckerProduct, {u, wix u}], {3, 1, 2}],
     A,
     MapThread[KroneckerProduct, {wiy v, v}]
     ],
    {1, 3, 2, 4}
    ],
   {dim, dim}];

Then I realized that the first and third factors of the Dot product can be recycled; this is why I stored them in U and V. Replacing A by (2 π^4)/a^4 cf[xi, yi] then led to the piece of code above.

Addendum

Using MapThread is actually suboptimal and can be improved with a CompiledFunction:

cg = Compile[{{u, _Real, 1}, {w, _Real}},
   Block[{ui},
    Table[
     ui = w Compile`GetElement[u, i];
     Table[ui Compile`GetElement[u, j], {j, 1, Length[u]}]
     , {i, 1, Length[u]}]
    ]
   ,
   CompilationTarget -> "C",
   RuntimeAttributes -> {Listable},
   Parallelization -> True,
   RuntimeOptions -> "Speed"
   ];

And now

v = RandomReal[{-1, 1}, {1000, 10}];
w = RandomReal[{-1, 1}, {1000}];
V = w MapThread[KroneckerProduct, {v, v}]; // RepeatedTiming // First
V2 = cg[v, w]; // RepeatedTiming // First

0.0023

0.00025

But the MapThreads have to be run only once, and they are already very fast for the array sizes in the problem. Moreover, for those sizes, cg is only about twice as fast as MapThread. So there is probably no point in optimizing this further.

Henrik Schumacher
  • 1
    Four orders of magnitude faster it is. I cower in awe. – aooiiii May 23 '20 at 21:59
  • 2
    @aooiiii No need for that. You made the crucial observation that mat2 has this low-rank outer product structure; up to transposition, this leads to the factors U and V. Now one has to know that summation over a low-rank outer product can be sped up significantly by summing before forming the outer product. – Henrik Schumacher May 23 '20 at 22:40
  • 2
    In fact, these days, you don't need to load that package anymore: {xi, wix} = Most[NIntegrate`GaussRuleData[nof, MachinePrecision]]. – J. M.'s missing motivation May 24 '20 at 02:42
  • @Henrik, you are the wizard of performance-tuning. Thank you for the code and explanation. You rock, as always! – Riobaldo Tatarana May 24 '20 at 11:57
  • Thank you for the compliment. =D You're welcome! – Henrik Schumacher May 24 '20 at 12:08
  • @HenrikSchumacher, would you believe that I'm still trying to understand your code even today? Mainly the Transpose parts. =\ – Riobaldo Tatarana May 29 '20 at 12:30
  • 1
    I added a section with the heuristic that I applied to derive the code. It is still a lot about intuition, but at least, I tried to chop things into smaller pieces. And one gets better at this each time one solves such a problem. Maybe it also helps you to see that it is not necessary to fully understand in advance how the code has to be written. An efficient implementation does not just fall from heaven. Maybe that is obvious but I met many people with that expectation. And it was this expectation that made them think to be bad at programming and thus unhappy. – Henrik Schumacher May 29 '20 at 17:46
  • 2
    Instead, developing code is quite an organic process (for me at least). Often it is already great to have an implementation that does the right thing, no matter how slow it is, because that enables one to just test candidates for more efficient implementations. And hopefully you see in my post: that is just what I did: Educated guessing and thorough testing. – Henrik Schumacher May 29 '20 at 17:46
  • 2
    @Magela, as Henrik notes, it's often useful just to have something that works at first, and then worry about optimizing it later (i.e. incremental development). And as I like to remind, Mathematica is often high-level enough that in most cases, you can quickly translate formulae from a textbook/paper into something that works. – J. M.'s missing motivation May 30 '20 at 13:23

I managed to achieve a 20-fold performance boost with the following ideas. First, the elements of your 6-dimensional intermediate array A[m, n, p, q, x, y] can be decomposed into pairwise products of X[m, p, x] and Y[n, q, y] - a square root reduction in trigonometric computations. Then, X and Y can be combined into A via heavily optimized functions Outer and Transpose.

cf = Compile[{{x1, _Real, 1}, {y1, _Real, 1}, {m1, _Real, 
    1}, {n1, _Real, 1}, {p1, _Real, 1}, {q1, _Real, 
    1}, {a, _Real}, {b, _Real}, {nof, _Integer}},
  Module[{X, Y},
   X = Table[
     m^2 p^2 Sin[(m \[Pi] x)/a] Sin[(p \[Pi] x)/a],
     {m, m1}, {p, p1}, {x, x1}];
   Y = Table[
     Sin[(n \[Pi] y)/b] Sin[(q \[Pi] y)/b],
     {n, n1}, {q, q1}, {y, y1}];
   Partition[#, nof^2] &@
    Flatten@Transpose[Outer[Times, X, Y], {1, 3, 5, 2, 4, 6}]
   ]
  ]

cf[xi, yi, mVec, nVec, mVec, nVec, a, b, nof]; // RepeatedTiming

That said, I expect @Roman's DST-based approach to be orders of magnitude faster.

aooiiii