4

I am calculating a 3-by-3 matrix whose elements are given as follows:

$$ M_{mn} = \frac{1}{N}\sum_{i=1}^N \sum_{j=1}^N (r^i_m - r^j_m)(r^i_n - r^j_n) \tag{1} $$

where $N$ is the total number of particles, and $r^i_m$ denotes the $m$-th component of the $i$-th particle's vector. These sums can be computationally expensive, as $N$ is often around $10000$.

Below is my implementation of the matrix $M$ in Mathematica:

matrixM = Table[(1./npart)*
    Sum[Sum[(Part[vecs, i, m] - Part[vecs, j, m])*(Part[vecs, i, n] - 
         Part[vecs, j, n]), {j, 1, npart}], {i, 1, npart}], {m, 3}, {n, 3}];

Here vecs is the array of all particle vectors (one row per particle, each row being a vector of 3 components).


  • How could I speed up computations of this nature? Is there a bottleneck in my implementation due to the way I generate the matrix using Table, or due to accessing the components stored in a big array with Part? Any advice would be very helpful.
user64494
  • Avoid using N. It is a built-in function. – zhk May 05 '19 at 12:11
  • @zhk Good point. I actually used npart as the variable name in my own notebook; while writing the post I ended up reusing the names from eq. (1). The code shown now uses npart instead of N. –  May 05 '19 at 12:13

3 Answers

12

Using only matrix operations makes this much faster (runtime scales quadratically with npart):

matrixM = With[{diff = Flatten[Outer[Subtract, vecs, vecs, 1], 1]},
  (Transpose[diff].diff)/npart];

For npart=10000 this will take a lot of memory though: the intermediate array diff will be huge.
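As a rough back-of-the-envelope estimate (mine, not part of the original answer; npartBig is just an illustrative name): diff has npart^2 rows of 3 machine reals at 8 bytes each, so for npart = 10000 it occupies roughly 2 GB on its own.

npartBig = 10000;
npartBig^2*3*8/2.^30  (* ≈ 2.2 GiB for the intermediate array diff alone *)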


update

After the discussions on the other solutions, it occurred to me that what's being calculated here is simply the covariance matrix of the vectors in vecs:

matrixM = 2*(npart-1)*Covariance[vecs]

Much much faster, no space issues, no stability issues!
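As a quick sanity check of the covariance identity (my own snippet; npartSmall and vecsSmall are throwaway names, chosen small so the direct double sum stays cheap):

npartSmall = 200;
SeedRandom[1];
vecsSmall = RandomReal[{-1, 1}, {npartSmall, 3}];
direct = Table[(1./npartSmall)*
    Sum[(vecsSmall[[i, m]] - vecsSmall[[j, m]])*(vecsSmall[[i, n]] - vecsSmall[[j, n]]),
     {i, npartSmall}, {j, npartSmall}], {m, 3}, {n, 3}];
Max[Abs[direct - 2*(npartSmall - 1)*Covariance[vecsSmall]]]
(* ≈ 10^-14, i.e. the two forms agree to machine precision *)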

Roman
  • This is incredible Roman! I just compared the absolute timings for both approaches, and yours takes 34 times less time to compute! Would you kindly add a brief explanation for how you spotted that the double sum could be translated into a matrix product? In case such translation is not immediately apparent (where the expression is a bit more involved), do you know what was causing the main slow-down in my implementation? –  May 05 '19 at 13:21
  • @user929304 every sum over consecutive parts can be written as a dot product; it's a matter of patience and experience to figure out how exactly (a small illustration follows this comment thread). As to what slows your implementation down: matrix-matrix dot products are among the most efficient subroutines available on a computer, as they lie at the core of (almost) every algorithm and entire CPU architectures are designed around making them efficient. You cannot compete with that by writing down your own sum. See https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms – Roman May 05 '19 at 13:25
  • Very instructive, thanks a lot. –  May 05 '19 at 13:37
  • What an update! Great observation, I myself wasn't aware I was in fact calculating the covariance matrix! Suddenly the computation has become instantaneous (7000 times speed up for one example I just tried). How can it be performed this rapidly! –  May 05 '19 at 18:29
  • Great answer. Your methods are really beneficial, thank you! Does this apply generally to speeding up double sums? I was glad to see someone asking the question I have struggled to form, but it would not be generally applicable to a double sum of Tables? I have a feeling there are some impactful lessons I can pull from your method here, to perform quicker summation/accessing of tables, but I am struggling to locate it (due to my own inexperience). Can you provide a more expansive explanation? – CA Trevillian May 05 '19 at 21:16
  • @CATrevillian When you want to write fast code, try to avoid doing anything "by hand", like looping, summing, etc., and try to think in terms of higher-order abstractions: vectors instead of elements, matrices instead of lists of vectors, dot-products instead of summations, geometric/unitary/orthogonal transformations instead of matrix multiplications, etc. This is the core of mathematics, and used to be even more important back in the days before computers. – Roman May 06 '19 at 08:33
  • Hah, what you describe is what I think of when you say "by hand". Simplicial complexes really made me rethink how I describe a problem, and I have recently been working towards sitting down to study Euclid's Elements. I think the new GeometricScene function will be able to do some powerful things, once I learn enough in that area. All of that aside, thank you for confirming my suspicions! I really enjoy how much faster our processors are able to operate, when we pose the questions properly enough, and the physical elements that go into the software-hardware relationship. – CA Trevillian May 06 '19 at 08:46
  • @Roman Dear Roman, in case you find the time, your feedback on this post would be invaluable! Thanks a lot in advance. –  Nov 05 '19 at 12:21
  • @user929304 for contractual work you can contact me at Uncompress["1:eJxTTMoPChZnYGAoys9NzNMrTs7IzUxNcSjNy0xKLNZLzgAAnn0Kkg=="]. Cheers! – Roman Nov 05 '19 at 21:31
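A minimal illustration of the sum-to-dot-product rewriting mentioned in the comments above (my own sketch; a and b are throwaway arrays):

SeedRandom[2];
a = RandomReal[1, 100]; b = RandomReal[1, 100];
Sum[a[[i]]*b[[i]], {i, Length[a]}] - a.b  (* ≈ 0 up to round-off: the explicit sum equals the dot product *)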
7

Your double sum can be written in terms of single sums. Expanding the product and summing over $i$ and $j$ gives $\sum_{i,j}(r_m^i-r_m^j)(r_n^i-r_n^j)=2N\sum_i r_m^i r_n^i-2\sum_i r_m^i\sum_j r_n^j$, so dividing by $N$:

$$M_{mn}=2\sum_{i=1}^N r_m^i r_n^i-\frac{2}{N}\sum_{i=1}^N r_m^i \sum_{j=1}^N r_n^j$$

A theoretically equivalent but more numerically stable version of the above (suggested by @Roman) is given by

$$M_{mn}=2\sum_{i=1}^N (r_m^i - \bar{r}_m)(r_n^i-\bar{r}_n)$$

where $\bar{r}_k = \frac{1}{N}\sum_{i=1}^N r_k^i$ is the mean of the $k$-th components.

nparticles = 1000;
SeedRandom[12345];
r = RandomVariate[UniformDistribution[], {nparticles, 3}];

m = 1; n = 2;
AbsoluteTiming[
 Sum[(r[[i, m]] - r[[j, m]]) (r[[i, n]] - r[[j, n]])/nparticles, 
   {i, nparticles}, {j, nparticles}]]
(* {3.4060337777777776`,-2.42222014120762`} *)

AbsoluteTiming[
 2 Total[(r[[All, m]] - Total[r[[All, m]]]/nparticles) (r[[All, n]] - 
      Total[r[[All, n]]]/nparticles)]]
(* {0.00008335802469135802`,-2.4222201412076174`} *)

Using single sums is about 40,000 times faster for 1,000 particles.

JimB
  • I was going to write down the same thing and then decided that I wasn't convinced of the numerical stability of the transformation for large lists (difference of products instead of product of differences). For very large $N$ it may be advisable to first shift the numbers $r_m^i$ such that their averages are zero. Nonetheless I'm still wary of this way of doing the sum. It's fast though, sure! – Roman May 05 '19 at 16:43
  • @Roman The single sum approach can certainly be made more numerically stable. The resulting sum is pretty much the same as estimating a covariance, so using similar "robustifying" (if I can make up such a word) techniques (like standardizing the numbers with their respective means and standard deviations) would certainly be in order. – JimB May 05 '19 at 16:48
  • As a concrete example of my concern, using r = RandomVariate[UniformDistribution[], {nparticles, 3}] + 10^6 already gives the wrong answer with this fast method. You can fix this by first defining r1 = Transpose[Transpose[r] - Mean[r]] and then doing the fast method with r1 instead of r; a runnable version of this example follows this comment thread. – Roman May 05 '19 at 16:48
  • Yes these robustifications are what I meant. In the absence of these, I'd be careful with suggesting such methods, as an unsuspecting user can easily run into trouble with them. – Roman May 05 '19 at 16:50
  • @Roman "Unsuspecting users" have kept me in business for quite a long while. However, your point is well taken. I'll modify the above answer with what you've suggested. – JimB May 05 '19 at 17:02
  • see update on my answer: this robustification means using a built-in Covariance function! – Roman May 05 '19 at 17:34
  • Thank you likewise Jim for the suggested approach and spotting this neat equivalent form of the matrix! –  May 05 '19 at 18:34
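A runnable version of the stability example Roman describes in the comments above (my own sketch; the "uncentered" formula is the first displayed equation of this answer, the "centered" one is the robustified form, and the shift by 10^6 is taken from Roman's comment):

nparticles = 500;
SeedRandom[12345];
r = RandomVariate[UniformDistribution[], {nparticles, 3}] + 10^6;
m = 1; n = 2;
direct = Sum[(r[[i, m]] - r[[j, m]]) (r[[i, n]] - r[[j, n]])/nparticles,
   {i, nparticles}, {j, nparticles}];
uncentered = 2 Total[r[[All, m]] r[[All, n]]] -
   2 Total[r[[All, m]]] Total[r[[All, n]]]/nparticles;
centered = 2 Total[(r[[All, m]] - Mean[r[[All, m]]]) (r[[All, n]] - Mean[r[[All, n]]])];
{direct, uncentered, centered}
(* the uncentered value typically loses most of its significant digits,
   while the centered value agrees with the direct double sum *)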
6

This is a nice application of low-rank factorization by Adaptive Cross Approximation (ACA). We require the first code block from this post, which defines the function ACACompression.

Now we can do this:

npart = 5000;
SeedRandom[1234];
vecs = RandomReal[{-1, 1}, {npart, 3}];

(* Roman's proposal*)
matrixM = With[{diff = Flatten[Outer[Subtract, vecs, vecs, 1], 1]}, 
  (Transpose[diff].diff)/npart
  ]; // AbsoluteTiming // First

r = Transpose[vecs];
matrixMACA = Table[
     Block[{x, y, col, row},
      x = r[[i]];
      y = r[[j]];
      col = row = k \[Function] (x[[k]] - x) (y[[k]] - y)/npart;
      {U, V} = ACACompression[row, col];
      Total[U, {2}].Total[V, {2}]
      ],
     {i, 1, 3}, {j, 1, 3}]; // AbsoluteTiming // First

Max[Abs[matrixMACA - matrixM]]/Max[Abs[matrixM]]

16.8389 (seconds, Roman's proposal)

0.076707 (seconds, ACACompression)

6.0471*10^-15 (relative deviation)

Thus, utilizing ACACompression is about 200 times faster for 5000 points, returning the result correctly up to machine precision. For 100000 uniformly distributed points it needs about 1 second; for 1000000 uniformly distributed points it needs about 20 seconds (probably due to a memory bottleneck).

The efficiency and accuracy depend quite strongly on the distribution of the points, but the point cloud does not have to be particularly uniform for this to work.

When the points are not positioned remarkably badly, ACACompression will terminate with an almost exact low-rank factorization of the matrix that contains the summands of the double sum. For $N$ particles, ACACompression needs roughly $O(N)$ time and $O(N)$ memory for the factorization -- in contrast to Outer and the double Sums themselves, which need at least $O(N^2)$ time. More severely, Outer also needs $O(N^2)$ memory. This is why using Outer becomes infeasible quite quickly for increasing $N$.
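To see why a low-rank factorization collapses such a double sum, here is a generic sketch (my own; uu, vv, aa are throwaway names, independent of the particular factorization ACACompression returns): if a matrix equals uu.Transpose[vv] with uu and vv of dimensions N×k, then the sum of all its entries equals Total[uu].Total[vv], which costs O(N k) instead of O(N^2).

SeedRandom[3];
nn = 1000; kk = 2;
uu = RandomReal[1, {nn, kk}]; vv = RandomReal[1, {nn, kk}];
aa = uu.Transpose[vv];  (* the full N x N rank-k matrix, built here only for checking *)
Total[aa, 2] - Total[uu].Total[vv]  (* ≈ 0 up to round-off *)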

Henrik Schumacher
  • Many thanks Henrik! An excellent complement to the other answers; in particular, learning the ACA method will also be very useful for future similar computations. This whole post has been so incredibly instructive. –  May 05 '19 at 18:36
  • You're welcome! – Henrik Schumacher May 05 '19 at 18:40