5

My Nbody code use a lot of Outer[Subtract, vect1,vect2]. A lot means ten thousands to one hundred thousands times !! This is at the heart of my Nbody code, and I need to speed up this instruction. Each vector contains for example N=10000 values, and it scales like N^2.

By compiling this instruction , I obtain a factor of 4 faster (4 CPU !!). Not bad , but I dont want to buy 128 CPU !.

I dont understand for example why Outer[Times,vect1,vect2] is about 100 times faster than Outer[Subtract, vect1,vect2].It would be great to get this speed-up !

Is there a way to speed up Outer[Subtract, vect1,vect2]? Using C++, or using GPU , or else ??

Thank you for getting some advices.

Rémy Galli
  • 345
  • 1
  • 7
  • Please see this question and the corresponding answers and comments. – Domen Feb 15 '24 at 10:26
  • 1
    This is an XY-problem. You are using an all-pairs algorithm which is inherintly $O(N^2)$. You should read about the Barnes-Hut method for $N$-body problems; it approximates the correct result in $O(N \log(N))$ time and space. – Henrik Schumacher Feb 15 '24 at 13:20
  • Also have a look into the Mathematica function NBodySimulation. – Henrik Schumacher Feb 15 '24 at 13:21
  • 1
    Even if you don't use Barnes-Hut: Using Outer leads to $O(N^2)$ memory consumption. And since the memory hierachy of this size rather slow, you should rather go for a (compiled!) double loop that needs only $O(N)$ memory to store the results (positions, velocities, forces). – Henrik Schumacher Feb 15 '24 at 14:05
  • Btw., you should rather use Outer[Subtract, vect1,vect2,1] (mind the fourth argument). – Henrik Schumacher Feb 15 '24 at 14:10
  • "why Outer[Times,vect1,vect2] is about 100 times faster than Outer[Subtract, vect1,vect2]": For first arguments Times, Plus, and List and packed vectors for vect1 and vect2, the reason is special-case implementations using SIMD and vectorization. For optimal speed, it is also important that you don't overflow CPU caches (N=10000 too big?). (List takes twice as long usually because it constructs an array that has twice as many entries as Times or Plus.) I think that a similar implementation for Subtract should be possible (in, say, C), but that's beyond my skillset to do so. – Goofy Feb 15 '24 at 18:23

3 Answers3

16
x = RandomReal[1., 10000];
y = RandomReal[1., 10000];
a = Outer[Subtract, x, y]; // RepeatedTiming
b = Outer[Plus, x, -y]; // RepeatedTiming 
a == b

{13.667, Null}

{0.270628, Null}

True

Karl
  • 941
  • 1
  • 7
5

In case you want to do that for vector-valued data (because you mentiond N-body problems): Outer seems to be best optimized for scalar outputs. So working to "structure of arrays" format might be useful.

n = 10^4;
P = RandomReal[{-1, 1}, {n, 3}];
Q = RandomReal[{-1, 1}, {n, 3}];

RTimes = Outer[Times, P, Q, 1]; // AbsoluteTiming // First RSubtract = Outer[Subtract, P, Q, 1]; // AbsoluteTiming // First RMinus = Outer[Plus, P, -Q, 1]; // AbsoluteTiming // First

2.14531

15.3862

1.84677

Here the version that I mean. (I am using Karl's very good idea here.)

(*Convert to structure of arrays format.*)
PT = Transpose[P];
QT = Transpose[Q];

(Get result in structure of arrays format.) RT = { Outer[Plus, PT[[1]], -QT[[1]]], Outer[Plus, PT[[2]], -QT[[2]]], Outer[Plus, PT[[3]], -QT[[3]]] }; // AbsoluteTiming // First

Dimensions[RT]

(Convert to array of structures format (just for comparison; don't do that in your code!)) R = Transpose[RT, {3, 1, 2}];

Max[Abs[RMinus - R]]

0.350867

{3, 10000, 10000}

0.

Nonetheless, you should rather use the Barnes-Hut method or the fast multipole method for approximating the interactions instead. There are a bazillion of code bases out there. Most of them will be written in Fortran, C, or C++, though. One cannot code them efficiently in an interpreted language like Mathematica. So I suggest you look for some C++ library and link it with LibraryLink.

Edit

I also compared to a parallelized implementation in C++ and even a blocked version (for avoiding cache misses). Both take about 0.143006 seconds. Single-threaded they do it in 0.325479, which shows that this problem is likely memory bound. (I know from other experiments that two of my 8 CPU cores can fully satiate the memory bandwitdh of my machine.) So I don't think that you can get out much better performance here.

Henrik Schumacher
  • 106,770
  • 7
  • 179
  • 309
2

I can not explain what Outer with Subtract does. But if you do not use Outer you can get a similar performance as with Time by writing:

n = 10^4;
v1 = RandomReal[{-1, 1}, n];
v2 = RandomReal[{-1, 1}, n];

Outer[Times, v1, v2]; // AbsoluteTiming Outer[Subtract, v1, v2]; // AbsoluteTiming (v2 - #) & /@ v1; // AbsoluteTiming

enter image description here

Daniel Huber
  • 51,463
  • 1
  • 23
  • 57