In case you want to do that for vector-valued data (because you mentiond N-body problems): Outer seems to be best optimized for scalar outputs. So working to "structure of arrays" format might be useful.
n = 10^4;
P = RandomReal[{-1, 1}, {n, 3}];
Q = RandomReal[{-1, 1}, {n, 3}];
RTimes = Outer[Times, P, Q, 1]; // AbsoluteTiming // First
RSubtract = Outer[Subtract, P, Q, 1]; // AbsoluteTiming // First
RMinus = Outer[Plus, P, -Q, 1]; // AbsoluteTiming // First
2.14531
15.3862
1.84677
Here the version that I mean. (I am using Karl's very good idea here.)
(*Convert to structure of arrays format.*)
PT = Transpose[P];
QT = Transpose[Q];
(Get result in structure of arrays format.)
RT = {
Outer[Plus, PT[[1]], -QT[[1]]],
Outer[Plus, PT[[2]], -QT[[2]]],
Outer[Plus, PT[[3]], -QT[[3]]]
}; // AbsoluteTiming // First
Dimensions[RT]
(Convert to array of structures format (just for comparison; don't do that in your code!))
R = Transpose[RT, {3, 1, 2}];
Max[Abs[RMinus - R]]
0.350867
{3, 10000, 10000}
0.
Nonetheless, you should rather use the Barnes-Hut method or the fast multipole method for approximating the interactions instead. There are a bazillion of code bases out there. Most of them will be written in Fortran, C, or C++, though. One cannot code them efficiently in an interpreted language like Mathematica. So I suggest you look for some C++ library and link it with LibraryLink.
Edit
I also compared to a parallelized implementation in C++ and even a blocked version (for avoiding cache misses). Both take about 0.143006 seconds. Single-threaded they do it in 0.325479, which shows that this problem is likely memory bound. (I know from other experiments that two of my 8 CPU cores can fully satiate the memory bandwitdh of my machine.) So I don't think that you can get out much better performance here.
NBodySimulation. – Henrik Schumacher Feb 15 '24 at 13:21Outerleads to $O(N^2)$ memory consumption. And since the memory hierachy of this size rather slow, you should rather go for a (compiled!) double loop that needs only $O(N)$ memory to store the results (positions, velocities, forces). – Henrik Schumacher Feb 15 '24 at 14:05Outer[Subtract, vect1,vect2,1](mind the fourth argument). – Henrik Schumacher Feb 15 '24 at 14:10Times,Plus, andListand packed vectors forvect1andvect2, the reason is special-case implementations using SIMD and vectorization. For optimal speed, it is also important that you don't overflow CPU caches (N=10000 too big?). (Listtakes twice as long usually because it constructs an array that has twice as many entries asTimesorPlus.) I think that a similar implementation forSubtractshould be possible (in, say, C), but that's beyond my skillset to do so. – Goofy Feb 15 '24 at 18:23