What is the computational cost of $\sqrt{x}$ in standard libraries?

Question

One of the major issues that we have to deal with in molecular simulations is the calculation of distance-dependent forces. If we can restrict the force and distance functions to have even powers of the separation distance $r$, then we can just compute the square of the distance $r^2 = {\bf r \cdot r}$ and not have to worry about $r$. If there are odd powers, however, then we need to deal with $r = \sqrt{r^2}$.
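As an illustration of the even-powers case, here is a minimal sketch in C. It assumes a Lennard-Jones 12-6 potential, which is not named above and is only used as a familiar example; the function name and parameters are hypothetical. The quantity $F/r$, which multiplies the separation vector to give the force vector, needs only $r^2$:

```c
#include <assert.h>

/* Illustrative only: Lennard-Jones 12-6 potential (an assumed example),
 *   U(r) = 4*epsilon*((sigma/r)^12 - (sigma/r)^6).
 * Returns F/r, the factor that multiplies the separation vector to give
 * the force vector. Only r^2 is needed, so no square root is taken. */
double lj_force_over_r(double r2, double epsilon, double sigma)
{
    assert(r2 > 0.0);
    double s2 = sigma * sigma / r2;   /* (sigma/r)^2 */
    double s6 = s2 * s2 * s2;         /* (sigma/r)^6 */
    return 24.0 * epsilon * s6 * (2.0 * s6 - 1.0) / r2;
}
```

A potential with an odd power of $r$, such as a Coulomb term, would not reduce this way and would need $r$ or $1/r$ explicitly.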

My question is: how expensive is computing $\sqrt{x}$ as implemented in the libraries of common languages (C/C++, Fortran, Python, etc.)? Is there really much performance to be gained by hand-tuning the code for specific architectures?

score 43 · Accepted Answer · edited Apr 13 '17 at 12:53

As an extension to moyner's answer, the on-chip sqrt is usually an rsqrt, i.e. a reciprocal square root that computes $a \rightarrow 1/\sqrt{a}$. So if your code is only going to use $1/r$ (if you're doing molecular dynamics, it is), you can compute $1/r$ directly as rsqrt(r2) and save yourself the division. The reason why rsqrt is computed instead of sqrt is that its Newton iteration has no divisions, only additions and multiplications.
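To see why, here is a short derivation sketch (the choice of $f$ is the standard textbook one and is an assumption about what the hardware iterates on). Applying Newton's method to $f(y) = y^{-2} - a$, whose positive root is $y = 1/\sqrt{a}$, gives

$$y_{n+1} = y_n - \frac{f(y_n)}{f'(y_n)} = y_n + \frac{\left(y_n^{-2} - a\right) y_n^{3}}{2} = \frac{y_n \left(3 - a\,y_n^{2}\right)}{2},$$

which uses only multiplications and a subtraction (the factor $1/2$ is a multiplication by a constant). Applying it instead to $f(y) = y^{2} - a$, whose positive root is $y = \sqrt{a}$, gives

$$y_{n+1} = \frac{1}{2}\left(y_n + \frac{a}{y_n}\right),$$

which requires a division in every step.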

As a side note, divisions are also computed iteratively and are almost as slow as rsqrt in hardware. If you're looking for efficiency, you're better off trying to remove superfluous divisions.

Some more modern architectures, such as IBM's POWER architectures, do not provide rsqrt per se, but rather an estimate accurate to a few bits, e.g. FRSQRTE. When a user calls rsqrt, this generates an estimate followed by one or two (as many as required) iterations of Newton's or Goldschmidt's algorithm using regular multiplications and additions. The advantage of this approach is that the iteration steps may be pipelined and interleaved with other instructions without blocking the FPU (for a very nice overview of this concept, albeit on older architectures, see Rolf Strebel's PhD Thesis).

For interaction potentials, the sqrt operation can be avoided entirely by using a polynomial interpolant of the potential function, but my own work in this area (implemented in mdcore) shows that, at least on x86-type architectures, the sqrt instruction is fast enough.

Update

Since this answer seems to be getting quite a bit of attention, I would also like to address the second part of your question, i.e. is it really worth it to try to improve/eliminate basic operations such as sqrt?

In the context of Molecular Dynamics simulations, or any particle-based simulation with cutoff-limited interactions, there is a lot to be gained from better algorithms for neighbour finding. If you're using Cell lists, or anything similar, to find neighbours or create a Verlet list, you will be computing a large number of spurious pairwise distances. In the naive case, only 16% of particle pairs inspected will actually be within the cutoff distance of each other. Although no interaction is computed for such pairs, accessing the particle data and computing the spurious pairwise distance carries a large cost.
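One way to arrive at a figure of about 16%, as a back-of-the-envelope sketch, is to assume uniform particle density and cells whose edge length equals the cutoff $r_c$. Each particle is then tested against all particles in its own cell and the 26 surrounding cells, a cube of edge $3 r_c$, while only those inside a sphere of radius $r_c$ actually interact:

$$\frac{V_{\text{sphere}}}{V_{\text{search}}} = \frac{\tfrac{4}{3}\pi r_c^{3}}{(3 r_c)^{3}} = \frac{4\pi}{81} \approx 0.155 .$$

Under these assumptions roughly 84% of the pairwise distances computed are discarded.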

My own work in this area (here, here, and here), as well as that of others (e.g. here), shows how these spurious computations can be avoided. These neighbour-finding algorithms even outperform Verlet lists, as described here.

The point I want to emphasize is that although there may be some improvements to gain from better knowing/exploiting the underlying hardware architecture, there are also potentially larger gains to be had in re-thinking the higher-level algorithms.

SSE rsqrtps and AVX vrsqrtps are also estimates: they get the first 11 to 12 bits correct, and you should refine with a Newton iteration or two if you want more accuracy. These are 5/1 and 7/1 (latency/inverse throughput) instructions on Sandy Bridge (see Intel docs or Agner Fog's instruction tables), which is comparable to multiplication. In contrast, the full-accuracy (v)sqrtps (or double-precision (v)sqrtpd) instructions take 10-43/10-43 (see the instruction tables for details). — Jed Brown, May 10 '12 at 13:47
@JedBrown: Thanks for pointing that out! I had forgotten that SSE and its extensions provide this too. — Pedro, May 10 '12 at 14:05

score 16 · Answer 2 · answered May 10 '12 at 07:55

The square root is implemented in hardware on most processors, that is, there are specific assembly instructions and the performance should be comparable in most languages because it is very hard to muck up the implementation. You will probably never be able to beat the FSQRT instruction, since it was designed by some smart hardware designer.

How it is implemented in hardware may vary, but it is probably some kind of fixed-point iteration, for example Newton-Raphson's method, which performs a specific number of iterations until the required number of digits has been computed. Iterative methods in hardware are in general much slower than other operations, since several cycles have to be completed before the result is ready.

There are also some Streaming SIMD Extensions (SSE) instructions which can be used on the XMM registers for fast vector computations found here. These registers are fairly small, but if you have a known number of coordinates (say, a three-dimensional Cartesian coordinate system) they can be quite a bit faster.

If your language is low-level enough, you could always typecast to a lower precision or use a lower-precision number type for your coordinates. Single precision is often more than good enough and, from what I remember, will be faster when computing square roots since the iterations can be terminated earlier.

It should be easy enough to benchmark different languages: Just write a long series of random numbers to a file, load it using different languages and then time the square roots.

score 0 · Answer 3 · answered Oct 09 '13 at 22:13

There can be performance enhancements, but first one should profile to confirm that computing the reciprocal of the sqrt is the bottleneck (and not, say, loading the positions and saving the forces).

The GROMACS MD project grew out of an idea to exploit the details of the IEEE floating-point format to seed a Newton-Raphson iteration scheme for computing an acceptable approximation to the reciprocal of the square root (see Appendix B.3 of http://www.gromacs.org/Documentation/Manual), but there are no HPC CPUs in use where GROMACS still uses this idea.
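The general idea can be sketched in C as follows. This is not the GROMACS code: the magic constant is the widely published one from the "fast inverse square root" trick, the function names are hypothetical, and a 32-bit IEEE 754 `float` is assumed. The bit manipulation produces a rough seed, and each Newton-Raphson step then roughly doubles the number of correct bits:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Rough seed for 1/sqrt(a) from the IEEE 754 single-precision bit pattern.
 * Shifting the bits right by one approximately halves the exponent and
 * subtracting from the constant approximately negates it. The constant is
 * the commonly quoted one, assumed here for illustration. */
static float rsqrt_seed(float a)
{
    uint32_t bits;
    float y;
    memcpy(&bits, &a, sizeof bits);   /* reinterpret the bits safely */
    bits = 0x5f3759dfu - (bits >> 1);
    memcpy(&y, &bits, sizeof y);
    return y;
}

/* One Newton-Raphson step for f(y) = 1/y^2 - a.
 * Uses multiplications and one subtraction, no division. */
static float rsqrt_refine(float a, float y)
{
    return y * (1.5f - 0.5f * a * y * y);
}

/* Approximate 1/sqrt(a) for a > 0 using the seed plus `steps` refinements. */
float rsqrt_approx(float a, int steps)
{
    assert(a > 0.0f);
    float y = rsqrt_seed(a);
    for (int i = 0; i < steps; ++i)
        y = rsqrt_refine(a, y);
    return y;
}
```

For example, `rsqrt_approx(r2, 2)` would give a low-precision estimate of $1/r$; how many steps are acceptable depends on the accuracy the simulation needs. On hardware with a fast rsqrt or estimate instruction this software seed is unlikely to win, which is consistent with the remark above that it is no longer used on current HPC CPUs.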
