As an extension to moyner's answer, the on-chip sqrt is usually an rsqrt, i.e. a reciprocal square root that computes $a \rightarrow 1/\sqrt{a}$. So if in your code you're only going to use $1/r$ (if you're doing molecular dynamics, you are), you can compute it directly as rsqrt(r2) and save yourself the division. The reason why rsqrt is computed instead of sqrt is that its Newton iteration has no divisions, only additions and multiplications.
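To make the "no divisions" point concrete, here is a short derivation sketch. The symbols $f$, $y_n$ and $x_n$ are my own notation for illustration. Applying Newton's method to $f(y) = y^{-2} - a$, whose positive root is $y = 1/\sqrt{a}$, gives

$$y_{n+1} = y_n - \frac{f(y_n)}{f'(y_n)} = y_n + \frac{y_n^{-2} - a}{2\,y_n^{-3}} = y_n\left(\frac{3}{2} - \frac{a}{2}\,y_n^2\right),$$

which needs only multiplications and one subtraction. By comparison, the usual Newton iteration for $\sqrt{a}$ itself, obtained from $f(x) = x^2 - a$, is

$$x_{n+1} = \frac{1}{2}\left(x_n + \frac{a}{x_n}\right),$$

and that requires a division in every step.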
As a side-note, divisions are also computed iteratively and are almost just as slow as rsqrt in hardware. If you're looking for efficiency, you're better off trying to remove superfluous divisions.
Some more modern architectures, such as IBM's POWER, do not provide rsqrt per se, but only an estimate accurate to a few bits, e.g. FRSQRTE. When a user calls rsqrt, this generates an estimate followed by one or two (as many as required) iterations of Newton's or Goldschmidt's algorithm using regular multiplications and additions. The advantage of this approach is that the iteration steps may be pipelined and interleaved with other instructions without blocking the FPU (for a very nice overview of this concept, albeit on older architectures, see Rolf Strebel's PhD Thesis).
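The estimate-then-refine pattern can be sketched in portable C. This is only an illustration, not what any particular compiler or library emits. The function names are hypothetical, and the well-known integer bit trick is swapped in as a software stand-in for a hardware estimate instruction such as FRSQRTE. The sketch assumes a positive, normal, single-precision input.

```c
#include <stdint.h>
#include <string.h>

/* Crude estimate of 1/sqrt(a), standing in for a hardware estimate
   instruction (assumption: a is a positive, normal float).
   Good to roughly 3-4% relative error. */
static float rsqrt_estimate(float a)
{
    uint32_t bits;
    float y;
    memcpy(&bits, &a, sizeof bits);      /* reinterpret the float's bits */
    bits = 0x5f3759dfu - (bits >> 1);    /* classic magic-constant guess */
    memcpy(&y, &bits, sizeof y);
    return y;
}

/* One Newton step for f(y) = 1/y^2 - a:
   multiplications and one subtraction, no division. */
static float rsqrt_newton_step(float a, float y)
{
    return y * (1.5f - 0.5f * a * y * y);
}

/* Estimate followed by `steps` refinement iterations. */
float rsqrt_refined(float a, int steps)
{
    float y = rsqrt_estimate(a);
    for (int i = 0; i < steps; i++)
        y = rsqrt_newton_step(a, y);
    return y;
}
```

Each Newton step roughly doubles the number of correct bits, so a better starting estimate needs fewer steps. Because the steps are ordinary multiply/subtract operations, a compiler or the programmer is free to interleave them with other work.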
For interaction potentials, the sqrt operation can be avoided entirely by using a polynomial interpolant of the potential function, but my own work (implemented in mdcore) in this area shows that, at least on x86-type architectures, the sqrt instruction is fast enough.
Update
Since this answer seems to be getting quite a bit of attention, I would also like to address the second part of your question, i.e. is it really worth it to try to improve/eliminate basic operations such as sqrt?
In the context of Molecular Dynamics simulations, or any particle-based simulation with cutoff-limited interactions, there is a lot to be gained from better algorithms for neighbour finding. If you're using Cell lists, or anything similar, to find neighbours or create a Verlet list, you will be computing a large number of spurious pairwise distances. In the naive case, only 16% of particle pairs inspected will actually be within the cutoff distance of each other. Although no interaction is computed for such pairs, accessing the particle data and computing the spurious pairwise distance carries a large cost.
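The 16% figure can be recovered with a back-of-the-envelope estimate. I am assuming here cubic cells with edge length equal to the cutoff $r_c$, a roughly uniform particle density, and a search over each particle's own cell plus its 26 neighbours. The fraction of inspected pairs that fall within the cutoff is then approximately the volume of the cutoff sphere divided by the volume of the searched block of cells:

$$\frac{\tfrac{4}{3}\pi r_c^3}{27\, r_c^3} = \frac{4\pi}{81} \approx 0.155,$$

i.e. about 16%, so roughly 84% of the pairwise distances computed are spurious.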
My own work in this area (here, here, and here), as well as that of others (e.g. here), shows how these spurious computations can be avoided. These neighbour-finding algorithms even out-perform Verlet lists, as described here.
The point I want to emphasize is that although there may be some improvements to gain from better knowing/exploiting the underlying hardware architecture, there are also potentially larger gains to be had in re-thinking the higher-level algorithms.
rsqrtps and AVX vrsqrtps are also estimates; they get the first 11 to 12 bits correct, and you should refine with a Newton iteration or two if you want more accuracy. These are 5/1 and 7/1 (latency/inverse throughput) instructions on Sandy Bridge (see Intel docs or Agner Fog's instruction tables), which is comparable to multiplication. In contrast, the full-accuracy (v)sqrtps (or double-precision (v)sqrtpd) take 10-43/10-43 (see the instruction tables for details). – Jed Brown May 10 '12 at 13:47