Is it possible to roll a significantly faster version of sqrt

Mr. Boy picture Mr. Boy · Apr 14, 2010 · Viewed 37.8k times · Source

In an app I'm profiling, I found that in some scenarios this function is able to take over 10% of total execution time.

I've seen discussion over the years of faster sqrt implementations using sneaky floating-point trickery, but I don't know if such things are outdated on modern CPUs.

MSVC++ 2008 compiler is being used, for reference... though I'd assume sqrt is not going to add much overhead though.

See also here for similar discussion on modf function.

EDIT: for reference, this is one widely-used method, but is it actually much quicker? How many cycles is SQRT anyway these days?

Answer

James picture James · Apr 14, 2010

Yes, it is possible even without trickery:

  1. sacrifice accuracy for speed: the sqrt algorithm is iterative, re-implement with fewer iterations.

  2. lookup tables: either just for the start point of the iteration, or combined with interpolation to get you all the way there.

  3. caching: are you always sqrting the same limited set of values? if so, caching can work well. I've found this useful in graphics applications where the same thing is being calculated for lots of shapes the same size, so results can be usefully cached.


Hello from 11 years in the future.

Considering this still gets occasional votes, I thought I'd add a note about performance, which now even more than then is dramatically limited by memory accesses. You absolutely must use a realistic benchmark (ideally, your whole application) when optimising something like this - the memory access patterns of your application will have a dramatic effect on solutions like lookup tables and caches, and just comparing 'cycles' for your optimised version will lead you wildly astray: it is also very difficult to assign program time to individual instructions, and your profiling tool may mislead you here.

  1. On a related note, consider using simd/vectorised instructions for calculating square roots, like _mm512_sqrt_ps or similar, if they suit your use case.

  2. Take a look at section 15.12.3 of intel's optimisation reference manual, which describes approximation methods, with vectorised instructions, which would probably translate pretty well to other architectures too.