Avoiding Calls to floor()

REM · Feb 28, 2010 · Viewed 10.8k times

I am working on a piece of code where I need to deal with UVs (2D texture coordinates) that are not necessarily in the 0 to 1 range. For example, sometimes I will get a UV with a u component of 1.2. To handle this, I am implementing wrapping (which gives tiling) by doing the following:

u -= floor(u);
v -= floor(v);

Doing this causes 1.2 to become 0.2 which is the desired result. It also handles negative cases, such as -0.4 becoming 0.6.
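
For reference, the wrapping above can be written as a small helper. This is only a minimal sketch, assuming float coordinates and std::floor from <cmath>; the name wrap01 is hypothetical and not from the original code.

```cpp
#include <cmath>

// Hypothetical helper: wraps a texture coordinate into the 0 to 1 range for tiling.
inline float wrap01(float u)
{
    return u - std::floor(u);   // 1.2 -> roughly 0.2, -0.4 -> roughly 0.6
}
```

Applying it to each component (u = wrap01(u); v = wrap01(v);) gives the same behaviour as the two lines above, up to float rounding.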

However, these calls to floor are rather slow. I have profiled my application using Intel VTune and I am spending a huge number of cycles just doing this floor operation.

Having done some background reading on the issue, I have come up with the following function which is a bit faster but still leaves a lot to be desired (I am still incurring type conversion penalties, etc).

// Truncate toward zero, then step down by one for negative non-integers.
inline int fasterfloor( const float x ) { const int i = (int) x; return i - ( x < i ); }

I have seen a few tricks that are accomplished with inline assembly, but nothing that seems to work exactly correctly or give any significant speed improvement.

Does anyone know any tricks for handling this kind of scenario?

Answer

AshleysBrain · Feb 28, 2010

So you want a really fast float->int conversion? AFAIK int->float conversion is fast, but at least on MSVC++ a float->int conversion invokes a small helper function, _ftol(), which does some complicated stuff to ensure a standards-compliant conversion is done. If you don't need such a strict conversion, you can do some assembly hackery, assuming you're on an x86-compatible CPU.

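Here's a function for a fast float-to-int conversion, using MSVC++ inline assembly syntax (it should give you the right idea anyway). Note that fistp rounds according to the current FPU rounding mode, which is round-to-nearest by default, so on its own this is not a true floor: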

inline int ftoi_fast(float f)
{
    int i;

    __asm
    {
        fld f
        fistp i
    }

    return i;
}

On MSVC++ 64-bit you'll need an external .asm file since the 64 bit compiler rejects inline assembly. That function basically uses the raw x87 FPU instructions for load float (fld) then store float as integer (fistp). (Note of warning: you can change the rounding mode used here by directly tweaking registers on the CPU, but don't do that, you'll break a lot of stuff, including MSVC's implementation of sin and cos!)
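
Since the inline assembly above only builds with 32-bit MSVC, here is a portable stand-in for experimenting with the same "convert using the current rounding mode" behaviour. It is a sketch, not the code from this answer: it swaps in std::lrint from <cmath>, assumes the default round-to-nearest mode and inputs that fit in a long, and the names nearest_int and floor_from_nearest are hypothetical.

```cpp
#include <cmath>

// Converts using the current rounding mode (round-to-nearest-even by default),
// which is also how fld/fistp behaves. Not a floor on its own.
inline long nearest_int(float f)
{
    return std::lrint(f);       // 1.7 -> 2 under the default rounding mode
}

// Hypothetical fix-up: step down by one whenever rounding landed above f.
inline long floor_from_nearest(float f)
{
    const long i = std::lrint(f);
    return i - (f < static_cast<float>(i));
}
```

The fix-up costs one comparison and one subtraction, which leaves the rounding mode untouched instead of switching it to round-down.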

If you can assume SSE support on the CPU (or there's an easy way to make an SSE-supporting codepath) you can also try:

#include <xmmintrin.h>   // SSE1 intrinsics

inline int ftoi_sse1(float f)
{
    return _mm_cvtt_ss2si(_mm_load_ss(&f));     // SSE1 float->int, truncating toward zero
}

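...which is basically the same idea (load float, then convert to integer) but using SSE instructions, which are a bit faster. One difference: cvttss2si always truncates toward zero regardless of the rounding mode, so it matches a plain (int) cast rather than floor().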

One of those should cover the expensive float-to-int case, and any int-to-float conversions should still be cheap. Sorry to be Microsoft-specific here but this is where I've done similar performance work and I got big gains this way. If portability/other compilers are an issue you'll have to look at something else, but these functions compile to maybe two instructions taking <5 clocks, as opposed to a helper function that takes 100+ clocks.