Is there a fast fabsf replacement for "float" in C++?

Vojtěch Melda Meluzín · May 5, 2014

I'm doing some benchmarking and found that fabsf() is often about 10x slower than fabs(). I disassembled both, and it turns out the double version uses the fabs instruction while the float version does not. Can this be improved? The following is faster, but not by much, and I'm afraid it may not work because it's a little too low-level:

float mabs(float i)
{
    (*reinterpret_cast<MUINT32*>(&i)) &= 0x7fffffff;
    return i;
}

Edit: Sorry, I forgot to mention the compiler. I still use the good old VS2005, with no special libs.

Answer

rubenvb · May 5, 2014

You can easily test the different possibilities using the code below. It essentially tests your bit fiddling against a naive template abs, std::abs, and ::fabsf. The naive template abs comes out fastest in most of the runs below, which is somewhat surprising: I'd expect std::abs to be equally fast. Note that -O3 actually makes things slower (at least on Coliru).

Coliru's host system shows these timings:

random number generation: 4240 ms
naive template abs: 190 ms
ugly bitfiddling abs: 241 ms
std::abs: 204 ms
::fabsf: 202 ms

And these timings for a Virtualbox VM running Arch with GCC 4.9 on a Core i7:

random number generation: 1453 ms
naive template abs: 73 ms
ugly bitfiddling abs: 97 ms
std::abs: 57 ms
::fabsf: 80 ms

And these timings on MSVS2013 (Windows 7 x64):

random number generation: 671 ms
naive template abs: 59 ms
ugly bitfiddling abs: 129 ms
std::abs: 109 ms
::fabsf: 109 ms

If I haven't made some blatantly obvious mistake in this benchmark code (don't shoot me over it, I wrote this up in about 2 minutes), I'd say just use std::abs, or the template version if that turns out to be slightly faster for you.
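One behavioural difference is worth knowing before picking the template version. For an input of +0.0f the comparison t > 0 is false, so the naive template returns -0.0f, whereas std::abs returns +0.0f. Below is a minimal sketch that checks this with std::signbit. The two helper names are hypothetical and exist only for this illustration:

```cpp
#include <cmath>

// Same naive template as in the benchmark code.
template<typename T>
T abs_template(T t)
{
  return t > 0 ? t : -t;
}

// Hypothetical helpers (illustration only): report whether the result of each
// abs variant still has its sign bit set.
inline bool template_abs_has_sign_bit(float f)
{
  return std::signbit(abs_template(f));
}

inline bool std_abs_has_sign_bit(float f)
{
  return std::signbit(std::abs(f));
}
```

Whether a negative zero matters depends on what you do with the result. It compares equal to +0.0f, but it prints as -0 and flips the sign of 1/x.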


The code:

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <chrono>
#include <iostream>
#include <limits>
#include <random>
#include <vector>

#include <math.h>

using Clock = std::chrono::high_resolution_clock;
using milliseconds = std::chrono::milliseconds;

template<typename T>
T abs_template(T t)
{
  return t>0 ? t : -t;
}

float abs_ugly(float f)
{
  (*reinterpret_cast<std::uint32_t*>(&f)) &= 0x7fffffff;
  return f;
}

int main()
{
  std::random_device rd;
  std::mt19937 mersenne(rd());
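  // Cover the whole float range. Note that -lowest() equals max(), which would
  // collapse the range to a single value. The distribution works in double so
  // that max() - lowest() does not overflow.
  std::uniform_real_distribution<double> dist(std::numeric_limits<float>::lowest(), std::numeric_limits<float>::max());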

  std::vector<float> v(100000000);

  Clock::time_point t0 = Clock::now();

  std::generate(std::begin(v), std::end(v), [&dist, &mersenne]() { return dist(mersenne); });

  Clock::time_point trand = Clock::now();

  volatile float temp;
  for (float f : v)
    temp = abs_template(f);

  Clock::time_point ttemplate = Clock::now();

  for (float f : v)
    temp = abs_ugly(f);

  Clock::time_point tugly = Clock::now();

  for (float f : v)
    temp = std::abs(f);

  Clock::time_point tstd = Clock::now();

  for (float f : v)
    temp = ::fabsf(f);

  Clock::time_point tfabsf = Clock::now();

  milliseconds random_time = std::chrono::duration_cast<milliseconds>(trand - t0);
  milliseconds template_time = std::chrono::duration_cast<milliseconds>(ttemplate - trand);
  milliseconds ugly_time = std::chrono::duration_cast<milliseconds>(tugly - ttemplate);
  milliseconds std_time = std::chrono::duration_cast<milliseconds>(tstd - tugly);
  milliseconds c_time = std::chrono::duration_cast<milliseconds>(tfabsf - tstd);
  std::cout << "random number generation: " << random_time.count() << " ms\n"
    << "naive template abs: " << template_time.count() << " ms\n"
    << "ugly bitfiddling abs: " << ugly_time.count() << " ms\n"
    << "std::abs: " << std_time.count() << " ms\n"
    << "::fabsf: " << c_time.count() << " ms\n";
}

Oh, and to answer your actual question: if the compiler can't generate more efficient code, I doubt there is a faster way save for micro-optimized assembly, especially for elementary operations such as this.