How to parse space-separated floats in C++ quickly?

OopsUser picture OopsUser · Jul 4, 2013 · Viewed 19.6k times · Source

I have a file with millions of lines, each line has 3 floats separated by spaces. It takes a lot of time to read the file, so I tried to read them using memory mapped files only to find out that the problem is not with the speed of IO but with the speed of the parsing.

My current parsing is to take the stream (called file) and do the following

float x,y,z;
file >> x >> y >> z;

Someone in Stack Overflow recommended to use Boost.Spirit but I couldn't find any simple tutorial to explain how to use it.

I'm trying to find a simple and efficient way to parse a line that looks like this:

"134.32 3545.87 3425"

I will really appreciate some help. I wanted to use strtok to split it, but I don't know how to convert strings to floats, and I'm not quite sure it's the best way.

I don't mind if the solution will be Boost or not. I don't mind if it won't be the most efficient solution ever, but I'm sure that it is possible to double the speed.

Thanks in advance.

Answer

sehe picture sehe · Jul 5, 2013

UPDATE

Since Spirit X3 is available for testing, I've updated the benchmarks. Meanwhile I've used Nonius to get statistically sound benchmarks.

All charts below are available interactive online

Benchmark CMake project + testdata used is on github: https://github.com/sehe/bench_float_parsing

enter image description here

Summary:

Spirit parsers are fastest. If you can use C++14 consider the experimental version Spirit X3:

enter image description here

The above is measures using memory mapped files. Using IOstreams, it will be slower accross the board,

enter image description here

but not as slow as scanf using C/POSIX FILE* function calls:

enter image description here


What follows is parts from the OLD answer


I implemented the Spirit version, and ran a benchmark comparing to the other suggested answers.

Here's my results, all tests run on the same body of input (515Mb of input.txt). See below for exact specs.


(wall clock time in seconds, average of 2+ runs)

To my own surprise, Boost Spirit turns out to be fastest, and most elegant:

  • handles/reports errors
  • supports +/-Inf and NaN and variable whitespace
  • no problems at all detecting the end of input (as opposed to the other mmap answer)
  • looks nice:

    bool ok = phrase_parse(f,l,               // source iterators
         (double_ > double_ > double_) % eol, // grammar
         blank,                               // skipper
         data);                               // output attribute
    

Note that boost::spirit::istreambuf_iterator was unspeakably much slower (15s+). I hope this helps!

Benchmark details

All parsing done into vector of struct float3 { float x,y,z; }.

Generate input file using

od -f -A none --width=12 /dev/urandom | head -n 11000000

This results in a 515Mb file containing data like

     -2627.0056   -1.967235e-12  -2.2784738e+33
  -1.0664798e-27  -4.6421956e-23   -6.917859e+20
  -1.1080849e+36   2.8909405e-33   1.7888695e-12
  -7.1663235e+33  -1.0840628e+36   1.5343362e-12
  -3.1773715e-17  -6.3655537e-22   -8.797282e+31
    9.781095e+19   1.7378472e-37        63825084
  -1.2139188e+09  -5.2464635e-05  -2.1235992e-38
   3.0109424e+08   5.3939846e+30  -6.6146894e-20

Compile the program using:

g++ -std=c++0x -g -O3 -isystem -march=native test.cpp -o test -lboost_filesystem -lboost_iostreams

Measure wall clock time using

time ./test < input.txt 

Environment:

  • Linux desktop 4.2.0-42-generic #49-Ubuntu SMP x86_64
  • Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
  • 32GiB RAM

Full Code

Full code to the old benchmark is in the edit history of this post, the newest version is on github