I have a file with millions of lines, each line has 3 floats separated by spaces. It takes a lot of time to read the file, so I tried to read them using memory mapped files only to find out that the problem is not with the speed of IO but with the speed of the parsing.
My current parsing is to take the stream (called file) and do the following
float x,y,z;
file >> x >> y >> z;
Someone in Stack Overflow recommended to use Boost.Spirit but I couldn't find any simple tutorial to explain how to use it.
I'm trying to find a simple and efficient way to parse a line that looks like this:
"134.32 3545.87 3425"
I will really appreciate some help. I wanted to use strtok to split it, but I don't know how to convert strings to floats, and I'm not quite sure it's the best way.
I don't mind if the solution will be Boost or not. I don't mind if it won't be the most efficient solution ever, but I'm sure that it is possible to double the speed.
Thanks in advance.
UPDATE
Since Spirit X3 is available for testing, I've updated the benchmarks. Meanwhile I've used Nonius to get statistically sound benchmarks.
All charts below are available interactive online
Benchmark CMake project + testdata used is on github: https://github.com/sehe/bench_float_parsing
Spirit parsers are fastest. If you can use C++14 consider the experimental version Spirit X3:
The above is measures using memory mapped files. Using IOstreams, it will be slower accross the board,
but not as slow as scanf
using C/POSIX FILE*
function calls:
What follows is parts from the OLD answer
I implemented the Spirit version, and ran a benchmark comparing to the other suggested answers.
Here's my results, all tests run on the same body of input (515Mb of
input.txt
). See below for exact specs.
(wall clock time in seconds, average of 2+ runs)To my own surprise, Boost Spirit turns out to be fastest, and most elegant:
- handles/reports errors
- supports +/-Inf and NaN and variable whitespace
- no problems at all detecting the end of input (as opposed to the other mmap answer)
looks nice:
bool ok = phrase_parse(f,l, // source iterators (double_ > double_ > double_) % eol, // grammar blank, // skipper data); // output attribute
Note that
boost::spirit::istreambuf_iterator
was unspeakably much slower (15s+). I hope this helps!Benchmark details
All parsing done into
vector
ofstruct float3 { float x,y,z; }
.Generate input file using
od -f -A none --width=12 /dev/urandom | head -n 11000000
This results in a 515Mb file containing data like
-2627.0056 -1.967235e-12 -2.2784738e+33 -1.0664798e-27 -4.6421956e-23 -6.917859e+20 -1.1080849e+36 2.8909405e-33 1.7888695e-12 -7.1663235e+33 -1.0840628e+36 1.5343362e-12 -3.1773715e-17 -6.3655537e-22 -8.797282e+31 9.781095e+19 1.7378472e-37 63825084 -1.2139188e+09 -5.2464635e-05 -2.1235992e-38 3.0109424e+08 5.3939846e+30 -6.6146894e-20
Compile the program using:
g++ -std=c++0x -g -O3 -isystem -march=native test.cpp -o test -lboost_filesystem -lboost_iostreams
Measure wall clock time using
time ./test < input.txt
Full code to the old benchmark is in the edit history of this post, the newest version is on github