read huge text file line by line in C++ with buffering

Stepan Yakovenko · Jul 20, 2014 · Viewed 11.4k times

I need to read a huge 35 GB file from disk line by line in C++. Currently I do it the following way:

#include <fstream>
#include <string>
using namespace std;

ifstream infile("myfile.txt");
string line;
while (getline(infile, line)) {
    // tellg() reports the position just after the line; use a 64-bit type for a 35 GB file
    long long linepos = infile.tellg();
    process(line, linepos);
}

But this gives me only about 2 MB/s, while the file manager copies the same file at around 100 MB/s. I suspect getline() is not buffering the reads effectively. Please suggest some sort of buffered line-by-line reading approach.

UPD: process() is not the bottleneck; the code runs at the same speed without it.
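
One commonly suggested tweak for the stream-based version, sketched here for illustration rather than taken from the post, is to hand the ifstream a much larger buffer before opening the file. Whether it helps at all is implementation-dependent, and as the answer below argues, the stream machinery itself may remain the limiting factor. The 1 MB buffer size is an assumption:

#include <fstream>
#include <string>
#include <vector>

// Sketch only: give the ifstream a larger buffer via pubsetbuf.
// On common implementations this must happen before open() to take effect.
std::vector<char> iobuf(1 << 20);          // 1 MB buffer; size is an assumption
std::ifstream infile;
infile.rdbuf()->pubsetbuf(iobuf.data(), iobuf.size());
infile.open("myfile.txt");

std::string line;
while (std::getline(infile, line)) {
    // process(line, infile.tellg()) as before
}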

Answer

Adam · Jul 20, 2014

You won't get anywhere close to raw I/O speed with the standard IO streams. Buffering or not, pretty much ANY parsing will kill your speed by orders of magnitude. I ran experiments on data files composed of two ints and a double per line (Ivy Bridge chip, SSD):

  • IO streams in various combinations: ~10 MB/s. Pure parsing (f >> i1 >> i2 >> d) is faster than a getline into a string followed by a stringstream parse.
  • C file operations like fscanf get about 40 MB/s.
  • getline with no parsing: 180 MB/s.
  • fread: 500-800 MB/s (depending on whether or not the file was cached by the OS).
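
To illustrate what the fread figure measures, here is a minimal single-threaded sketch (not from the original answer) that pulls large blocks and only counts newlines, with no per-line parsing; the 16 MB block size is an assumption, not a measured optimum:

#include <cstdio>
#include <cstring>
#include <vector>

// Read the file in big blocks and count newlines; no per-line parsing at all.
size_t count_lines(const char* path) {
    std::vector<char> buf(16 * 1024 * 1024);   // 16 MB block size (assumption)
    FILE* f = std::fopen(path, "rb");
    if (!f) return 0;
    size_t lines = 0, n;
    while ((n = std::fread(buf.data(), 1, buf.size(), f)) > 0) {
        const char* p = buf.data();
        const char* end = p + n;
        while ((p = static_cast<const char*>(std::memchr(p, '\n', end - p))) != nullptr) {
            ++lines;
            ++p;
        }
    }
    std::fclose(f);
    return lines;
}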

I/O is not the bottleneck; parsing is. In other words, whatever per-line work happens (the stream parsing plus your process()) is likely the slow point.

So I wrote a parallel parser. It's composed of tasks (using a TBB pipeline):

  1. fread large chunks (one such task at a time)
  2. re-arrange chunks such that a line is not split between chunks (one such task at a time)
  3. parse chunk (many such tasks)

I can have an unlimited number of parsing tasks because my data is unordered anyway. If yours isn't, this might not be worth it to you. This approach gets me about 100 MB/s on a 4-core Ivy Bridge chip.
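
A rough sketch of that three-stage pipeline, assuming oneTBB's tbb::parallel_pipeline API (older TBB spells the filter modes tbb::filter::serial_in_order / tbb::filter::parallel) and a hypothetical parse_line() standing in for the real parsing; the chunk size and token count are assumptions, not tuned values:

#include <tbb/parallel_pipeline.h>
#include <cstdio>
#include <cstring>
#include <memory>
#include <string>

// Hypothetical per-line parser standing in for the real work (e.g. two ints + a double).
void parse_line(const char* begin, const char* end) { /* real parsing of [begin, end) goes here */ }

void parallel_parse(const char* path) {
    const size_t CHUNK = 16 * 1024 * 1024;        // bytes per fread (assumption)
    FILE* f = std::fopen(path, "rb");
    if (!f) return;
    std::string carry;                            // partial line carried over to the next chunk

    tbb::parallel_pipeline(8,                     // max chunks in flight (assumption)
        // Stage 1: fread large chunks (one such task at a time).
        tbb::make_filter<void, std::shared_ptr<std::string>>(
            tbb::filter_mode::serial_in_order,
            [&](tbb::flow_control& fc) -> std::shared_ptr<std::string> {
                auto buf = std::make_shared<std::string>(CHUNK, '\0');
                size_t n = std::fread(&(*buf)[0], 1, CHUNK, f);
                if (n == 0) { fc.stop(); return nullptr; }
                buf->resize(n);
                return buf;
            }) &
        // Stage 2: re-arrange so no line is split between chunks (one such task at a time).
        tbb::make_filter<std::shared_ptr<std::string>, std::shared_ptr<std::string>>(
            tbb::filter_mode::serial_in_order,
            [&](std::shared_ptr<std::string> buf) {
                buf->insert(0, carry);
                size_t last_nl = buf->find_last_of('\n');
                if (last_nl == std::string::npos) {        // no complete line yet
                    carry = *buf;
                    buf->clear();
                } else {
                    carry.assign(*buf, last_nl + 1, std::string::npos);
                    buf->resize(last_nl + 1);
                }
                return buf;
            }) &
        // Stage 3: parse the complete lines of a chunk (many such tasks in parallel).
        tbb::make_filter<std::shared_ptr<std::string>, void>(
            tbb::filter_mode::parallel,
            [](std::shared_ptr<std::string> buf) {
                const char* p = buf->data();
                const char* end = p + buf->size();
                while (p < end) {
                    const char* nl = static_cast<const char*>(std::memchr(p, '\n', end - p));
                    if (!nl) nl = end;
                    parse_line(p, nl);
                    p = nl + 1;
                }
            }));

    // Any final line without a trailing '\n' is still sitting in carry here.
    if (!carry.empty()) parse_line(carry.data(), carry.data() + carry.size());
    std::fclose(f);
}

Stage 2 keeps the partial last line of each chunk in carry and prepends it to the next chunk, which is what keeps a line from being split between two parse tasks.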