I need to read a huge 35 GB file from disk line by line in C++. Currently I do it the following way:
ifstream infile("myfile.txt");
string line;
while (true) {
    if (!getline(infile, line)) break;
    long linepos = infile.tellg();
    process(line, linepos);
}
But it gives me only about 2 MB/s, even though the file manager copies the same file at around 100 MB/s. I guess that getline() is not buffering correctly. Please propose some sort of buffered line-by-line reading approach.
UPD: process() is not the bottleneck; the code runs at the same speed without it.
You won't get anywhere close to line speed with the standard IO streams. Buffering or not, pretty much ANY parsing will kill your speed by orders of magnitude. I did experiments on datafiles composed of two ints and a double per line (Ivy Bridge chip, SSD):
- Reading with f >> i1 >> i2 >> d is faster than a getline into a string followed by a stringstream parse.
- fscanf gets about 40 MB/s.
- getline with no parsing: 180 MB/s.
- fread: 500-800 MB/s (depending on whether or not the file was cached by the OS).

I/O is not the bottleneck, parsing is. In other words, your process is likely your slow point.
So I wrote a parallel parser. It's composed of tasks (using a TBB pipeline):
- fread large chunks (one such task at a time)
- parse the chunks in parallel (I can have unlimited parsing tasks because my data is unordered anyway)

If yours isn't, this might not be worth it to you. This approach gets me about 100 MB/s on a 4-core Ivy Bridge chip.
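For concreteness, here is a rough sketch of such a two-stage pipeline, assuming oneTBB's parallel_pipeline/filter_mode API (older TBB spells the modes tbb::filter::serial_in_order and tbb::filter::parallel and declares them in <tbb/pipeline.h>). The chunk size, token limit, and the newline-counting parse_chunk() are placeholders rather than code from my benchmark, and a real version also has to handle lines that straddle chunk boundaries:

#include <cstdio>
#include <memory>
#include <vector>
#include <tbb/parallel_pipeline.h>

struct Chunk {
    std::vector<char> data;
};

// Placeholder for the real per-chunk parsing work.
static void parse_chunk(const Chunk& c) {
    size_t lines = 0;
    for (char ch : c.data)
        if (ch == '\n') ++lines;
    (void)lines;
}

void read_in_parallel(const char* path) {
    FILE* f = std::fopen(path, "rb");
    if (!f) return;

    const size_t chunk_size = 8 * 1024 * 1024;  // 8 MB per fread, tune to taste
    const size_t max_live_chunks = 8;           // bounds memory held by the pipeline

    tbb::parallel_pipeline(
        max_live_chunks,
        // Stage 1: serial fread of large chunks (one such task at a time).
        tbb::make_filter<void, std::shared_ptr<Chunk>>(
            tbb::filter_mode::serial_in_order,
            [&](tbb::flow_control& fc) -> std::shared_ptr<Chunk> {
                auto c = std::make_shared<Chunk>();
                c->data.resize(chunk_size);
                size_t n = std::fread(c->data.data(), 1, chunk_size, f);
                if (n == 0) { fc.stop(); return nullptr; }
                c->data.resize(n);
                return c;
            })
        &
        // Stage 2: parse chunks in parallel; order does not matter for unordered data.
        tbb::make_filter<std::shared_ptr<Chunk>, void>(
            tbb::filter_mode::parallel,
            [](std::shared_ptr<Chunk> c) { parse_chunk(*c); }));

    std::fclose(f);
}

The token limit keeps only a bounded number of chunks in flight, so memory use stays around max_live_chunks * chunk_size regardless of file size.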