How can I speed up line by line reading of an ASCII file? (C++)

Jon picture Jon · Mar 2, 2011 · Viewed 7k times · Source

Here's a bit of code that is a considerable bottleneck after doing some measuring:

//-----------------------------------------------------------------------------
//  Construct dictionary hash set from dictionary file
//-----------------------------------------------------------------------------
void constructDictionary(unordered_set<string> &dict)
{
    ifstream wordListFile;
    wordListFile.open("dictionary.txt");

    std::string word;
    while( wordListFile >> word )
    {
        if( !word.empty() )
        {
            dict.insert(word);
        }
    }

    wordListFile.close();
}

I'm reading in ~200,000 words and this takes about 240 ms on my machine. Is the use of ifstream here efficient? Can I do better? I'm reading about mmap() implementations but I'm not understanding them 100%. The input file is simply text strings with *nix line terminations.

EDIT: Follow-up question for the alternatives being suggested: Would any alternative (minus increasing the stream buffer sizes) imply that I write a parser that examines each character for new-lines? I kind of like the simple syntax of streams, but I can re-write something more nitty-gritty if I have to for speed. Reading the entire file in to memory is a viable option, it's only about 2mb.

EDIT #2: I've found that the slow down for me was due to the set insert, but for those who are still interested in speeding up line by line file IO, please read the answers here AND check out Matthieu M.'s continuation on the topic.

Answer

Cubbi picture Cubbi · Mar 2, 2011

Quick profiling on my system (linux-2.6.37, gcc-4.5.2, compiled with -O3) shows that I/O is not the bottleneck. Whether using fscanf into a char array followed by dict.insert() or operator>> as in your exact code, it takes about the same time (155 - 160 ms to read a 240k word file).

Replacing gcc's std::unordered_set with std::vector<std::string> in your code drops the execution time to 45 ms (fscanf) - 55 ms (operator>>) for me. Try to profile IO and set insertion separately.