Getting std :: ifstream to handle LF, CR, and CRLF?

Aaron McDaid picture Aaron McDaid · May 22, 2011 · Viewed 58k times · Source

Specifically I'm interested in istream& getline ( istream& is, string& str );. Is there an option to the ifstream constructor to tell it to convert all newline encodings to '\n' under the hood? I want to be able to call getline and have it gracefully handle all line endings.

Update: To clarify, I want to be able to write code that compiles almost anywhere, and will take input from almost anywhere. Including the rare files that have '\r' without '\n'. Minimizing inconvenience for any users of the software.

It's easy to workaround the issue, but I'm still curious as to the right way, in the standard, to flexibly handle all text file formats.

getline reads in a full line, up to a '\n', into a string. The '\n' is consumed from the stream, but getline doesn't include it in the string. That's fine so far, but there might be a '\r' just before the '\n' that gets included into the string.

There are three types of line endings seen in text files: '\n' is the conventional ending on Unix machines, '\r' was (I think) used on old Mac operating systems, and Windows uses a pair, '\r' following by '\n'.

The problem is that getline leaves the '\r' on the end of the string.

ifstream f("a_text_file_of_unknown_origin");
string line;
getline(f, line);
if(!f.fail()) { // a non-empty line was read
   // BUT, there might be an '\r' at the end now.
}

Edit Thanks to Neil for pointing out that f.good() isn't what I wanted. !f.fail() is what I want.

I can remove it manually myself (see edit of this question), which is easy for the Windows text files. But I'm worried that somebody will feed in a file containing only '\r'. In that case, I presume getline will consume the whole file, thinking that it is a single line!

.. and that's not even considering Unicode :-)

.. maybe Boost has a nice way to consume one line at a time from any text-file type?

Edit I'm using this, to handle the Windows files, but I still feel I shouldn't have to! And this won't fork for the '\r'-only files.

if(!line.empty() && *line.rbegin() == '\r') {
    line.erase( line.length()-1, 1);
}

Answer

Johan Råde picture Johan Råde · May 22, 2011

As Neil pointed out, "the C++ runtime should deal correctly with whatever the line ending convention is for your particular platform."

However, people do move text files between different platforms, so that is not good enough. Here is a function that handles all three line endings ("\r", "\n" and "\r\n"):

std::istream& safeGetline(std::istream& is, std::string& t)
{
    t.clear();

    // The characters in the stream are read one-by-one using a std::streambuf.
    // That is faster than reading them one-by-one using the std::istream.
    // Code that uses streambuf this way must be guarded by a sentry object.
    // The sentry object performs various tasks,
    // such as thread synchronization and updating the stream state.

    std::istream::sentry se(is, true);
    std::streambuf* sb = is.rdbuf();

    for(;;) {
        int c = sb->sbumpc();
        switch (c) {
        case '\n':
            return is;
        case '\r':
            if(sb->sgetc() == '\n')
                sb->sbumpc();
            return is;
        case std::streambuf::traits_type::eof():
            // Also handle the case when the last line has no line ending
            if(t.empty())
                is.setstate(std::ios::eofbit);
            return is;
        default:
            t += (char)c;
        }
    }
}

And here is a test program:

int main()
{
    std::string path = ...  // insert path to test file here

    std::ifstream ifs(path.c_str());
    if(!ifs) {
        std::cout << "Failed to open the file." << std::endl;
        return EXIT_FAILURE;
    }

    int n = 0;
    std::string t;
    while(!safeGetline(ifs, t).eof())
        ++n;
    std::cout << "The file contains " << n << " lines." << std::endl;
    return EXIT_SUCCESS;
}