C++ How to inspect file Byte Order Mark in order to get if it is UTF-8?

myWallJSON picture myWallJSON · Feb 1, 2012 · Viewed 10.5k times · Source

I wonder how to inspect file Byte Order Mark in order to get if it is UTF-8 in C++?

Answer

Ian Clelland picture Ian Clelland · Feb 1, 2012

In general, you can't.

The presence of a Byte Order Mark is a very strong indication that the file you are reading is Unicode. If you are expecting a text file, and the first four bytes you receive are:

0x00, 0x00, 0xfe, 0xff -- The file is almost certainly UTF-32BE
0xff, 0xfe, 0x00, 0x00 -- The file is almost certainly UTF-32LE
0xfe, 0xff,  XX,   XX     -- The file is almost certainly UTF-16BE
0xff, 0xfe,  XX,   XX (but not 00, 00) -- The file is almost certainly UTF-16LE
0xef, 0xbb, 0xbf,  XX   -- The file is almost certainly UTF-8 With a BOM

But what about anything else? If the bytes you get are anything other than one of these five patterns, then you can't say for certain that your file is or is not UTF-8.

In fact, any text document containing only ASCII characters from 0x00 to 0x7f is a valid UTF-8 document, as well as being a plain ASCII document.

There are heuristics that can try to infer, based on the particular characters that are seen, whether a document is encoded in, say, ISO-8859-1, or UTF-8, or CP1252, but in general, the first two, three, or four bytes of a file are not enough to say whether what you are looking at is definitely UTF-8.