Adding UTF-8 BOM to string/Blob

kay picture kay · Jul 26, 2013 · Viewed 48.7k times · Source

I need to add a UTF-8 byte-order-mark to generated text data on client side. How do I do that?

Using new Blob(['\xEF\xBB\xBF' + content]) yields '"my data"', of course.

Neither did '\uBBEF\x22BF' work (with '\x22' == '"' being the next character in content).

Is it possible to prepend the UTF-8 BOM in JavaScript to a generated text?

Yes, I really do need the UTF-8 BOM in this case.

Answer

Erik Töyrä Silfverswärd picture Erik Töyrä Silfverswärd · Jul 26, 2013

Prepend \ufeff to the string. See http://msdn.microsoft.com/en-us/library/ie/2yfce773(v=vs.94).aspx

See discussion between @jeff-fischer and @casey for details on UTF-8 and UTF-16 and the BOM. What actually makes the above work is that the string \ufeff is always used to represent the BOM, regardless of UTF-8 or UTF-16 being used.

See p.36 in The Unicode Standard 5.0, Chapter 2 for a detailed explanation. A quote from that page

The endian order entry for UTF-8 in Table 2-4 is marked N/A because UTF-8 code units are 8 bits in size, and the usual machine issues of endian order for larger code units do not apply. The serialized order of the bytes must not depart from the order defined by the UTF- 8 encoding form. Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature.