Converting Mac Roman characters to equivalent UTF-8

btschumy · Jul 10, 2013

I have been given some HTML files that use the Mac OS Roman file encoding. The files contain French text, but in an editor many of the accented characters look strange (that is, not French):

Si cette option est sÈlectionnÈe, <removed> tentera de communiquer avec votre tÈlescope seulement ‡ líaide díun ...

The capital E with grave accent (È) does display properly in the browser as é, as do the other strange characters.

I also have some UTF-8 French files that look normal in an editor (é looks like é). What I'd like to do is convert all the Mac Roman files to UTF-8 for easier maintenance.

Simply changing the file encoding in the editor doesn't do this. The strange characters are still strange.

Short of making a conversion dictionary and doing a Find/Replace on all the files, is there a way to do this?

Answer

tchrist · Jul 12, 2013

If your editor isn’t showing it correctly when you specify the encoding, you have given it the wrong encoding. You need to figure out what encoding you really have.
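
One way to find out is to work backwards from what the editor displays to the bytes that are actually in the file. A minimal sketch, assuming an iconv that knows Mac OS Roman under the name MACINTOSH (GNU iconv and libiconv both appear to); bytes_behind is a made-up helper name, not a standard tool:

```shell
# Hypothetical helper: read text as displayed by an editor set to MacRoman
# (UTF-8 on stdin), re-encode it to MacRoman, and print the bytes in hex.
bytes_behind() {
    iconv -f UTF-8 -t MACINTOSH | od -An -tx1 | xargs
}

# Three of the odd-looking characters from the sample line in the question.
printf 'È‡í' | bytes_behind
```

The hex values it prints are what you then match against candidate encodings.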

You appear to have a byte valued 0xE9 where you need a Unicode LATIN SMALL LETTER E WITH ACUTE character. A MacRoman 0xE9 byte is a LATIN CAPITAL LETTER E WITH GRAVE character, which is what your editor is displaying because you said it was MacRoman. But it is not.

However, Unicode code point U+00E9 is indeed LATIN SMALL LETTER E WITH ACUTE.

Therefore, it is not MacRoman that you have there, but almost certainly ISO-8859-1 or ISO-8859-15.
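
The comparison above can be reproduced from the shell by decoding the single byte 0xE9 under each candidate encoding. This is only a sketch: decode_e9 is an illustrative function name, and MACINTOSH is assumed to be the name your iconv uses for Mac OS Roman.

```shell
# Decode the single byte 0xE9 (octal 351) from the encoding named in $1.
decode_e9() {
    printf '\351' | iconv -f "$1" -t UTF-8
}

# Print what each candidate encoding makes of that byte.
for enc in MACINTOSH ISO-8859-1 ISO-8859-15; do
    printf '%-12s %s\n' "$enc" "$(decode_e9 "$enc")"
done
```

MACINTOSH should come out as È, while both ISO encodings should give é.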

So use something like

$ iconv -f ISO-8859-1 -t UTF-8 < input.latin1 > output.utf8

to do the conversion.
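
Since the question is about a whole set of files, the same command can be wrapped in a loop. A rough sketch, not a tested migration script: convert_dir, its arguments, and the .html glob are all assumptions for illustration. It writes converted copies to a separate directory so the originals stay untouched.

```shell
# Hypothetical helper: convert every .html file in directory $1 to UTF-8,
# writing the results into directory $2. The source encoding is $3 and
# defaults to ISO-8859-1, as suggested above.
convert_dir() {
    src=$1
    dst=$2
    enc=${3:-ISO-8859-1}
    mkdir -p "$dst" || return 1
    for f in "$src"/*.html; do
        [ -e "$f" ] || continue    # the glob matched nothing
        iconv -f "$enc" -t UTF-8 < "$f" > "$dst/$(basename "$f")" || return 1
    done
}

# Demo in a throwaway directory: "sélectionnée" with é stored as byte 0xE9.
tmp=$(mktemp -d)
printf 's\351lectionn\351e\n' > "$tmp/sample.html"
convert_dir "$tmp" "$tmp/utf8"
cat "$tmp/utf8/sample.html"
```

One caveat, which is a reading of the sample line rather than something stated in the answer: the í standing in for an apostrophe corresponds to byte 0x92, which is a control character in ISO-8859-1 but ’ in Windows-1252. If the converted apostrophes look wrong, passing CP1252 as the third argument may be worth trying. Any charset declaration inside the HTML itself would also need to be updated to UTF-8 separately.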