Determining ISO-8859-1 vs US-ASCII charset

vikingsteve · Jun 10, 2015 · Viewed 10.8k times

I am trying to determine whether to use

PrintWriter pw = new PrintWriter(outputFilename, "ISO-8859-1");

or

PrintWriter pw = new PrintWriter(outputFilename, "US-ASCII");

I was reading All about character sets to determine the character set of an example file, which I must recreate in the same encoding via Java code.

When my example file contains "European" letters (Norwegian: å ø æ), then the following command tells me the file encoding is "iso-8859-1"

file -bi example.txt

However, when I take a copy of the same example file and modify it to contain different data, without any Norwegian text (let's say, I replace "Bjørn" with "Bjorn"), then the same command tells me the file encoding is "us-ascii".

file -bi example-no-european-letters.txt

What does this mean? Is ISO-8859-1 in practice the same as US-ASCII if there are no "European" characters in it?
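A quick way to see what `file -bi` is keying on is to check whether any byte has its high bit set; here is a minimal sketch (the class and file name are only illustrative, not from the question):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class HighBitCheck {
    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get("example.txt"));
        boolean pureAscii = true;
        for (byte b : data) {
            // Bytes 0x00-0x7F cover US-ASCII; anything above that needs
            // ISO-8859-1, UTF-8 or some other encoding to interpret.
            if ((b & 0x80) != 0) {
                pureAscii = false;
                break;
            }
        }
        System.out.println(pureAscii ? "looks like us-ascii" : "contains non-ASCII bytes");
    }
}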

Should I just use the charset "ISO-8859-1" and everything will be OK?

Answer

Kayaman · Jun 10, 2015

If the file contains only 7-bit US-ASCII characters, it can be read as US-ASCII. That doesn't tell you anything about which charset was actually intended; it may simply be a coincidence that the data contained no characters requiring a different encoding.

ISO-8859-1 (and -15) is a common European encoding, able to encode äöåéü and other characters; its first 128 code points (0–127) are identical to US-ASCII, as is the case for many encodings, for compatibility reasons.
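To illustrate that overlap, a small sketch (not part of the original answer, using the names from the question) comparing the bytes the two charsets produce:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiSubsetDemo {
    public static void main(String[] args) {
        // Pure ASCII text encodes to byte-for-byte identical output in both charsets.
        String ascii = "Bjorn";
        byte[] asUsAscii = ascii.getBytes(StandardCharsets.US_ASCII);
        byte[] asLatin1 = ascii.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(asUsAscii, asLatin1)); // true

        // 'ø' is outside US-ASCII: getBytes() silently replaces it with '?' (0x3F),
        // while ISO-8859-1 encodes it as the single byte 0xF8.
        String norwegian = "Bjørn";
        System.out.println(Arrays.toString(norwegian.getBytes(StandardCharsets.US_ASCII)));
        System.out.println(Arrays.toString(norwegian.getBytes(StandardCharsets.ISO_8859_1)));
    }
}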

However, you can't just pick an encoding and assume that "everything will be OK". The very common UTF-8 encoding also contains US-ASCII as a subset, but it encodes characters such as äöå as two bytes each, instead of the single byte ISO-8859-1 uses.
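For example, a quick sketch (again just illustrative) showing the byte-count difference:

import java.nio.charset.StandardCharsets;

public class Utf8VsLatin1 {
    public static void main(String[] args) {
        String s = "åøæ";
        // ISO-8859-1: one byte per character -> 3 bytes in total.
        System.out.println(s.getBytes(StandardCharsets.ISO_8859_1).length); // 3
        // UTF-8: each of these characters takes two bytes -> 6 bytes in total.
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);      // 6
    }
}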

TL;DR: Don't make assumptions about encodings. Find out what was intended and use that. If you can't find out, inspect the data to figure out which charset is correct to use (as you noted yourself, multiple encodings may appear to work, at least for a while).
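If you do end up inspecting the data, one possible approach (a sketch, not part of the original answer) is to ask each candidate charset's decoder to decode strictly and see which candidates fail:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CharsetProbe {
    // Returns true if the bytes decode without error under the given charset.
    // Note: ISO-8859-1 accepts every possible byte sequence, so a "true" here
    // can only rule charsets out; it never proves what the author intended.
    static boolean decodesAs(byte[] data, Charset cs) {
        try {
            cs.newDecoder()
              .onMalformedInput(CodingErrorAction.REPORT)
              .onUnmappableCharacter(CodingErrorAction.REPORT)
              .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get("example.txt"));
        System.out.println("US-ASCII:   " + decodesAs(data, StandardCharsets.US_ASCII));
        System.out.println("UTF-8:      " + decodesAs(data, StandardCharsets.UTF_8));
        System.out.println("ISO-8859-1: " + decodesAs(data, StandardCharsets.ISO_8859_1));
    }
}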