I am trying to determine whether to use
PrintWriter pw = new PrintWriter(outputFilename, "ISO-8859-1");
or
PrintWriter pw = new PrintWriter(outputFilename, "US-ASCII");
I was reading "All about character sets" to determine the character set of an example file, which I must reproduce in the same encoding from Java code.
When my example file contains "European" letters (Norwegian: å ø æ), then the following command tells me the file encoding is "iso-8859-1"
file -bi example.txt
However, when I take a copy of the same example file and modify it to contain different data, without any Norwegian text (let's say, I replace "Bjørn" with "Bjorn"), then the same command tells me the file encoding is "us-ascii".
file -bi example-no-european-letters.txt
What does this mean? Is ISO-8859-1 in practice the same as US-ASCII if there are no "European" characters in it?
Should I just use the charset "ISO-8859-1" and everything will be OK?
If the file contains only 7-bit US-ASCII characters, it can be read as US-ASCII. That tells you nothing about which charset was intended. It may be just a coincidence that the file contained no characters that would require a different encoding.
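To illustrate why `file` reports "us-ascii" for the second file, here is a minimal sketch comparing the bytes Java produces for the same text in both charsets. The class and method names are made up for the example, and `\u00f8` is the escape for ø, used so the source file's own encoding does not matter.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of any library.
class AsciiSubsetDemo {
    // True when the text encodes to identical bytes in US-ASCII and ISO-8859-1.
    static boolean sameBytesInBothCharsets(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return Arrays.equals(ascii, latin1);
    }

    public static void main(String[] args) {
        // Only ASCII letters: both charsets yield the same bytes.
        System.out.println(sameBytesInBothCharsets("Bjorn"));      // prints true
        // Contains ø: US-ASCII cannot represent it and substitutes '?'.
        System.out.println(sameBytesInBothCharsets("Bj\u00f8rn")); // prints false
    }
}
```

So for ASCII-only content the two output files are byte-for-byte identical, and no tool can tell which charset name was passed to the writer.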
ISO-8859-1 (like ISO-8859-15) is a common European encoding, able to encode äöåéü and other characters. Its first 128 code points (0 to 127) are the same as in US-ASCII, as is often the case for convenience reasons.
However, you can't just pick an encoding and assume that "everything will be OK". The very common UTF-8 encoding is also a superset of US-ASCII, but it encodes characters such as äöå as two bytes each, instead of ISO-8859-1's one byte.
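The size difference is easy to check. The following is a small sketch (the class and method names are invented for illustration) that counts the bytes each encoding produces for the three letters ä, ö and å, written as Unicode escapes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class ByteLengthDemo {
    // Number of bytes the text occupies when encoded with the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String text = "\u00e4\u00f6\u00e5"; // the letters ä, ö, å
        System.out.println(encodedLength(text, StandardCharsets.ISO_8859_1)); // prints 3
        System.out.println(encodedLength(text, StandardCharsets.UTF_8));      // prints 6
    }
}
```

A reader that expects one encoding and gets the other will show garbage for exactly these characters, while plain ASCII text looks fine in both.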
TL;DR: Don't assume things with encodings. Find out what was intended and use that. If you can't find it out, observe the data to try to figure out what is a correct charset to use (as you noted yourself, multiple encodings may work at least temporarily).
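If you do have to observe the data, one rough approach is sketched below. It is a heuristic under my own assumptions, not a reliable detector: it only checks whether the bytes are pure 7-bit ASCII and whether they decode cleanly as UTF-8. The class and method names are hypothetical. Bytes that fail both checks may be ISO-8859-1 or any other single-byte encoding, and this code cannot tell those apart.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
class CharsetHint {
    // True when every byte is in the 7-bit range 0..127.
    static boolean isPureAscii(byte[] data) {
        for (byte b : data) {
            if (b < 0) { // a negative Java byte means the high bit is set
                return false;
            }
        }
        return true;
    }

    // True when the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = "Bj\u00f8rn".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isPureAscii(latin1)); // prints false
        System.out.println(isValidUtf8(latin1)); // prints false
    }
}
```

To apply this to a real file, the bytes could be read with `java.nio.file.Files.readAllBytes`. A clean result still only says the data is compatible with an encoding, not that the encoding was intended.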