Reliance on default encoding, what should I use and why?

Nikolas · Mar 1, 2014 · Viewed 40k times

FindBugs reports a bug:

Reliance on default encoding: Found a call to a method which will perform a byte to String (or String to byte) conversion, and will assume that the default platform encoding is suitable. This will cause the application behaviour to vary between platforms. Use an alternative API and specify a charset name or Charset object explicitly.

I used FileReader like this (just a piece of code):

public ArrayList<String> getValuesFromFile(File file){
    String line;
    StringTokenizer token;
    ArrayList<String> list = null;
    BufferedReader br = null;
    try {
        br = new BufferedReader(new FileReader(file));
        list = new ArrayList<String>();
        while ((line = br.readLine())!=null){
            token = new StringTokenizer(line);
            token.nextToken();
            list.add(token.nextToken());
    ...

To correct the bug I need to change

br = new BufferedReader(new FileReader(file));

to

br = new BufferedReader(new InputStreamReader(new FileInputStream(file), Charset.defaultCharset()));

The same warning appeared when I used PrintWriter. So I have two questions. First, when can (or should) I use FileReader and PrintWriter, if relying on the default encoding is not good practice? Second, is Charset.defaultCharset() the right way to fix this? I chose that method so that the charset is determined automatically from the user's OS.

Answer

McDowell · Mar 1, 2014

Ideally, it should be:

try (InputStream in = new FileInputStream(file);
     Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
     BufferedReader br = new BufferedReader(reader)) {

...or:

try (BufferedReader br = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {

...assuming the file is encoded as UTF-8.
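The question also mentions PrintWriter, and the same idea applies when writing. The following is a minimal sketch, not a definitive implementation: the class name Utf8RoundTrip and the helpers writeLines and readLines are hypothetical names chosen for illustration, and it assumes the file is meant to be UTF-8. A java.io.File can be converted with file.toPath().

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical class and method names, for illustration only.
public class Utf8RoundTrip {

    // Writes each line as UTF-8, regardless of the platform default encoding.
    static void writeLines(Path path, List<String> lines) throws IOException {
        try (PrintWriter out = new PrintWriter(
                Files.newBufferedWriter(path, StandardCharsets.UTF_8))) {
            for (String line : lines) {
                out.println(line);
            }
        }
    }

    // Reads the file back with the same explicit charset.
    static List<String> readLines(Path path) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            String line;
            while ((line = br.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("utf8-demo", ".txt");
        try {
            // Unicode escapes keep this source file independent of the compiler's encoding.
            writeLines(path, Arrays.asList("1 Gr\u00FC\u00DFe", "2 \u65E5\u672C\u8A9E"));
            System.out.println(readLines(path).size());
        } finally {
            Files.deleteIfExists(path);
        }
    }
}
```

Because both sides name the charset, the bytes on disk are the same on every platform, which is what the FindBugs warning is asking for.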

Pretty much every encoding that isn't a Unicode Transformation Format is obsolete for natural language data. There are languages you cannot support without Unicode.
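To illustrate the point about languages that need Unicode, here is a small sketch with a hypothetical class LossyEncodingDemo and helper roundTrips. It checks whether a string survives being encoded and decoded with a given charset. ISO-8859-1 is used only as an example of a non-Unicode encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical class and method names, for illustration only.
public class LossyEncodingDemo {

    // Returns true if the text is unchanged after an encode/decode round trip.
    // String.getBytes(Charset) substitutes a replacement byte for characters
    // the charset cannot represent, so unsupported text comes back altered.
    static boolean roundTrips(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset).equals(text);
    }

    public static void main(String[] args) {
        // Three Japanese characters, written as Unicode escapes so the
        // source file itself does not depend on the compiler's encoding.
        String text = "\u65E5\u672C\u8A9E";
        System.out.println(roundTrips(text, StandardCharsets.UTF_8));      // prints true
        System.out.println(roundTrips(text, StandardCharsets.ISO_8859_1)); // prints false
    }
}
```

UTF-8 can encode every Unicode character, so the first check succeeds. ISO-8859-1 has no representation for these characters, so they are silently replaced and the data is lost.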