Eclipse wrong Java properties UTF-8 encoding

BuZZ-dEE picture BuZZ-dEE · Jun 30, 2015 · Viewed 43.3k times · Source

I have a JavaEE project, in which I use message properties files. The encoding of those file is set to UTF-8. In the file I use the german umlauts like ä, ö, ü. The problem is, sometimes those characters are replaced with unicode like \uFFFD\uFFFD, but not for every character. Now, I have a case where ä and ü are both replaced with \uFFFD\uFFFD, but not for every occurring of ä and ü.

The Git diff shows me something like this:

 mail.adresses=E-Mail hinzufügen:
-mail.adresses.multiple=E-Mails durch Kommata getrennt hinzufügen.
+mail.adresses.multiple=E-Mails durch Kommata getrennt hinzuf\uFFFD\uFFFDgen.
 mail.title=Einladungs-E-Mail
 box.preview=Vorschau
 box.share.text=Sie können jetzt die ausgewählten Bilder mit Ihren Freunden teilen.
@@ -6880,7 +6880,7 @@ browser.cancel=Abbrechen
 browser.selectImage=übernehmen
 browser.starImage=merken
 browser.removeImage=Löschen
-browser.searchForSimilarImages=ähnliche
+browser.searchForSimilarImages=\uFFFD\uFFFDhnliche
 browser.clear_drop_box=löschen

Also, there are lines changed, which I have not touched. I don't understand why I get such a behavior. What could be the cause for the above problem?

My system:

  • Antergos / Arch Linux

    • System encoding UTF-8

      Python 3.5.0 (default, Sep 20 2015, 11:28:25) 
      [GCC 5.2.0] on linux
      Type "help", "copyright", "credits" or "license" for more information.
      >>> import sys
      >>> sys.getdefaultencoding()
      'utf-8'
      
  • Eclipse Mars 1

    • Text file encoding UTF-8 ext file encoding
    • Properties file encoding UTF-8 Properties file encoding
  • Tomcat 8
  • Java JDK 8

If I use another Editor like Atom to edit those message properties files, I don't ran into this problem.

I also realized in a case, if I copy the original value browser.searchForSimilarImages=ähnliche from Git diff and replace the wrong value browser.searchForSimilarImages=\uFFFD\uFFFDhnliche in Eclipse with that, then I have the correct umlauts in the message properties file.

Answer

hagrawal picture hagrawal · Jun 30, 2015

Root cause:

By default ISO 8859-1 character encoding is used for Eclipse properties file (read here), so if the file contains any character beyond ISO 8859-1 then it will not be processed as expected.

Solution 1

If you use Eclipse then you will notice that it implicitly converts the special character into \uXXXX equivalent. Try copying

会意字 / 會意字

into a properties file opened in Eclipse.

EDIT: As per comment from OP

Update the encoding of your Eclipse as shown below. If you set encoding as UTF-32 then even you can see Chinese character, which you cannot see generally.

How to change Encoding of properties file in Eclipse: See this Eclipse Bugzilla bug for more details, which talks about several other possibilities and in the end suggest what I have highlighted below. enter image description here

Chinese characters can be seen in Eclipse after encoding is set properly: enter image description here

Solution 2

If above doesn't work consistently for you (it does work for me and I never see encoding issues) then try this using some Eclipse plugin which handles encoding of properties or other files. For example Eclipse ResourceBundle Editor or Extended Resource-Bundle editor

I would recommend using Eclipse ResourceBundle Editor.

Solution 3

Another possibility to change encoding of file is using Edit --> Set Encoding option. It really matters because it changes the default character set and file encoding. Play around with by changing encoding using Edit --> Set Encoding option and do following Java sysout System.out.println("Default Charset=" + Charset.defaultCharset()); and System.out.println(System.getProperty("file.encoding"));

enter image description here


As an aside: 1

Process the properties file to have content with ISO 8859-1 character encoding by using native2ascii - Native-to-ASCII Converter

What native2ascii does: It converts all the non-ISO 8859-1 character in their equivalent \uXXXX. This is a good tool because you need not to search the \uXXXX equivalent of special character.

Usage for UTF-8: native2ascii -encoding utf8 e:\a.txt e:\b.txt


As an aside: 2

Every computer program whether an IDE, application server, web server, browser, etc. understands only bits, so it need to know how to interpret the bits to make expected sense out of it because depending upon encoding used, same bits can represent different characters. And that's where "Encoding" comes into picture by giving a unique identifier to represent a character so that all computer programs, diverse OS etc. knows exact right way to interpret it.

So, if you have written into a file using some encoding scheme, lets say UTF-8, and then reading using any editor but running with encoding scheme as UTF-8 then you can expect to get correct display.

Please do read my this answer to get more details but from browser-server perspective.