Writing Unicode to an RTF file

Oglop · Oct 25, 2011 · Viewed 7.2k times

I'm trying to write strings in different languages to an RTF file. I have tried a few different things. I use Japanese here as an example, but it's the same for the other languages I have tried.

public void writeToFile(){

    String strJapanese = "日本語";
    DataOutputStream outStream;
    File file = new File("C:\\file.rtf");

    try{

        outStream = new DataOutputStream(new FileOutputStream(file));
        outStream.writeBytes(strJapanese);
        outStream.close();

    }catch (Exception e){
        System.out.println(e.toString());
    }
}

I also have tried:

byte[] b = strJapanese.getBytes("UTF-8");
String output = new String(b);

Or, more specifically:

byte[] b = strJapanese.getBytes("Shift-JIS");
String output = new String(b);

The output stream also has the writeUTF method:

outStream.writeUTF(strJapanese);

You can also pass the byte[] directly to the output stream's write method. All of the above gives me garbled characters for everything except West European languages. To see whether it works, I have tried opening the resulting document in Notepad++ and setting the appropriate encoding. I have also used OpenOffice, where you can choose the encoding and font when opening the document.

If it does work but my computer can't open it properly, is there a way to check that?

Answer

bobince · Oct 25, 2011

DataOutputStream outStream;

You probably don't want a DataOutputStream for writing an RTF file. DataOutputStream is for writing binary structures to a file, but RTF is text-based. Typically an OutputStreamWriter, with the appropriate charset set in the constructor, would be the way to write text files.
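
A minimal sketch using the variables from the question (java.io imports assumed, as there; note that, as explained below, a plain UTF-8 text file still isn't valid RTF):

// A Writer encodes chars to bytes with the charset you name,
// rather than discarding bits the way writeBytes does.
Writer out = new OutputStreamWriter(new FileOutputStream(file), "UTF-8");
out.write(strJapanese);
out.close();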

outStream.writeBytes(strJapanese);

In particular this fails because writeBytes really does write bytes, even though you pass it a String. A much more appropriate datatype would have been byte[], but that's just one of the places where Java's handling of bytes vs chars is confusing. The way it converts your string to bytes is simply by taking the lower eight bits of each UTF-16 code unit, and throwing the rest away. This results in ISO-8859-1 encoding with garbled nonsense for all the characters that don't exist in ISO-8859-1.
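
To see concretely what gets thrown away, a small sketch that prints the low byte of each code unit in the question's string:

String s = "日本語";
for (char c : s.toCharArray()) {
    // writeBytes keeps only (c & 0xFF): U+65E5 becomes 0xE5, U+672C becomes 0x2C...
    System.out.printf("U+%04X -> 0x%02X%n", (int) c, c & 0xFF);
}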

byte[] b = strJapanese.getBytes("UTF-8");
String output = new String(b);

This doesn't really do anything useful. You encode to UTF-8 bytes and then decode them back to a String using the default charset. It's almost always a mistake to touch the default charset, as it is unpredictable across different machines.
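
If you do need a byte[] round trip, name the charset on both sides; a sketch:

byte[] b = strJapanese.getBytes("UTF-8");
// Decode with the same charset you encoded with; never rely on the platform default.
String output = new String(b, "UTF-8");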

outStream.writeUTF(strJapanese);

This would be a better stab at writing UTF-8, but it's still not quite right: it uses Java's bogus “modified UTF-8” encoding, prefixed with a two-byte length field that would appear as junk in the file, and more importantly RTF files don't actually support UTF-8, and shouldn't really directly include any non-ASCII characters at all.

Traditionally, non-ASCII characters from 128 upwards should be written as hex byte escapes like \'80, and the encoding for them is specified, if it is at all, in font \fcharset and \cpg escapes that are very, very annoying to deal with and don't offer UTF-8 as one of the options.

In more modern RTF, you get \u1234x escapes as in Dabbler's answer (+1). Each escape encodes one UTF-16 code unit, which corresponds to a Java char, so it's not too difficult to regex-replace all non-ASCII characters with their escaped variants (a sketch follows).
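
A sketch of such an escaper, using a plain loop rather than a regex (the helper name rtfEscape is made up; a real implementation would also have to emit the surrounding {\rtf1 ...} document skeleton):

static String rtfEscape(String s) {
    StringBuilder sb = new StringBuilder();
    for (char c : s.toCharArray()) {
        if (c == '\\' || c == '{' || c == '}') {
            sb.append('\\').append(c); // RTF's own special characters
        } else if (c < 128) {
            sb.append(c); // plain ASCII passes through unchanged
        } else {
            // \uN takes a signed 16-bit decimal; the trailing '?' is the
            // fallback character for readers that don't understand \u.
            sb.append("\\u").append((short) c).append('?');
        }
    }
    return sb.toString();
}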

This is supported by Word 97 and later, but some other tools may ignore the Unicode escape and fall back to the replacement character that follows it (the x in \u1234x).

RTF is not a very nice format.