How do I convert a Windows-1251 text to something readable?

Dmitrii Pisarenko picture Dmitrii Pisarenko · Jul 30, 2015 · Viewed 7k times · Source

I have a string, which is returned by the Jericho HTML parser and contains some Russian text. According to source.getEncoding() and the header of the respective HTML file, the encoding is Windows-1251.

How can I convert this string to something readable?

I tried this:

import java.io.UnsupportedEncodingException;

public class Program {
    public void run() throws UnsupportedEncodingException {
        final String windows1251String = getWindows1251String();
        System.out.println("String (Windows-1251): " + windows1251String);
        final String readableString = convertString(windows1251String);
        System.out.println("String (converted): " + readableString);
    }
    private String convertString(String windows1251String) throws UnsupportedEncodingException {
        return new String(windows1251String.getBytes(), "UTF-8");
    }
    private String getWindows1251String() {
        final byte[] bytes = new byte[] {32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
        return new String(bytes);
    }
    public static void main(final String[] args) throws UnsupportedEncodingException {
        final Program program = new Program();
        program.run();
    }
}

The variable bytes contains the data shown in my debugger, it's the result of net.htmlparser.jericho.Element.getContent().toString().getBytes(). I just copy and pasted that array here.

This doesn't work - readableString contains garbage.

How can I fix it, i. e. make sure that the Windows-1251 string is decoded properly?

Update 1 (30.07.2015 12:45 MSK): When change the encoding in the call in convertString to Windows-1251, nothing changes. See the screenshot below.

Screenshot

Update 2: Another attempt:

Second screenshot

Update 3 (30.07.2015 14:38): The texts that I need to decode correspond to the texts in the drop-down list shown below.

Expected result

Update 4 (30.07.2015 14:41): The encoding detector (code see below) says that the encoding is not Windows-1251, but UTF-8.

public static String guessEncoding(byte[] bytes) {
    String DEFAULT_ENCODING = "UTF-8";
    org.mozilla.universalchardet.UniversalDetector detector =
        new org.mozilla.universalchardet.UniversalDetector(null);
    detector.handleData(bytes, 0, bytes.length);
    detector.dataEnd();
    String encoding = detector.getDetectedCharset();
    System.out.println("Detected encoding: " + encoding);
    detector.reset();
    if (encoding == null) {
        encoding = DEFAULT_ENCODING;
    }
    return encoding;
}

Answer

Rodney picture Rodney · Jul 30, 2015

(In the light of updates I deleted my original answer and started again)

The text which appears

пїЅпїЅпїЅпїЅпїЅпїЅ

is an accurate decoding of these byte values

-17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67

(Padded at either end with 32, which is space.)

So either

1) The text is garbage or

2) The text is supposed to look like that or

3) The encoding is not Windows-1215

This line is notably wrong

return new String(windows1251String.getBytes(), "UTF-8");

Extracting the bytes out of a string and constructing a new string from that is not a way of "converting" between encodings. Both the input String and the output String use UTF-16 encoding internally (and you don't normally even need to know or care about that). The only times other encodings come into play are when text data is stored OUTSIDE of a string object - ie in your initial byte array. Conversion occurs when the String is constructed and then it is done. There is no conversion from one String type to another - they are all the same.

The fact that this

return new String(bytes);

does the same as this

return new String(bytes, "Windows-1251");

suggests that Windows-1251 is the platforms default encoding. (Which is further supported by your timezone being MSK)