I have a string that is returned by the Jericho HTML parser and contains some Russian text. According to source.getEncoding() and the header of the respective HTML file, the encoding is Windows-1251.
How can I convert this string to something readable?
I tried this:
import java.io.UnsupportedEncodingException;

public class Program {
    public void run() throws UnsupportedEncodingException {
        final String windows1251String = getWindows1251String();
        System.out.println("String (Windows-1251): " + windows1251String);
        final String readableString = convertString(windows1251String);
        System.out.println("String (converted): " + readableString);
    }

    private String convertString(String windows1251String) throws UnsupportedEncodingException {
        return new String(windows1251String.getBytes(), "UTF-8");
    }

    private String getWindows1251String() {
        final byte[] bytes = new byte[] {32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
        return new String(bytes);
    }

    public static void main(final String[] args) throws UnsupportedEncodingException {
        final Program program = new Program();
        program.run();
    }
}
The variable bytes contains the data shown in my debugger; it is the result of net.htmlparser.jericho.Element.getContent().toString().getBytes(). I just copied and pasted that array here.
This doesn't work: readableString contains garbage.
How can I fix it, i.e. make sure that the Windows-1251 string is decoded properly?
Update 1 (30.07.2015 12:45 MSK): When I change the encoding in the call in convertString to Windows-1251, nothing changes. See the screenshot below.
Update 2: Another attempt:
Update 3 (30.07.2015 14:38): The texts that I need to decode correspond to the texts in the drop-down list shown below.
Update 4 (30.07.2015 14:41): The encoding detector (see the code below) says that the encoding is not Windows-1251 but UTF-8.
public static String guessEncoding(byte[] bytes) {
    String DEFAULT_ENCODING = "UTF-8";
    org.mozilla.universalchardet.UniversalDetector detector =
        new org.mozilla.universalchardet.UniversalDetector(null);
    detector.handleData(bytes, 0, bytes.length);
    detector.dataEnd();
    String encoding = detector.getDetectedCharset();
    System.out.println("Detected encoding: " + encoding);
    detector.reset();
    if (encoding == null) {
        encoding = DEFAULT_ENCODING;
    }
    return encoding;
}
(In light of the updates, I deleted my original answer and started again.)
The text which appears
пїЅпїЅпїЅпїЅпїЅпїЅ
is an accurate Windows-1251 decoding of these byte values:
-17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67
(Padded at either end with 32, which is space.)
So either
1) The text is garbage or
2) The text is supposed to look like that or
3) The encoding is not Windows-1251
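The decoding claim can be checked with a small sketch. It is not part of the original code: the class name ReplacementCharDemo and the helper decode are made up for illustration, and it assumes the JRE ships the windows-1251 charset (standard JDKs do). It prints code points rather than glyphs so the result does not depend on the console encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReplacementCharDemo {
    // The three-byte group that repeats six times in the question's array:
    // -17, -65, -67 is 0xEF 0xBF 0xBD when read as unsigned bytes.
    static final byte[] GROUP = {-17, -65, -67};

    // Hypothetical helper: decode raw bytes with an explicitly named charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Print each UTF-16 code unit of the string as U+XXXX.
    static void dump(String label, String s) {
        StringBuilder sb = new StringBuilder(label);
        for (char c : s.toCharArray()) {
            sb.append(String.format(" U+%04X", (int) c));
        }
        System.out.println(sb);
    }

    public static void main(String[] args) {
        dump("windows-1251:", decode(GROUP, Charset.forName("windows-1251")));
        dump("UTF-8:       ", decode(GROUP, StandardCharsets.UTF_8));
    }
}
```

On a standard JDK the first line should read U+043F U+0457 U+0405, which are the three Cyrillic letters shown above, and the second U+FFFD, the Unicode replacement character. That would be consistent with the detector in Update 4 reporting UTF-8, and with the original text having been lost before these bytes were produced, though the snippet alone cannot show where that happened.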
This line is notably wrong:
return new String(windows1251String.getBytes(), "UTF-8");
Extracting the bytes out of a string and constructing a new string from them is not a way of "converting" between encodings. Both the input String and the output String use UTF-16 internally (and you don't normally even need to know or care about that). Other encodings only come into play when text data is stored outside of a String object, i.e. in your initial byte array. Conversion occurs when the String is constructed from those bytes, and then it is done. There is no conversion from one String type to another: they are all the same.
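To illustrate the point, here is a minimal sketch, not the asker's code. The class DecodeOnceDemo, its helpers, and the sample word are assumptions chosen for the example, and the byte array produced by getBytes merely stands in for bytes read from a real Windows-1251 file. The string literal uses Unicode escapes so that the source compiles the same way under any file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeOnceDemo {
    static final Charset CP1251 = Charset.forName("windows-1251");

    // Name the charset the bytes were written in, at the moment they become a String.
    static String decode(byte[] raw, Charset sourceCharset) {
        return new String(raw, sourceCharset);
    }

    // The pattern from the question with both charsets made explicit:
    // encode with one charset, then decode the result with another.
    static String reinterpret(String s, Charset encodeWith, Charset decodeWith) {
        return new String(s.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        // A six-letter Russian word, written with escapes.
        String original = "\u041F\u0440\u0438\u0432\u0435\u0442";
        // Stand-in for bytes that arrived from a Windows-1251 source.
        byte[] raw = original.getBytes(CP1251);

        // Decoding with the charset the bytes were written in recovers the text.
        System.out.println(decode(raw, CP1251).equals(original));                 // true
        // Decoding the same bytes as UTF-8 does not.
        System.out.println(decode(raw, StandardCharsets.UTF_8).equals(original)); // false
        // Round-tripping an already decoded String through mismatched charsets damages it.
        System.out.println(reinterpret(original, CP1251, StandardCharsets.UTF_8).equals(original)); // false
    }
}
```

The only call that helps is the first one, and it has to be applied to the raw bytes. Once a String has been built with the wrong charset and characters have been replaced, no later getBytes/new String combination is guaranteed to bring them back.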
The fact that this
return new String(bytes);
does the same as this
return new String(bytes, "Windows-1251");
suggests that Windows-1251 is the platform's default encoding (which is further supported by your time zone being MSK).
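One way to confirm that guess rather than infer it is to ask the JVM directly. The sketch below is only an illustration; the class name DefaultCharsetDemo and the helper matchesDefault are hypothetical. It relies on the documented behaviour that new String(bytes) without a charset argument decodes with the default charset reported by Charset.defaultCharset().

```java
import java.nio.charset.Charset;

public class DefaultCharsetDemo {
    // True when decoding without a charset argument gives the same result
    // as decoding with the JVM's default charset named explicitly.
    static boolean matchesDefault(byte[] bytes) {
        return new String(bytes).equals(new String(bytes, Charset.defaultCharset()));
    }

    public static void main(String[] args) {
        // The repeating byte group from the question.
        byte[] group = {-17, -65, -67};
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("new String(bytes) uses it: " + matchesDefault(group));
    }
}
```

If the first line prints windows-1251 on your machine, that explains why the two calls above behave identically. Note that from Java 18 onward the default charset is UTF-8 unless it is overridden, so the same code can behave differently on a newer JVM, which is another reason to always pass the charset explicitly.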