First of all, I've read the other post about PHP's mb_detect_encoding, "Strange behaviour of mb_detect_order() in PHP", which confirms what I had learned through trial and error. However, a few things still confuse me.
I'm building an HTML scraper for mostly English sites that collects data and stores it as UTF-8 XML. I ran into a problem where a page declares its charset as ISO-8859-1 but contains characters unique to Windows-1252, specifically the right single quote (’), 0x92. As I understand it, Windows-1252 is a superset of ISO-8859-1, which makes me wonder why I should bother with utf8_encode() at all. Why not just use iconv('Windows-1252', 'UTF-8', $str) in place of utf8_encode(), since everything representable in ISO-8859-1 would be converted, along with the characters unique to Windows-1252 (e.g. € ‚ ƒ ‘ ’ “ ”)?
Also
$ansi = "€"; // euro sign; the source file itself is saved as ANSI
$detected = mb_detect_encoding($ansi, "WINDOWS-1252"); // $detected == "Windows-1252"
$detected = mb_detect_encoding('a'.$ansi, "WINDOWS-1252"); // $detected == FALSE
$detected = mb_detect_encoding($ansi.'a', "WINDOWS-1252"); // $detected == "Windows-1252"
$detected = mb_detect_encoding($ansi.'a', "WINDOWS-1252", TRUE); // $detected == FALSE
Why does this happen? If the first character of the string is not a Windows-1252 character, detection fails even though the rest of the string is? Doesn't this behavior make it pretty useless for distinguishing ISO-8859-1 from Windows-1252?
The other thing that confuses me: say I want to detect the charset among ASCII, ISO-8859-1, Windows-1252 and UTF-8. Is it possible to detect strings in a way that gives me the lowest-ranking charset? For example:
$ascii = "123"; // desired detect result == 'ASCII'
$iso = "é".$ascii; // desired detect result == 'ISO-8859-1'
$ansi = "€".$iso; // desired detect result == 'Windows-1252'
$utf8 = file_get_contents('utf8.txt', true); // $utf8 == '你好123é€', desired detect result == 'UTF-8'
Shouldn't my detect order be $detect_order = array('ASCII', 'ISO-8859-1', 'Windows-1252', 'UTF-8')? I know this is incorrect, because it gave me the following results:
$ascii == 'ASCII'
$iso == 'ISO-8859-1'
$ansi == 'ISO-8859-1'
$utf8 == 'ISO-8859-1'
Why is my detect order of ('ASCII', 'ISO-8859-1', 'Windows-1252', 'UTF-8') wrong for what I want?
The closest I got to the desired return values was:
$ascii == 'ASCII'
$iso == 'ISO-8859-1'
$ansi == 'ISO-8859-1'
$utf8 == 'UTF-8'
Both of the following mb_detect_order arrays gave me the above values:
$detect_order = array('ASCII', 'UTF-8', 'Windows-1252', 'ISO-8859-1');
$detect_order = array('ASCII', 'UTF-8', 'ISO-8859-1', 'Windows-1252');
This is confusing the heck out of me!
Can someone shed some light on this? Thanks a lot, much appreciated!
It's a known bug. Windows-1251 and Windows-1252 will only succeed if the entire string consists of high-byte characters in a certain range. That means you'll never get the right conversion because the text will appear as ISO-8859-1 even if it is Windows-1252.
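Given that limitation, one possible workaround is to skip mb_detect_encoding for the single-byte charsets and inspect the bytes directly. The sketch below is only a heuristic under stated assumptions: guess_charset is a hypothetical helper name, not part of mbstring, and it assumes the input can only be ASCII, UTF-8, ISO-8859-1 or Windows-1252. UTF-8 is tested before the single-byte charsets because any byte sequence is formally valid ISO-8859-1, so putting ISO-8859-1 first would match everything.

```php
<?php
// Hypothetical helper (not an mbstring function): returns the narrowest
// label among ASCII, UTF-8, ISO-8859-1 and Windows-1252 that fits the bytes.
function guess_charset(string $bytes): string
{
    // Only 7-bit bytes: plain ASCII.
    if (preg_match('/^[\x00-\x7F]*$/D', $bytes) === 1) {
        return 'ASCII';
    }
    // Contains high bytes and forms well-formed UTF-8 sequences.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return 'UTF-8';
    }
    // Bytes 0x80-0x9F are control codes in ISO-8859-1 but mostly printable
    // characters (euro sign, curly quotes, ...) in Windows-1252.
    // Assumption: real-world text does not contain C1 control codes.
    if (preg_match('/[\x80-\x9F]/', $bytes) === 1) {
        return 'Windows-1252';
    }
    return 'ISO-8859-1';
}
```

With the strings from the question written as raw bytes, guess_charset("123") gives 'ASCII', guess_charset("\xE9123") gives 'ISO-8859-1', and guess_charset("\x80\xE9123") gives 'Windows-1252'.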
I ran into this problem converting from LATIN1 to UTF-8. I had a lot of content pasted from Microsoft Word and stored in a VARCHAR field of a MySQL table using the LATIN1 charset. As you probably know, Word converts apostrophes and quotes to smart apostrophes and curly quotes. None of them would display on screen, because those characters weren't properly converted. The text was always identified as ISO-8859-1. To solve the problem, I forced the conversion from Windows-1252 to UTF-8, and both apostrophes and quotes (and other characters) were properly converted.
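A minimal sketch of that forced conversion, assuming the mbstring extension is available. The helper name latin1_to_utf8 is hypothetical, and mb_convert_encoding is used here as one option; iconv with the same charset names should behave similarly where it is supported.

```php
<?php
// Hypothetical helper: always decode "Latin-1" input as Windows-1252,
// so bytes 0x80-0x9F become curly quotes, the euro sign, etc.
function latin1_to_utf8(string $bytes): string
{
    return mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');
}

$word = "It\x92s"; // 0x92 is the right single quote in Windows-1252

// Decoded as Windows-1252: 0x92 becomes U+2019 (bytes e2 80 99).
echo bin2hex(latin1_to_utf8($word)), "\n";                             // 4974e2809973

// Decoded as ISO-8859-1: 0x92 becomes the control code U+0092 (bytes c2 92).
echo bin2hex(mb_convert_encoding($word, 'UTF-8', 'ISO-8859-1')), "\n"; // 4974c29273
```

The second call shows why the quotes disappeared: the ISO-8859-1 reading turns 0x92 into an invisible control character instead of ’.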