Detect charset of string in PHP (UTF-8 or Windows-1256)

Mohamed Atef · Mar 3, 2013


I'm working on a script based on "Simple HTML DOM", and I want to detect a string's charset after getting the inner text from a URL, so that I can convert it to UTF-8 using iconv().
I've tried a lot of things, but none of them work with Windows-1256.
What I've tried:

mb_detect_encoding($content) detects Windows-1256 as UTF-8
mb_detect_encoding($content, "windows-1256") gives an "Illegal argument" error

function is_utf8($string) {   
  return preg_match('%^(?:  
  [\x09\x0A\x0D\x20-\x7E] # ASCII  
  | [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte  
  | \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs  
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte  
  | \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates  
  | \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3  
  | [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15  
  | \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16  
  )*$%xs', $string);
}

This function returns 0 if the string is not UTF-8, but when the string is UTF-8 the result is "Page can not be found". I'm not sure why!
My code is:

$html = file_get_html($url);
foreach($html->find('div[id=content]') as $element) {
  $content = $element->innertext;
  #Detect charset encoding of $content
}

URLs I'm working with:
UTF-8: http://www.masrawy.com/news/Egypt/Politics/2013/March/3/5541050.aspx
Windows-1256: http://www.youm7.com//News.asp?NewsID=965545

Answer

Mark Ormston · Mar 3, 2013

Have you tried using the following?

function is_utf8($string) {
  return (mb_detect_encoding($string, 'UTF-8', true) == 'UTF-8');
}

This works for me on the URLs you specified.
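
To tie the check above back to the original goal of converting everything to UTF-8, here is a minimal sketch. The helper name to_utf8() is made up for illustration. It assumes the only two encodings you will ever see are UTF-8 and Windows-1256 (as in the question), and that the mbstring and iconv extensions are loaded.

```php
<?php
// Hypothetical helper (not from the thread): assumes the input is either
// UTF-8 or Windows-1256, the two encodings mentioned in the question.
function to_utf8($string) {
    // Strict mode (third argument) makes mb_detect_encoding() reject byte
    // sequences that are not valid UTF-8 instead of guessing.
    if (mb_detect_encoding($string, 'UTF-8', true) === 'UTF-8') {
        return $string; // already UTF-8, nothing to convert
    }
    // Assumption: anything that is not valid UTF-8 is Windows-1256.
    return iconv('WINDOWS-1256', 'UTF-8', $string);
}
```

In the question's loop this would look something like $content = to_utf8($element->innertext);. Keep in mind that the check only proves the bytes are valid UTF-8. A Windows-1256 string whose bytes happen to form valid UTF-8 would slip through unconverted, although that is unlikely for real Arabic text.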

Also, the masrawy.com site CONSTANTLY failed to load for me while I was testing a few different options, which may be why you are seeing "Page can not be found".

Oddly enough, using a regex like yours caused PHP to crash completely on my Windows install, taking Apache down with it.
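
I can't confirm the cause of that crash. One plausible explanation is that the repeated group in the pattern exhausts PCRE's stack on long pages, so this is only a guess. If you still want a regex-style check without the big alternation, a minimal sketch is to let PCRE validate the subject itself via the /u modifier. The function name is_utf8_simple() is made up for illustration.

```php
<?php
// Hypothetical alternative to the long hand-written pattern.
function is_utf8_simple($string) {
    // With the /u modifier, PCRE checks that the subject is valid UTF-8
    // before matching. preg_match() returns false (not 0) when that
    // check fails, and 1 when the empty pattern matches.
    return preg_match('//u', $string) === 1;
}
```

The empty pattern has no repeated group to walk through the input, so it should sidestep whatever the original pattern was tripping over on large inputs.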