Proper character encoding to display "”"?

Charles Zink · Jun 21, 2011 · Viewed 7.8k times

I'm having some nasty character encoding problems that I just can't figure out.

Essentially, I'm screen-scraping some HTML off a site using PHP, then running it through PHP's DOMDocument to change out some URLs, etc. When it's done, the HTML it outputs contains some weird characters. For example, where there should be an end quote, it puts out ”

I have the page's meta tag for charset set to utf-8, but the ” characters still show up as â€ on the site. I'm not sure if I just don't understand character encoding, or what.

Any suggestions on the best way to resolve this? Something client side with a meta tag, or some kind of server-side PHP conversion?

Answer

gavanon · Sep 11, 2014

Sometimes setting the charset in the HTML or the response header isn't enough. If not everything on your server is set up for UTF-8, your text may get incorrectly converted somewhere along the way. You may need to enable UTF-8 encoding for both Apache and PHP right in their config files. (If you're not using Apache, skip that step.)
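To make the "incorrectly converted" part concrete, here is a minimal Python sketch of how the garbage from the question can arise. Python is used purely for illustration; the byte-level mechanism is the same whatever language does the conversion. It assumes the faulty step reads UTF-8 bytes as Windows-1252, which is a common cause of this symptom but not necessarily what happens on your server. The function names are hypothetical.

```python
# The right double quotation mark (U+201D) takes three bytes in UTF-8.
RIGHT_QUOTE = "\u201d"
utf8_bytes = RIGHT_QUOTE.encode("utf-8")  # b'\xe2\x80\x9d'


def misread_as_cp1252(data: bytes) -> str:
    """Decode bytes as Windows-1252, as a misconfigured layer might (assumption)."""
    # 0xE2 maps to "â" and 0x80 maps to "€". 0x9D has no mapping in
    # Windows-1252, so errors="replace" turns it into U+FFFD.
    return data.decode("cp1252", errors="replace")


def read_as_utf8(data: bytes) -> str:
    """Decode the same bytes with the encoding they were written in."""
    return data.decode("utf-8")


garbled = misread_as_cp1252(utf8_bytes)
correct = read_as_utf8(utf8_bytes)

print(repr(garbled))  # begins with "â€"
print(repr(correct))  # the original quotation mark
```

The bytes never change; only the encoding used to interpret them does. That is why every layer that touches the text needs to agree on UTF-8.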

Apache UTF-8 setup:

Edit either your charset.conf (ideal) or your httpd.conf file, adding this line to the end:

AddDefaultCharset utf-8

(If you don't have access to Apache's config files, you can create a ".htaccess" file in your site's root directory containing that same line.)

PHP UTF-8 setup:

Edit your php.ini file, searching for "default_charset", and change it to:

default_charset = "utf-8"

Restart Apache:

On Debian or Ubuntu, this command should do the trick from the command line (on other systems the service name may differ, for example httpd):

sudo service apache2 restart
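
After the restart, it can be worth confirming that responses really declare UTF-8. Below is a rough Python sketch, using only the standard library, that pulls the charset out of a Content-Type header value. The helper name `declared_charset` and the sample header strings are illustrative assumptions, not output from a real server.

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Return the lowercased charset from a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() reads the charset parameter and lowercases it.
    return msg.get_content_charset()


# Sample header values (hypothetical, for illustration only).
print(declared_charset("text/html; charset=UTF-8"))
print(declared_charset("text/html"))
```

Paste in the Content-Type value shown in your browser's network tab. If the charset is missing or is not utf-8, the Apache or PHP setting has probably not taken effect.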