I'm having some nasty character encoding problems that I just can't figure out.
Essentially, I'm screen scraping some HTML off a site using PHP, then running it through PHP's DOMDocument to change out some URLs, etc. When it's done, it outputs HTML with some weird characters in it. Ex: where there should be an end quote, it puts out ”
I have the page's meta tag for charset set to utf-8, but then the ” characters are showing up as â€ on the site. I'm not sure if I just don't understand character encoding, or what.
Any suggestions on the best way to resolve this? Something client side with a meta tag, or some kind of server-side PHP conversion?
Sometimes setting the charset in the HTML or the response header isn't enough. If everything on your server isn't set up for UTF-8, your text may get incorrectly converted somewhere along the way. You may need to enable UTF-8 encoding for both Apache and PHP right in their config files. (If you're not using Apache, skip that step.)
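To illustrate how text can get converted incorrectly along the way, here is a minimal sketch, assuming PHP 7+ with the mbstring and dom extensions loaded. It shows the three UTF-8 bytes of ” being misread as ISO-8859-1, which is, as far as I know, what DOMDocument's loadHTML() falls back to when the markup declares no charset. The sample markup is hypothetical, and the `<?xml encoding="UTF-8">` prefix is a commonly used workaround rather than anything from the asker's setup:

```php
<?php
// A right double quotation mark (U+201D) is three bytes in UTF-8.
$quote = "\u{201D}";
echo bin2hex($quote), "\n"; // e2809d

// Reading those bytes as ISO-8859-1 turns each byte into its own character,
// which is the kind of garbage that ends up rendered as "â€".
$misread = mb_convert_encoding($quote, 'UTF-8', 'ISO-8859-1');
echo bin2hex($misread), "\n"; // c3a2c280c29d

// Hypothetical sample markup with no charset declaration.
$html = "<p>He said \u{201C}hi\u{201D}</p>";

// Keep parser warnings out of the output.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
// Assumption: the XML encoding hint makes libxml parse the input as UTF-8.
$dom->loadHTML('<?xml encoding="UTF-8">' . $html);

// The DOM API hands strings back as UTF-8, so the quotes should survive.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```

If the quotes come through intact here but not on your page, the conversion is probably happening later, in PHP's output or in Apache.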
Edit either your charset.conf (ideal) or your httpd.conf file, adding this line to the end:
AddDefaultCharset utf-8
(If you don't have access to Apache's config files, you can create a ".htaccess" file in your site's root directory containing that same line.)
Edit your php.ini file: search for "default_charset" and change it to:
default_charset = "utf-8"
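If you can't edit php.ini, the same setting can usually be applied per script. A rough sketch, assuming your host allows ini_set() for this directive and that nothing has been output before this code runs; the explicit Content-Type header is my addition, not part of the config above:

```php
<?php
// Assumption: overriding per script is acceptable when php.ini is off limits.
ini_set('default_charset', 'UTF-8');

// Send an explicit header too, so the browser isn't left guessing.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

// Confirm what PHP is now using.
$charset = ini_get('default_charset');
echo 'default_charset: ', $charset, "\n";
```

This has to run before any output, so it belongs at the very top of the script or in a shared include.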
Then restart Apache so the changes take effect. Depending on your server type, this command may do the trick from the command line:
sudo service apache2 restart