UTF-8 encode URLs

Adam Lynch · May 23, 2011 · Viewed 44.8k times

Info:

I've a program which generates XML sitemaps for Google Webmaster Tools (among other things).
GWTs is giving me errors for some sitemaps because the URLs contain character sequences like ã¾, ã‹, ã€, etc. **

GWTs says:

We require your Sitemap file to be UTF-8 encoded (you can generally do this when you save the file). As with all XML files, any data values (including URLs) must use entity escape codes for the characters: &, ', ", <, >.

The special characters are escaped in the XML files (with HTML entities).
XML file snippet:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>http://domain/folder/listing-&#227;&#129;.shtml</loc>
        ...

Are my URLs UTF-8 encoded?

If not, how do I do this in Java?
The following is the line in my program where I add the URL to the sitemap:

    siteMap.addUrl(StringEscapeUtils.escapeXml(countryName+"/"+twoCharFile.getRelativeFileName().toLowerCase()));

** = I'm not sure which ones are causing the error, probably the first two examples.

I apologize for all the editing.

Answer

Jai · May 23, 2011

Try using URLEncoder.encode(stringToBeEncoded, "UTF-8") to encode the URL.
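
A rough sketch of how that might look for a relative path like the one in the question. Two things to watch for: URLEncoder does form-style encoding, so it turns a space into "+" and would also encode the "/" separators if given the whole path at once. The version below encodes each segment separately and swaps "+" for "%20". The class name SitemapUrlEncoder, the method encodePath and the sample path are made up for illustration; they are not part of the original program.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SitemapUrlEncoder {

    // Hypothetical helper: percent-encodes every path segment as UTF-8
    // and keeps the "/" separators between segments untouched.
    public static String encodePath(String path) {
        String[] segments = path.split("/", -1);
        StringBuilder encoded = new StringBuilder();
        for (int i = 0; i < segments.length; i++) {
            if (i > 0) {
                encoded.append('/');
            }
            try {
                // URLEncoder emits "+" for a space (a literal "+" becomes "%2B"),
                // so replacing "+" with "%20" afterwards is safe for a URL path.
                encoded.append(URLEncoder.encode(segments[i], "UTF-8").replace("+", "%20"));
            } catch (UnsupportedEncodingException e) {
                // Every Java platform is required to support UTF-8.
                throw new IllegalStateException("UTF-8 not available", e);
            }
        }
        return encoded.toString();
    }

    public static void main(String[] args) {
        // Sample path is an assumption; \u307e is a single Japanese character.
        System.out.println(encodePath("japan/listing-\u307e.shtml"));
        // prints japan/listing-%E3%81%BE.shtml
    }
}
```

In the question's code this would presumably wrap the path before the existing escapeXml call, e.g. StringEscapeUtils.escapeXml(SitemapUrlEncoder.encodePath(...)). Only the path should go through it, not a full URL including "http://", since the ":" would be encoded as well.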