I've a program which generates XML sitemaps for Google Webmaster Tools (among other things).
GWTs is giving me errors for some sitemaps because the URLs contain character sequences like ã¾, ã‹, ã€, etc. **
GWTs says:
We require your Sitemap file to be UTF-8 encoded (you can generally do this when you save the file). As with all XML files, any data values (including URLs) must use entity escape codes for the characters: &, ', ", <, >.
The special characters are escaped in the XML files (with HTML entities).
XML file snippet:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://domain/folder/listing-ã.shtml</loc>
...
Are my URLs UTF-8 encoded?
If not, how do I do this in Java?
The following is the line in my program where I add the URL to the sitemap:
siteMap.addUrl(StringEscapeUtils.escapeXml(countryName+"/"+twoCharFile.getRelativeFileName().toLowerCase()));
** = I'm not sure which ones are causing the error, probably the first two examples.
I apologize for all the editing.
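Try using URLEncoder.encode(stringToBeEncoded, "UTF-8") to encode the URL. Note that it also encodes "/" as %2F and a space as "+", so apply it to each path segment rather than to the whole URL.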
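A minimal sketch of that suggestion, assuming the relative path is split on "/" and each segment is percent-encoded separately. The class name SitemapUrlEncoder and the method encodePath are hypothetical names for illustration, not part of the original program:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SitemapUrlEncoder {

    // Hypothetical helper: percent-encodes each path segment as UTF-8
    // while leaving the "/" separators untouched.
    static String encodePath(String path) {
        StringBuilder out = new StringBuilder();
        // The -1 limit keeps trailing empty segments, so a trailing "/" survives.
        String[] segments = path.split("/", -1);
        for (int i = 0; i < segments.length; i++) {
            if (i > 0) {
                out.append('/');
            }
            try {
                // URLEncoder targets form encoding, so a space becomes "+";
                // in a URL path it should be "%20" instead.
                out.append(URLEncoder.encode(segments[i], "UTF-8").replace("+", "%20"));
            } catch (UnsupportedEncodingException e) {
                // Every Java platform is required to support UTF-8.
                throw new IllegalStateException(e);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "\u00e3" is the character ã
        System.out.println(encodePath("folder/listing-\u00e3.shtml"));
        // prints folder/listing-%C3%A3.shtml
    }
}
```

In the program from the question, the encoded path would still go through StringEscapeUtils.escapeXml before being passed to siteMap.addUrl, since a URL can still contain characters such as & that must be escaped in XML.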