How do I convert special characters using Java?

Vladimir · Feb 18, 2010 · Viewed 35.5k times

I have strings like:

Avery® Laser &amp; Inkjet Self-Adhesive

I need to convert them to

Avery Laser & Inkjet Self-Adhesive.

That is, remove the special characters and convert HTML special characters to regular ones.

Answer

BalusC · Feb 18, 2010
Avery® Laser &amp; Inkjet Self-Adhesive

First use StringEscapeUtils#unescapeHtml4() (or #unescapeXml(), depending on the original format) to unescape the &amp; into a &. Then use String#replaceAll() with the pattern [^\x20-\x7e] to remove every character outside the printable ASCII range.

Summarized:

String clean = StringEscapeUtils.unescapeHtml4(dirty).replaceAll("[^\\x20-\\x7e]", "");

..which produces

Avery Laser & Inkjet Self-Adhesive

(without the trailing dot as in your example, but that wasn't present in the original ;) )
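StringEscapeUtils comes from Apache Commons Lang, so the one-liner above needs that library on the classpath. As a rough illustration of the same two steps using only the JDK, here is a minimal sketch. The class HtmlCleaner, its method names, and the tiny entity table are hypothetical stand-ins for unescapeHtml4(), which handles far more entities than this.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class HtmlCleaner {

    // Hypothetical, deliberately tiny entity table. A real unescaper such as
    // StringEscapeUtils.unescapeHtml4() covers many more entities.
    private static final Map<String, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("&lt;", "<");
        ENTITIES.put("&gt;", ">");
        ENTITIES.put("&quot;", "\"");
        // Handled last so that "&amp;lt;" becomes "&lt;" and is not unescaped twice.
        ENTITIES.put("&amp;", "&");
    }

    // Step 1: turn the few known entities back into plain characters.
    static String unescapeBasicEntities(String input) {
        String result = input;
        for (Map.Entry<String, String> entry : ENTITIES.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    // Step 2: drop everything outside the printable ASCII range 0x20-0x7e.
    static String clean(String dirty) {
        return unescapeBasicEntities(dirty).replaceAll("[^\\x20-\\x7e]", "");
    }

    public static void main(String[] args) {
        String dirty = "Avery\u00AE Laser &amp; Inkjet Self-Adhesive";
        System.out.println(clean(dirty));
    }
}
```

Note that the character class also strips legitimate non-ASCII text such as accented letters, so this only fits when plain ASCII output is really what is wanted.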

That said, this looks more like a request for a workaround than a request for a solution. If you elaborate on the functional requirement and/or where this string originated, we may be able to provide the right solution. The ® looks like it was caused by reading the string in with the wrong encoding, and the &amp; looks like it was caused by reading the string in with a text-based parser instead of a full-fledged HTML parser.
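To illustrate the encoding point, here is a small sketch of how a wrong charset can introduce stray characters. It assumes, purely for illustration, that the data was stored as UTF-8 and read as ISO-8859-1; the real cause in the asker's setup is unknown. The class EncodingDemo and its decode helper are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingDemo {

    // Decode raw bytes with an explicitly chosen charset instead of the platform default.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // The registered sign (U+00AE) takes two bytes in UTF-8: 0xC2 0xAE.
        byte[] utf8 = "Avery\u00AE".getBytes(StandardCharsets.UTF_8);

        // Wrong charset: each byte becomes its own character, so an extra
        // U+00C2 appears in front of the registered sign.
        String wrong = decode(utf8, StandardCharsets.ISO_8859_1);

        // Matching charset: the original six characters come back.
        String right = decode(utf8, StandardCharsets.UTF_8);

        System.out.println(wrong.length()); // 7
        System.out.println(right.length()); // 6
    }
}
```

If something like this is the cause, passing the correct charset wherever the string is read (file, socket, or HTTP response) avoids the stray character without having to strip anything afterwards.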