Replace HTML codes with equivalent characters in Java

Question 1


java pattern-matching matcher

Raja Asthana · Feb 21, 2013 · Viewed 37.8k times · Source

Answer


Also, is there any way to optimize this regex?

Yes: don't use a regex for this task. Use StringEscapeUtils from Apache Commons Lang instead:

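// Commons Lang 2.x package; in Commons Lang 3 the equivalent call is
// org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4(...)
import org.apache.commons.lang.StringEscapeUtils;
...
String withCharacters = StringEscapeUtils.unescapeHtml(yourString);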

JavaDoc says:

Unescapes a string containing entity escapes to a string containing the actual Unicode characters corresponding to the escapes. Supports HTML 4.0 entities.

For example, the string "&lt;Fran&ccedil;ais&gt;" will become "<Français>"

If an entity is unrecognized, it is left alone, and inserted verbatim into the result string. e.g. "&gt;&zzzz;x" will become ">&zzzz;x".
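
If adding a dependency is not an option and only hexadecimal numeric references such as &#x00E8; need decoding, a JDK-only version is also possible. The following is a minimal sketch, not a replacement for a full HTML unescaper: the class name HexEntityDecoder and the method decode are made up for illustration, named entities such as &amp; are deliberately not handled, and anything that does not parse as a valid code point is left untouched (mirroring the "left alone" behaviour described in the JavaDoc above).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decodes only hex numeric character references (&#x...;).
final class HexEntityDecoder {

    // "&#x" or "&#X", one or more hex digits (captured as group 1), then ";"
    private static final Pattern HEX_REF = Pattern.compile("&#[xX]([0-9A-Fa-f]+);");

    static String decode(String input) {
        Matcher m = HEX_REF.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement;
            try {
                int codePoint = Integer.parseInt(m.group(1), 16);
                // Keep the original text if the number is not a Unicode code point.
                replacement = Character.isValidCodePoint(codePoint)
                        ? new String(Character.toChars(codePoint))
                        : m.group();
            } catch (NumberFormatException e) {
                // Too many hex digits to fit in an int: keep the original text.
                replacement = m.group();
            }
            // quoteReplacement stops '$' and '\' from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("&#x00E8; &#xAE; &#x0026; &#x00F1; &#x26;"));
    }
}
```

A single capturing group with a + quantifier is enough here; the four repeated [\\d|\\w]* groups in the question's pattern all compete for the same characters, and \\w also accepts letters that are not hex digits.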

Question 2

I'm currently working on replacing HTML codes with their equivalent characters in Java. I need to convert the codes below to characters.

&#x00E8; - è
&#xAE;   - ®
&#x0026; - &
&#x00F1; - ñ
&#x26;   - &

I tried using the regex pattern

(&#x)([\\d|\\w]*)([\\d|\\w]*)([\\d|\\w]*)([\\d|\\w]*)(;)

When I debug, matcher.find() returns true, but control skips the loop where I have written the conversion code. I don't know what is happening there.

Also, is there any way to optimize this regex?

Any help is appreciated.
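
One possible explanation for the skipped loop, offered as a guess because the loop code itself is not shown: Matcher.find() is stateful. Each successful call moves the matcher past the match it just found, so evaluating matcher.find() once in a debugger (or in an if before the loop) consumes the only match, and the loop's own find() then returns false. The sketch below demonstrates that behaviour; the class name FindIsStateful, the helper countMatches, and the simplified pattern are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class showing that find() advances the matcher.
final class FindIsStateful {

    // Counts the matches that are still ahead of the matcher's current position.
    static int countMatches(Matcher m) {
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile("&#x([0-9A-Fa-f]+);").matcher("caf&#x00E9;");

        boolean peek = m.find();              // e.g. evaluated in a debugger watch
        System.out.println(peek);             // the single match is found here
        System.out.println(countMatches(m));  // nothing is left for the loop

        m.reset();                            // rewind to the start of the input
        System.out.println(countMatches(m));  // the match is visible again
    }
}
```

If this is the cause, calling reset() before the loop, or simply not calling find() outside the loop condition, should make the loop body run.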

Exception

java.lang.NumberFormatException: For input string: "x26"
      at java.lang.NumberFormatException.forInputString(Unknown Source)
      at java.lang.Integer.parseInt(Unknown Source)
      at java.lang.Integer.parseInt(Unknown Source)
      at org.apache.commons.lang.Entities.unescape(Entities.java:683)
      at org.apache.commons.lang.StringEscapeUtils.unescapeHtml(StringEscapeUtils.java:483)
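
The message For input string: "x26" shows that Integer.parseInt received the reference text with its x prefix still attached and tried to read it as a decimal number. Why the library version in use did that is not stated here, so treat any diagnosis as an assumption. The sketch below only illustrates the difference between the failing decimal parse and a radix-16 parse of the digits after the x; the class name HexParseDemo and both method names are hypothetical.

```java
// Hypothetical demo class contrasting decimal and hexadecimal parsing.
final class HexParseDemo {

    // Parses the hex digits that follow "&#x" and returns the matching character(s).
    static String decodeHexDigits(String hexDigits) {
        int codePoint = Integer.parseInt(hexDigits, 16);
        return new String(Character.toChars(codePoint));
    }

    // Reports whether Integer.parseInt accepts the text as a decimal number.
    static boolean parsesAsDecimal(String text) {
        try {
            Integer.parseInt(text);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(parsesAsDecimal("x26"));  // false: 'x' is not a decimal digit
        System.out.println(decodeHexDigits("26"));   // 0x26 is the ampersand
    }
}
```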
