Parse HTML data in Java including &lt and &gt tags?

Deepu · Dec 17, 2012 · Viewed 12.5k times

I want to parse HTML text in Java.

I tried parsing the HTML with javax.swing.text.html.HTMLEditorKit, and it let me extract data from the HTML. However, some of my HTML data looks like this:

<span class="TitleServiceChange" >Service Change</span>
                    <span class="DateStyle">
                     Posted: 12/16/2012  8:00PM
                    </span><br/><br/>
                  <P>

but with '&lt' and '&gt' around the tag names instead of '<' and '>'.

While parsing the above text, I get this error:

Parsing error: start.missing body ? ? at

Please suggest how I can resolve this problem. Thanks in advance.

Answer

Tomas Narros · Dec 17, 2012

To unescape the full set of escaped characters in a string, you can use the Apache Commons Lang utility library.

Specifically, the StringEscapeUtils class provides the unescapeHtml4 method, among others. Unescape the text first, then pass the result to your HTML parser.
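As a rough illustration of the unescape-then-parse flow, here is a minimal JDK-only sketch. The class name HtmlUnescapeSketch and the helpers unescapeBasic and extractText are hypothetical names made up for this example. unescapeBasic is only a stand-in that handles a handful of semicolon-terminated entities; with Commons Lang on the classpath you would presumably call StringEscapeUtils.unescapeHtml4(escaped) there instead, which covers the full HTML 4 entity set.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class HtmlUnescapeSketch {

    // Hypothetical stand-in for StringEscapeUtils.unescapeHtml4.
    // Assumption: only these five entities occur, each terminated by ';'.
    static String unescapeBasic(String escaped) {
        return escaped
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&#39;", "'")
                .replace("&amp;", "&"); // done last so "&amp;lt;" yields "&lt;", not "<"
    }

    // Collects the text nodes of an HTML fragment with the Swing HTML parser.
    static String extractText(String html) throws IOException {
        StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data).append('\n');
            }
        };
        // Third argument: ignore any charset declaration found in the markup.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        String escaped =
                "&lt;span class=\"TitleServiceChange\"&gt;Service Change&lt;/span&gt;";
        String html = unescapeBasic(escaped); // now real markup with '<' and '>'
        System.out.println(html);
        System.out.println(extractText(html));
    }
}
```

The point of the ordering is that the parser only ever sees real markup: unescaping happens once, up front, and the existing HTMLEditorKit-based parsing code stays unchanged.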