Parse HTML data in Java including &lt; and &gt; tags?

java html-parsing htmleditorkit

Deepu · Dec 17, 2012 · Viewed 12.5k times · Source

I want to parse HTML text in Java.

I have tried parsing HTML with javax.swing.text.html.HTMLEditorKit, and it let me extract data from ordinary HTML. However, some of my HTML data looks like this:

&lt;span class="TitleServiceChange" &gt;Service Change&lt;/span&gt;
                    &lt;span class="DateStyle"&gt;
                    &amp;nbsp;Posted:&amp;nbsp;12/16/2012&amp;nbsp; 8:00PM
                    &lt;/span&gt;&lt;br/&gt;&lt;br/&gt;
                  &lt;P&gt;

with '&lt;' and '&gt;' in place of '<' and '>'.

When I parse the above text, I get this error:

Parsing error: start.missing body ? ? at

Please suggest how I can resolve this problem. Thanks in advance.

Answer

The parser fails because the markup is escaped, so unescape the text before handing it to the parser. To unescape the full set of escaped characters in a string, you can use the Apache Commons Lang utility library.

Specifically, the StringEscapeUtils class provides the unescapeHtml4 method, among others.
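
As a rough illustration of the unescape-then-parse idea, here is a minimal sketch that uses only the JDK. The class name UnescapeThenParse and the helpers unescapeBasic and extractText are hypothetical names made up for this example. unescapeBasic is only a stand-in that handles four entities; with Commons Lang 3 on the classpath you would call StringEscapeUtils.unescapeHtml4(escaped) instead. The sample string is a shortened version of the data in the question.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class UnescapeThenParse {

    // Hypothetical stand-in for StringEscapeUtils.unescapeHtml4.
    // It covers only &lt; &gt; &quot; and &amp;, not the full entity set.
    static String unescapeBasic(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&"); // last, so "&amp;lt;" becomes "&lt;", not "<"
    }

    // Feeds real (unescaped) HTML to the Swing HTML parser and collects the text runs.
    static String extractText(String html) {
        StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data).append('\n');
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected for an in-memory StringReader.
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String escaped =
              "&lt;span class=\"TitleServiceChange\" &gt;Service Change&lt;/span&gt;"
            + "&lt;span class=\"DateStyle\"&gt;"
            + "&amp;nbsp;Posted:&amp;nbsp;12/16/2012&amp;nbsp; 8:00PM"
            + "&lt;/span&gt;";

        // One unescape pass turns "&lt;" into "<" and "&amp;nbsp;" into "&nbsp;".
        String html = unescapeBasic(escaped);
        System.out.println(html);

        // The parser now sees real tags and resolves "&nbsp;" itself.
        System.out.println(extractText(html));
    }
}
```

Note that the data in the question is escaped once: a single pass yields markup that still contains &nbsp;, which the HTML parser can then interpret on its own. Unescaping twice would turn entities inside the text into literal characters before parsing, which is usually not what you want.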
