Jsoup - extracting text

Eugene Retunsky · Apr 16, 2012 · Viewed 12.2k times

I need to extract text from a node like this:

<div>
    Some text <b>with tags</b> might go here.
    <p>Also there are paragraphs</p>
    More text can go without paragraphs<br/>
</div>

And I need to build:

Some text <b>with tags</b> might go here.
Also there are paragraphs
More text can go without paragraphs

Element.text() returns all the text content of the div with the tags stripped. Element.ownText() returns only the text that is not inside child elements. Neither is what I need. Iterating through children() skips text nodes.

Is there a way to iterate over the contents of an element and receive text nodes as well? For example:

  • Text node - Some text
  • Node <b> - with tags
  • Text node - might go here.
  • Node <p> - Also there are paragraphs
  • Text node - More text can go without paragraphs
  • Node <br> - <empty>

Answer

Vadim Ponomarev · Apr 16, 2012

Element.children() returns an Elements object, which is a list of Element objects only, so text nodes are left out. The parent class, Node, has methods that give you access to arbitrary child nodes, not just Elements, such as Node.childNodes().

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

public class ChildNodesDemo {
    public static void main(String[] args) {
        String str = "<div>" +
                "    Some text <b>with tags</b> might go here." +
                "    <p>Also there are paragraphs</p>" +
                "    More text can go without paragraphs<br/>" +
                "</div>";

        Document doc = Jsoup.parse(str);
        Element div = doc.select("div").first();
        int i = 0;

        // childNodes() returns TextNodes as well as Elements, in document order
        for (Node node : div.childNodes()) {
            i++;
            System.out.println(String.format("%d %s %s",
                    i,
                    node.getClass().getSimpleName(),
                    node.toString()));
        }
    }
}

Result:

1 TextNode 
 Some text 
2 Element <b>with tags</b>
3 TextNode  might go here. 
4 Element <p>Also there are paragraphs</p>
5 TextNode  More text can go without paragraphs
6 Element <br/>
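
To get from this node list to the three lines asked for in the question, one possible approach is to walk childNodes() and collect inline content into a buffer, starting a new line at each block element or <br>. The following is only a sketch under assumptions: the class name DivLines and the helpers toLines and flush are made up for illustration, and treating <br> and block-level elements as line breaks is a guess at the desired output rules, not something jsoup provides out of the box.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class DivLines {

    // Hypothetical helper: turns the direct children of an element into lines.
    // Inline elements keep their markup; block elements and <br> end a line.
    static List<String> toLines(Element parent) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();

        for (Node node : parent.childNodes()) {
            if (node instanceof TextNode) {
                // Plain text between tags goes into the current line
                current.append(((TextNode) node).text());
            } else if (node instanceof Element) {
                Element el = (Element) node;
                if (el.tagName().equals("br")) {
                    // <br> only ends the current line
                    flush(current, lines);
                } else if (el.isBlock()) {
                    // A block element such as <p> becomes a line of its own
                    flush(current, lines);
                    String text = el.text().trim();
                    if (!text.isEmpty()) {
                        lines.add(text);
                    }
                } else {
                    // Inline element such as <b>: keep it with its tags
                    current.append(el.outerHtml());
                }
            }
        }
        flush(current, lines);
        return lines;
    }

    // Moves the buffered text into the result, skipping blank lines
    static void flush(StringBuilder current, List<String> lines) {
        String line = current.toString().trim();
        if (!line.isEmpty()) {
            lines.add(line);
        }
        current.setLength(0);
    }

    public static void main(String[] args) {
        String str = "<div>" +
                "    Some text <b>with tags</b> might go here." +
                "    <p>Also there are paragraphs</p>" +
                "    More text can go without paragraphs<br/>" +
                "</div>";

        Document doc = Jsoup.parse(str);
        // Turn off pretty-printing so outerHtml() adds no extra whitespace
        doc.outputSettings().prettyPrint(false);

        for (String line : toLines(doc.select("div").first())) {
            System.out.println(line);
        }
    }
}
```

This only looks at the direct children of the div. Nested block elements inside a <p> would be flattened by Element.text(), so deeper structures would need a recursive version.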