How to change HTML tag content in Java?

bugisoft picture bugisoft · Dec 19, 2009 · Viewed 20k times · Source

How can I change HTML content of tag in Java? For example:

before:

<html>
    <head>
    </head>
    <body>
        <div>text<div>**text**</div>text</div>
    </body>
</html>

after:

<html>
    <head>
    </head>
    <body>
        <div>text<div>**new text**</div>text</div>
    </body>
</html>

I tried JTidy, but it doesn't support getTextContent. Is there any other solution?


Thanks, I want parse no well-formed HTML. I tried TagSoup, but when I have this code:

<body>
sometext <div>text</div>
</body>

and I want change "sometext" to "someAnotherText," and when I use {bodyNode}.getTextContent() it gives me: "sometext text"; when I use setTextContet("someAnotherText"+{bodyNode}.getTextContent()), and serialize these structure, the result is <body>someAnotherText sometext text</body>, without <div> tags. This is a problem for me.

Answer

Pascal Thivent picture Pascal Thivent · Dec 19, 2009

Unless you are absolutely sure that the HTML will be valid and well formed, I'd strongly recommend to use an HTML parser, something like TagSoup, Jericho, NekoHTML, HTML Parser, etc, the two first being especially powerful to parse any kind of crap :)

For example, with HTML Parser (because the implementation is very easy), using a visitor, provide your own NodeVisitor:

public class MyNodeVisitor extends NodeVisitor {
    public MyNodeVisitor() {
    }

    public void visitStringNode (Text string)
    {
        if (string.getText().equals("**text**")) {
            string.setText("**new text**");
        }
    }
}

Then, create a Parser, parse the HTML string and visit the returned node list:

Parser parser = new Parser(htmlString);
NodeList nl = parser.parse(null);
nl.visitAllNodesWith(new MyNodeVisitor());
System.out.println(nl.toHtml());

This is just one way to implement this, pretty straight forward.