Where can I find a good HTMLEditorKit tutorial/reference, which actually explains how to edit HTML documents?

Oren Shalev · Sep 21, 2009

My intention is to edit HTML documents, including modifying existing elements, deleting elements and inserting new ones.

I've read HTMLEditorKit's and related classes' documentation, as well as the relevant topic in Sun's Java Trail, yet there is very little information about actual HTML document manipulation. Most of the discussion and examples deal with reading and parsing HTML, not really editing it. Some Googling still did not yield an adequate solution, and trying to tackle the task with some coding trial and error mostly resulted in exceptions.

I've gone over related questions and answers here on SO, but most answers suggested some alternative, while I'm looking for a solution in the JDK. Perhaps HTMLEditorKit is of little use to non-Swing applications, and there is an alternative outside javax.swing?

Here are a few tasks I'd like to learn how to perform:

  • Replace text in certain text fields.
  • Basic editing (find/replace or regexes) of <script> elements.
  • Color the border of certain elements.
  • Remove certain tags entirely (for example flash elements).

Assuming that HTMLEditorKit is the best HTML editing component in the JDK, what tutorial or reference do you recommend?

Answer

Aaron Digulla · Sep 21, 2009

The HTMLEditorKit is not an HTML editor but an editor for document models, which lets you convert these models to and from HTML. The internal model of the editor kit is not "HTML" but is based on DefaultStyledDocument. What confuses you is that there is an HTMLDocument class. But that is just a thin subclass of DefaultStyledDocument, so that the model can be created from HTML and saved as HTML.
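To see the point about the model, here is a minimal sketch that reads a small HTML string through HTMLEditorKit and prints what the resulting document holds. The class name ModelDemo and the sample markup are made up for illustration. The text that comes back is styled-document content, not markup.

```java
import java.io.StringReader;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical demo class, not part of the JDK.
public class ModelDemo {

    // Parse HTML into the kit's document model and return the model's plain text.
    static String modelText(String html) throws Exception {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        kit.read(new StringReader(html), doc, 0);
        return doc.getText(0, doc.getLength());
    }

    // True if the kit's default document is a DefaultStyledDocument.
    static boolean isStyledDocument() {
        return new HTMLEditorKit().createDefaultDocument() instanceof DefaultStyledDocument;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isStyledDocument());
        System.out.println(modelText("<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```

The tags themselves are gone from the text; they survive only as element and attribute structure inside the styled document.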

What you need is an HTML parser. Try JTidy. It reads the HTML and builds an internal model, keeping things like <script> that HTMLEditorKit ignores. You can then use a DOM API to modify the model.
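As a rough sketch of the "modify the model with a DOM API" step, the following uses only the org.w3c.dom classes that ship with the JDK. It assumes the input is already well-formed XHTML with no DOCTYPE and no HTML-only entities, which is the kind of cleanup a parser such as JTidy is meant to do for real-world pages. The class name DomEditSketch, the replacement text and the border style are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch class; assumes well-formed XHTML input.
public class DomEditSketch {

    static String edit(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        // Remove every <object> element. The NodeList is live, so walk it backwards.
        NodeList objects = doc.getElementsByTagName("object");
        for (int i = objects.getLength() - 1; i >= 0; i--) {
            Element e = (Element) objects.item(i);
            e.getParentNode().removeChild(e);
        }

        // Replace the value of text inputs and give them a colored border.
        NodeList inputs = doc.getElementsByTagName("input");
        for (int i = 0; i < inputs.getLength(); i++) {
            Element in = (Element) inputs.item(i);
            if ("text".equals(in.getAttribute("type"))) {
                in.setAttribute("value", "new text");
                in.setAttribute("style", "border: 1px solid red");
            }
        }

        // Serialize the modified tree back to markup.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.METHOD, "xml");
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(edit("<html><body><input type=\"text\" value=\"old\"/>"
                + "<object data=\"movie.swf\"></object><p>kept</p></body></html>"));
    }
}
```

The JDK's XML parser rejects ordinary tag-soup HTML, so this only stands in for the editing step; the parsing step still needs a lenient HTML parser.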

That said, for many use cases, it's enough to filter the HTML with regular expressions or a simple string search and replace.
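For instance, removing Flash elements could look roughly like the sketch below. The class name RegexFilterSketch and the two patterns are illustrative assumptions, and they are not robust: nested <object> elements or a ">" inside an attribute value will trip them up.

```java
import java.util.regex.Pattern;

// Hypothetical sketch class; regex filtering only suits simple, predictable markup.
public class RegexFilterSketch {

    // Matches an <object ...> element up to the first closing </object> tag.
    private static final Pattern OBJECT = Pattern.compile(
            "<object\\b.*?</object\\s*>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Matches a standalone <embed ...> tag.
    private static final Pattern EMBED = Pattern.compile(
            "<embed\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    static String stripFlash(String html) {
        String withoutObjects = OBJECT.matcher(html).replaceAll("");
        return EMBED.matcher(withoutObjects).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripFlash(
                "<p>a</p><object data=\"x.swf\"><embed src=\"x.swf\"></object><p>b</p>"));
    }
}
```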