I am currently attempting (or planning to attempt) to write a simple (as possible) program to parse an html document into a tree.
After googling I have found many answers saying "don't do it it's been done" (or words to that effect); and references to examples of HTML parsers; and also a rather emphatic article on why one shouldn't use Regular expresions. However I haven't found any guides on the "right" way to write a parser. (This, by the way, is something I'm attempting more as a learning exersise than anything so I'd quite like to do it rather than use a premade one)
I believe I could make a working XML parser just by reading the document and adding the tags/text etc. to the tree, stepping up a level whenever I hit a close tag (again, simple, no fancy threading or efficiency required at this stage.). However, for HTML not all tags are closed.
So my question is this: what would you recommend as a way of dealing with this? The only idea I've had is to treat it in a similar way as the XML but have a list of tags that aren't necessarily closed each with conditions for closure (e.g. <p> ends on </p> or next <p> tag).
Has anyone any other (hopefully better) suggestions? Is there a better way of doing this altogether?
The looseness of HTML can be accommodated by figuring out the missing open and close tags as needed. This is essentially what a validator like tidy does.
You'll keep a stack (perhaps implicitly with a tree) of the current context. For example, {<html>
, <body>
} means you're currently in the body of the html document. When you encounter a new node, you compare the requirements for that node to what's currently on the stack.
Suppose your stack is currently just {html
}. You encounter a <p>
tag. You look up <p>
in a table that tells you a paragraph must be inside the <body>
. Since you're not in the body, you implicitly push <body>
onto your stack (or add a body node to your tree). Then you can put the <p>
into the tree.
Now supposed you see another <p>
. Your rules tell you that you cannot nest a paragraph within a paragraph, so you know you have to pop the current <p>
off the stack (as though you had seen a close tag) before pushing the new paragraph onto the stack.
At the end of your document, you pop each remaining element off your stack, as though you had seen a close tag for each one.
The trick is to find a good way to represent the context requirements for each element.