For argument's sake lets assume a HTML parser.
I've read that it tokenizes everything first, and then parses it.
What does tokenize mean?
Does the parser read every character each, building up a multi dimensional array to store the structure?
For example, does it read a <
and then begin to capture the element, and then once it meets a closing >
(outside of an attribute) it is pushed onto a array stack somewhere?
I'm interested for the sake of knowing (I'm curious).
If I were to read through the source of something like HTML Purifier, would that give me a good idea of how HTML is parsed?
Tokenizing can be composed of a few steps, for example, if you have this html code:
<html>
<head>
<title>My HTML Page</title>
</head>
<body>
<p style="special">
This paragraph has special style
</p>
<p>
This paragraph is not special
</p>
</body>
</html>
the tokenizer may convert that string to a flat list of significant tokens, discarding whitespaces (thanks, SasQ for the correction):
["<", "html", ">",
"<", "head", ">",
"<", "title", ">", "My HTML Page", "</", "title", ">",
"</", "head", ">",
"<", "body", ">",
"<", "p", "style", "=", "\"", "special", "\"", ">",
"This paragraph has special style",
"</", "p", ">",
"<", "p", ">",
"This paragraph is not special",
"</", "p", ">",
"</", "body", ">",
"</", "html", ">"
]
there may be multiple tokenizing passes to convert a list of tokens to a list of even higher-level tokens like the following hypothetical HTML parser might do (which is still a flat list):
[("<html>", {}),
("<head>", {}),
("<title>", {}), "My HTML Page", "</title>",
"</head>",
("<body>", {}),
("<p>", {"style": "special"}),
"This paragraph has special style",
"</p>",
("<p>", {}),
"This paragraph is not special",
"</p>",
"</body>",
"</html>"
]
then the parser converts that list of tokens to form a tree or graph that represents the source text in a manner that is more convenient to access/manipulate by the program:
("<html>", {}, [
("<head>", {}, [
("<title>", {}, ["My HTML Page"]),
]),
("<body>", {}, [
("<p>", {"style": "special"}, ["This paragraph has special style"]),
("<p>", {}, ["This paragraph is not special"]),
]),
])
at this point, the parsing is complete; and it is then up to the user to interpret the tree, modify it, etc.