For argument's sake, let's assume an HTML parser.

I've read that it tokenizes everything first, and then parses it.

What does tokenize mean?

Does the parser read each character, building up a multidimensional array to store the structure?

For example, does it read a <, begin to capture the element, and then, once it meets a closing > (outside of an attribute), push it onto an array stack somewhere?

I'm interested for the sake of knowing (I'm curious).

If I were to read through the source of something like HTML Purifier, would that give me a good idea of how HTML is parsed?

Thanks.

+1  A: 

This exciting Wikipedia page will get you started with tokenisation.

r_
Ha, two other Wikipedia links posted at the same time :)
r_
And all three to different articles :-)
JLWarlow
+10  A: 

First of all, you should be aware that parsing HTML is particularly ugly -- HTML was in wide (and divergent) use before being standardized. This leads to all manner of ugliness, such as the standard specifying that some constructs aren't allowed, but then specifying required behavior for those constructs anyway.

Getting to your direct question: tokenization is roughly equivalent to taking English, and breaking it up into words. In English, most words are consecutive streams of letters, possibly including an apostrophe, hyphen, etc. Mostly words are surrounded by spaces, but a period, question mark, exclamation point, etc., can also signal the end of a word. Likewise for HTML (or whatever) you specify some rules about what can make up a token (word) in this language. The piece of code that breaks the input up into tokens is normally known as the lexer.
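To make that concrete, here is a rough sketch in Python of how such rules are often written as regular expressions. The token rules here are invented for a toy HTML-like language; a real HTML lexer handles far more (comments, entities, doctypes, CDATA, error recovery):

import re

# Toy token rules for an HTML-like language; a sketch only, not real HTML.
TOKEN_RULES = [
    ("TAG_OPEN", r"</?[a-zA-Z][a-zA-Z0-9]*"),  # "<p", "</p", "<html", ...
    ("TAG_END",  r">"),
    ("EQUALS",   r"="),
    ("STRING",   r'"[^"]*"'),
    ("TEXT",     r'[^<>="\s][^<>="]*'),        # a run of ordinary text
    ("SPACE",    r"\s+"),                      # matched, then discarded
]
# Combine the rules into one pattern; the group name says which rule matched.
MASTER = re.compile("|".join("(?P<%s>%s)" % rule for rule in TOKEN_RULES))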

At least in a normal case, you do not break all the input up into tokens before you start parsing. Rather, the parser calls the lexer to get the next token when it needs one. When it's called, the lexer looks at enough of the input to find one token, delivers that to the parser, and no more of the input is tokenized until the next time the parser needs more input.
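In Python terms, that calling pattern falls out naturally if the lexer is a generator. Continuing the toy sketch above (this illustrates the pull model only, not any particular parser's API):

def lex(source):
    """Pull-style lexer: each next() call scans only far enough
    in the input to hand the parser one more token."""
    for match in MASTER.finditer(source):  # finditer itself scans lazily
        if match.lastgroup != "SPACE":     # drop whitespace tokens
            yield match.lastgroup, match.group()

tokens = lex('<p style="x">hi</p>')
print(next(tokens))  # ('TAG_OPEN', '<p'); the rest is still unscanned
print(next(tokens))  # ('TEXT', 'style')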

In a general way, you're right about how a parser works, but (at least in a typical parser) it uses a stack during the act of parsing, and what it builds to represent the input is normally a tree (an Abstract Syntax Tree, aka AST), not a multidimensional array.

Given the complexity of parsing HTML, I'd hold off on reading an HTML parser until you've read through a few others first. If you do some looking around, you should be able to find a fair number of parsers/lexers for things like mathematical expressions, which are probably more suitable as an introduction (smaller, simpler, easier to understand, etc.).
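For instance, a minimal recursive-descent parser for arithmetic (a sketch written for this answer, not taken from any particular library) shows the whole lexer-plus-parser pipeline, including the stack (here, the call stack) and the resulting AST, in a few dozen lines:

import re

def tokenize(text):
    # Lexer: numbers, operators, and parentheses; whitespace is ignored.
    return re.findall(r"\d+|[-+*/()]", text)

class Parser:
    """Minimal recursive-descent parser for +, -, *, / and parentheses.
    Builds an AST of nested (operator, left, right) tuples."""
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def expr(self):    # expr := term (('+' | '-') term)*
        node = self.term()
        while self.peek() in ("+", "-"):
            node = (self.eat(), node, self.term())
        return node

    def term(self):    # term := factor (('*' | '/') factor)*
        node = self.factor()
        while self.peek() in ("*", "/"):
            node = (self.eat(), node, self.factor())
        return node

    def factor(self):  # factor := NUMBER | '(' expr ')'
        if self.peek() == "(":
            self.eat()
            node = self.expr()
            self.eat()  # the closing ')'
            return node
        return int(self.eat())

print(Parser(tokenize("2 + 3 * (4 - 1)")).expr())
# ('+', 2, ('*', 3, ('-', 4, 1)))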

Jerry Coffin
+1 thanks this is good reading.
alex
+9  A: 

Tokenizing can be composed of a few steps. For example, if you have this HTML code:

<html>
    <head>
        <title>My HTML Page</title>
    </head>
    <body>
        <p style="special">
            This paragraph has special style
        </p>
        <p>
            This paragraph is not special
        </p>
    </body>
</html>

the tokenizer may convert that string into a flat list of significant tokens, discarding whitespace (i.e. discarding insignificant tokens; a sketch of such a pass follows the list):

["<", "html", ">", 
     "<", "head", ">", 
         "<", "title", ">", "My HTML Page", "</", "title", ">",
     "</", "head", ">",
     "<", "body", ">",
         "<", "p", "style", "=", "\"", "special", "\"", ">",
            "This paragraph has special style",
        "</", "p", ">",
        "<", "p", ">",
            "This paragraph is not special",
        "</", "p", ">",
    "</", "body", ">",
"</", "html", ">"
]
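A first pass like that might look something like this sketch in Python (invented for this example; it only handles the constructs shown above):

import re

def tokenize(html):
    """First pass: split the raw string into the flat token list above.
    Inside a tag, whitespace separates tokens; outside a tag, a run of
    text is kept whole (with surrounding whitespace trimmed)."""
    tokens, in_tag = [], False
    for tok in re.findall(r'</|<|>|=|"|[^<>="]+', html):
        if tok in ("<", "</"):
            in_tag = True
            tokens.append(tok)
        elif tok == ">":
            in_tag = False
            tokens.append(tok)
        elif tok in ("=", '"'):
            tokens.append(tok)
        elif in_tag:
            tokens.extend(tok.split())   # tag innards split on whitespace
        elif tok.strip():
            tokens.append(tok.strip())   # non-empty text content
    return tokens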

that raw token list may then go through a second pass that converts it into a list of higher-level tokens (still a flat list; a sketch of this pass follows):

[("<html>", {}), 
     ("<head>", {}), 
         ("<title>", {}), "My HTML Page", "</title>",
     "</head>",
     ("<body>", {}),
        ("<p>", {"style": "special"}),
            "This paragraph has special style",
        "</p>",
        ("<p>", {}),
            "This paragraph is not special",
        "</p>",
    "</body>",
"</html>"
]
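That second pass might be sketched like this (again invented for this example, and assuming the well-formed token stream above):

def second_pass(tokens):
    """Second pass: fold the raw tokens into higher-level ones.
    Open tags become ("<name>", {attributes}) pairs, close tags
    become "</name>" strings, and text passes through unchanged."""
    out, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "<":                  # open tag: gather name and attributes
            name, i = tokens[i + 1], i + 2
            attrs = {}
            while tokens[i] != ">":     # attribute pattern: key = " value "
                attrs[tokens[i]] = tokens[i + 3]
                i += 5                  # skip key, '=', '"', value, '"'
            out.append(("<%s>" % name, attrs))
        elif tok == "</":               # close tag
            out.append("</%s>" % tokens[i + 1])
            i += 2                      # move to the trailing '>'
        else:
            out.append(tok)             # text content
        i += 1                          # step past '>' (or the text token)
    return out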

then the parser converts that list of high-level tokens into a tree (a sketch of this step appears after the tree):

("<html>", {}, [
    ("<head>", {}, [
        ("<title>", {}, ["My HTML Page"]),
    ]), 
    ("<body>", {}, [
        ("<p>", {"style": "special"}, ["This paragraph has special style"]),
        ("<p>", {}, ["This paragraph is not special"]),
    ]),
])

at this point, parsing is complete; it is then up to the user to interpret the tree, modify it, etc.
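The tree-building step is where a stack comes in: the parser keeps a stack of currently open elements, appends each new node to the children of the element on top, and pops when it sees a close tag. A sketch, using the hypothetical tokenize and second_pass helpers above:

def parse(tokens):
    """Build the tree from the high-level token list using a stack
    of open elements. Nodes are (tag, attributes, children) tuples."""
    root = ("<root>", {}, [])
    stack = [root]
    for tok in tokens:
        if isinstance(tok, tuple):      # open tag: ("<p>", {attrs})
            node = (tok[0], tok[1], [])
            stack[-1][2].append(node)
            stack.append(node)          # descend into the new element
        elif tok.startswith("</"):      # close tag: climb back out
            stack.pop()
        else:                           # text: attach to the open element
            stack[-1][2].append(tok)
    return root[2][0]                   # the single top-level element

# e.g.: tree = parse(second_pass(tokenize(source)))
# where `source` is the HTML string at the top of this answer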

Lie Ryan
+1 like the example
Yuval A
+1 for actually showing what tokenizing does
alex
+3  A: 

Don't miss the W3C's notes on parsing HTML5.

For an interesting introduction to scanning/lexing, take a look at Efficient Generation of Table-Driven Scanners. It shows how scanning is ultimately driven by automata theory. A collection of regular expressions is transformed into a single NFA (nondeterministic finite automaton). The NFA is then transformed into a DFA (deterministic finite automaton) to make state transitions deterministic. The paper then describes a method to transform the DFA into a transition table.

A key point: scanners use regular expression theory but likely don't use existing regular expression libraries. For better performance, state transitions are coded as giant case statements or in transition tables.
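As a toy illustration of the transition-table idea (the states, character classes, and table here are all made up for this example), here is a table-driven scanner that recognizes unsigned integers:

# States: 0 = start, 1 = in-number (accepting).
def char_class(ch):
    return "digit" if ch.isdigit() else "other"

TRANSITIONS = {
    (0, "digit"): 1,   # a digit moves us from start into the number
    (1, "digit"): 1,   # further digits keep us there
}                      # any missing entry means "no transition"

def scan_integer(text, pos=0):
    """Run the DFA from pos; return the lexeme if we stop in an accepting state."""
    state, start = 0, pos
    while pos < len(text):
        nxt = TRANSITIONS.get((state, char_class(text[pos])))
        if nxt is None:
            break                       # no transition: the token ends here
        state, pos = nxt, pos + 1
    return text[start:pos] if state == 1 else None

print(scan_integer("123abc"))  # "123"
print(scan_integer("abc"))     # None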

Scanners guarantee that correct words (tokens) are used. Parsers guarantee that the words are used in the correct combination and order. Scanners use regular expression and automata theory. Parsers use grammar theory, especially context-free grammars.

A couple of parsing resources:

Corbin March
+1 thanks for the W3C link. It looks like an informative (and long) read!
alex