views: 832

answers: 5

I see questions every day asking how to parse or extract something from some HTML string, and the first answer/comment is always "Don't use RegEx to parse HTML, lest you feel the wrath!" (that last part is sometimes omitted).

This is rather confusing to me; I always thought that, in general, the best way to parse any complicated string is to use a regular expression. So how does an HTML parser work? Doesn't it use regular expressions to parse?

One particular argument for using a regular expression is that a parsing alternative isn't always available (in JavaScript, for example, DOMDocument isn't a universally available option). jQuery, for instance, seems to manage just fine using a regex to convert an HTML string to DOM nodes.

Not sure whether or not to CW this; it's a genuine question that I want answered, not really intended as a discussion thread.

+2  A: 

If you want a 100% solution, you need to write your own custom code that iterates through the HTML character by character, with a tremendous amount of logic to determine when the current node should end and the next should begin.

The reason is that this is valid HTML:

<ul>
<li>One
<li>Two
<li>Three
</ul>

But so is this:

<ul>
<li>One</li>
<li>Two</li>
<li>Three</li>
</ul>

If you are OK with a "90% solution", then using an XML parser to load a document is fine. Or using a regex (though the XML route is easier if you are the master of the content).
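The difference between the two snippets above is exactly where a naive regex falls down. A quick sketch in Python (illustrative only; the thread isn't language-specific):

```python
import re

# Two equally valid renderings of the same list: the end tag
# for <li> is officially optional in HTML 4.01.
with_end_tags = "<ul><li>One</li><li>Two</li><li>Three</li></ul>"
without_end_tags = "<ul><li>One<li>Two<li>Three</ul>"

pattern = re.compile(r"<li>(.*?)</li>")

print(pattern.findall(with_end_tags))     # ['One', 'Two', 'Three']
print(pattern.findall(without_end_tags))  # [] -- no </li> anywhere, so no matches
```

Both inputs mean the same thing to a browser, but the regex only "sees" one of them.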

Timothy Khouri
Sure, but wouldn't iterating through the document character by character, testing tremendous amounts of logic, be rather slow? Naturally, a browser is a 100% solution (although I guess not, because they all parse differently, but close enough); do they iterate through the document like that?
Andy E
An XML parser is more like a 1% solution. The number of HTML documents that are well formed XML is tiny.
David Dorward
Yes, they do... don't take "character by character" literally, as you can stream things. But my point is that you have to write your own parser. New-age programmers are not used to writing that kind of code... we're used to "HtmlDocumentUtility.Load" and stuff like that :)
Timothy Khouri
David, that's not true... like I said, if you are the master of the content then it can be a 100% solution.
Timothy Khouri
I agree, and would add that the second solution using an XML parser is not safe when dealing with public HTML, since a large number of HTML pages are not valid XML and would not load properly in the parser. Usually HTML parsers use both XML parsers and regexes, depending on the parsing requirements and on the HTML source provided.
snowflake
Also, Andy... no, it's not slow. If you've ever written your own video game, you'd probably have written your own "map file" with terrain height, custom cell logic, creature spawn spots, etc. Those files can be several megs, and they load in 2 milliseconds.
Timothy Khouri
@Andy E: Regexes are not magic, they also work character by character, like any other kind of parsing, or heck, any other string function.
Bart van Heukelom
BTW: Your first example is not just "semi-valid HTML". It's actually valid HTML 4.01 Strict. You can use e.g. the W3C validator to verify this. The closing tag is officially optional for <li> (see the HTML 4 spec).
sleske
@Bart: good point, sometimes my brain forgets all logic and thinks things work by magic.
Andy E
The need to handle malformed input is only of peripheral significance to why regular expressions are inappropriate.
Svante
+24  A: 

Usually by using a tokeniser. The draft HTML5 specification has an extensive algorithm for handling "real world HTML".
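As a concrete illustration of the tokeniser approach (a hedged sketch in Python; its standard-library `html.parser` is event-based and tolerant of missing end tags, though it does not implement the full HTML5 algorithm):

```python
from html.parser import HTMLParser

# The parser tokenises the markup and fires an event per token;
# note that it happily accepts <li> items with no end tags.
class TagCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

collector = TagCollector()
collector.feed("<ul><li>One<li>Two<li>Three</ul>")
print(collector.tags)  # ['ul', 'li', 'li', 'li']
```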

David Dorward
Good find... to quote "To handle these cases, parsers have a script nesting level, which must be initially set to zero, and a parser pause flag, which must be initially set to false." - In other words, you must iterate it yourself and have lots of custom logic :P
Timothy Khouri
Upvote. It's better to emphasize algorithmic complexity instead of some technology.
Arnis L.
Iterating it yourself with lots of custom logic isn't such a great idea. Use a library that supports the standard algorithm if you can. e.g. http://search.cpan.org/~tobyink/HTML-HTML5-Parser-0.03/lib/HTML/HTML5/Parser.pm / http://code.google.com/p/html5lib/
David Dorward
@David - That's true, but his question was "how are HTML parsers written"... so I answered the question (which you did more clearly by saying 'usually by using a tokeniser').
Timothy Khouri
@David: Thanks for the link, I'm not sure why it didn't occur to me to actually read the specification. Would an XML parser work in much the same way, and would it be safe to use RegEx to parse XML?
Andy E
The primary problem with HTML parsers is that upon encountering an error, you can't just spit out "Parse error" and leave it at that. You enter quirks mode and try to make the best of the mess you encountered, including mismatched tags, [{]} style interlacing, and all kinds of weirdness, trying to make the result look as good as possible and the inevitable failure the least painful... this is not something you can do with regexes.
SF.
@Timothy K: 'Note: Because of the way this algorithm causes elements to change parents, it has been dubbed the "adoption agency algorithm" (in contrast with other possible algorithms for dealing with misnested content, which included the "incest algorithm", the "secret affair algorithm", and the "Heisenberg algorithm").'
JXG
@Andy E: you can't use regexes to parse XML, although (I haven't read the spec) in certain cases you may be able to use a simpler computational model than a whole lexer + parser with the bells and whistles. I explain why in my answer (http://stackoverflow.com/questions/2400623/if-youre-not-supposed-to-use-regular-expressions-to-parse-html-then-how-are-htm/2400852#2400852).
JXG
+11  A: 

Regular expressions are just one form of parser. An honest-to-goodness HTML parser will be significantly more complicated than can be expressed in regexes, using recursive descent, prediction, and several other techniques to properly interpret the text. If you really want to get into it, you might check out lex & yacc and similar tools.

The prohibition against using regexes for HTML parsing should probably be written more correctly as: "Don't use naive regular expressions to parse HTML..." (lest ye feel the wrath) "...and treat the results with caution." For certain specific goals, a regex may well be perfectly adequate, but you need to be very careful to be aware of the limitations of your regex and as cautious as is appropriate to the source of the text you're parsing (e.g., if it's user input, be very careful indeed).
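For instance, a deliberately narrow regex can be fine when you control the markup and know its shape. A hedged Python sketch (the pattern below will miss single-quoted or unquoted attributes, and can be fooled by `href="..."` inside comments or scripts; know your input):

```python
import re

# Pull href values out of double-quoted anchor attributes in
# markup whose shape you control.
html = '<p><a href="https://example.com">one</a> <a href="/two">two</a></p>'
hrefs = re.findall(r'<a\s+[^>]*href="([^"]*)"', html)
print(hrefs)  # ['https://example.com', '/two']
```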

T.J. Crowder
+1, a good answer. I must admit, I've used regexes before even when I wasn't in control of the HTML, but not in any sort of publicly released application. I did "feel the wrath", too, because it was naive. But that was a long time ago :-)
Andy E
+38  A: 

So how does a HTML parser work? Doesn't it use regular expressions to parse?

Well, no.

If you reach back in your brain to a theory of computation course, if you took one, or a compilers course, or something similar, you may recall that there are different kinds of languages and computational models. I'm not qualified to go into all the details, but I can review a few of the major points with you.

The simplest type of language & computation (for these purposes) is a regular language. These can be generated with regular expressions, and recognized with finite automata. Basically, that means that "parsing" strings in these languages uses state, but no auxiliary memory. HTML is certainly not a regular language: if you think about it, tags can be nested arbitrarily deeply. For example, tables can contain tables, and each table can contain lots of nested tags. With regular expressions, you may be able to pick out a pair of tags, but certainly not anything arbitrarily nested.

A classic simple language that is not regular is correctly matched parentheses. Try as you might, you will never be able to build a regular expression (or finite automaton) that will always work. You need memory to keep track of the nesting depth.

A state machine with a stack for memory is the next step up in computational power. This is called a push-down automaton, and it recognizes languages generated by context-free grammars. Here, we can recognize correctly matched parentheses; indeed, a stack is the perfect memory model for it.
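To make the point concrete, here is a minimal Python sketch (for a single parenthesis type a counter suffices; multiple bracket kinds or nested tags need a real stack):

```python
# Matched parentheses need memory of the current nesting depth,
# which plain regular expressions do not have.
def balanced(s):
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' with no matching '('
                return False
    return depth == 0

print(balanced("((()))"))  # True
print(balanced("(()"))     # False
```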

Well, is this good enough for HTML? Sadly, no. Maybe for super-duper carefully validated XML, actually, in which all the tags always line up perfectly. In real-world HTML, you can easily find snippets like <b><i>wow!</b></i>. This obviously doesn't nest, so in order to parse it correctly, a stack is just not powerful enough.
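You can see the strictness gap directly: a conforming XML parser rejects that misnested snippet outright, which is exactly why a pure stack-based model isn't enough for real-world HTML. A small Python demonstration using the standard library:

```python
import xml.etree.ElementTree as ET

# The misnested snippet above is not well-formed XML,
# so a strict XML parser refuses it.
try:
    ET.fromstring("<b><i>wow!</b></i>")
except ET.ParseError as err:
    print("XML parser refuses it:", err)
```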

The next level of computation is languages generated by general grammars, and recognized by Turing machines. This is generally accepted to be effectively the strongest computational model there is--a state machine, with auxiliary memory, whose memory can be modified anywhere. This is what programming languages can do. This is the level of complexity where HTML lives.

To summarize everything here in one sentence: to parse general HTML, you need a real programming language, not a regular expression.

HTML is parsed the same way other languages are parsed: lexing and parsing. The lexing step breaks down the stream of individual characters into meaningful tokens. The parsing step assembles the tokens, using states and memory, into a logically coherent document that can be acted on.
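The two steps can be sketched for a toy, well-formed tag language (a hedged Python illustration, not real HTML: no attributes, no optional end tags, no error recovery):

```python
import re

# Lexer: split the character stream into (kind, value) tokens.
TOKEN = re.compile(r"</(\w+)>|<(\w+)>|([^<]+)")

def lex(source):
    for close, open_, text in TOKEN.findall(source):
        if close:
            yield ("close", close)
        elif open_:
            yield ("open", open_)
        else:
            yield ("text", text)

# Parser: assemble tokens into a tree, using a stack for state.
def parse(tokens):
    root = {"tag": "#root", "children": []}
    stack = [root]
    for kind, value in tokens:
        if kind == "open":
            node = {"tag": value, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif kind == "close":
            stack.pop()
        else:
            stack[-1]["children"].append(value)
    return root

tree = parse(lex("<ul><li>One</li><li>Two</li></ul>"))
print(tree["children"][0]["tag"])  # 'ul'
```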

JXG
+1 excellent explanation
rogeriopvl
+1 great explanation, you definitely explained it well. :)
Daniel15
A: 

Parsing HTML is the transformation of linear text into a tree structure. Regular expressions cannot, in general, handle tree structures. The regular expression you need at each point to get the next token changes all the time. You can use regular expressions in a parser, but you will need a whole array of regular expressions for each possible state of parsing.
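A minimal sketch of that "array of regexes per state" idea, in Python (illustrative only; real tokenisers such as the HTML5 one have dozens of states):

```python
import re

# Which regex applies depends on where the tokeniser currently is:
# inside a tag, or in ordinary text.
STATES = {
    "data":   re.compile(r"[^<]+|<"),  # text, or the start of a tag
    "in_tag": re.compile(r"[^>]+|>"),  # tag contents, or its end
}

def tokenize(source):
    state, pos, tokens = "data", 0, []
    while pos < len(source):
        match = STATES[state].match(source, pos)
        chunk = match.group()
        if state == "data" and chunk == "<":
            state = "in_tag"
        elif state == "in_tag" and chunk == ">":
            state = "data"
        else:
            tokens.append((state, chunk))
        pos = match.end()
    return tokens

print(tokenize("<b>bold</b>"))
# [('in_tag', 'b'), ('data', 'bold'), ('in_tag', '/b')]
```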

Svante