views: 2709 | answers: 10

It seems like every question on Stack Overflow where the asker is using regex to grab some information from HTML inevitably has an "answer" that says not to use regex to parse HTML.

Why not? I'm aware that there are "real" HTML parsers out there like Beautiful Soup, and I'm sure they're powerful and useful, but if you're just doing something simple, quick, or dirty, then why bother using something so complicated when a few regex statements will work just fine?

Moreover, is there just something fundamental that I don't understand about regex that makes them a bad choice for parsing in general?

+3  A: 

Two quick reasons:

  • writing a regex that can stand up to malicious input is hard; way harder than using a prebuilt tool
  • writing a regex that can work with the ridiculous markup that you will inevitably be stuck with is hard; way harder than using a prebuilt tool

Regarding the suitability of regexes for parsing in general: they aren't suitable. Have you ever seen the sorts of regexes you would need to parse most languages?
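
For example, a minimal sketch of the second bullet (Python; the naive pattern and the sloppy markup are made up):

    import re

    # Naive pattern: assumes lowercase tags, double quotes, href first, no extras.
    naive_href = re.compile(r'<a href="([^"]+)">')

    # The "ridiculous markup" you actually get: uppercase tag, single quotes,
    # extra attributes, stray whitespace. Browsers accept it; the regex doesn't.
    html = "<A CLASS=link HREF='/page?id=1' >click</A>"

    print(naive_href.findall(html))   # [] -- the link is silently missed

A prebuilt parser absorbs all of that variation for you; a regex has to anticipate every variant by hand.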

Hank Gay
+3  A: 

Because there are many ways to "screw up" HTML that browsers will still treat in a rather liberal way. It would take quite some effort to reproduce that liberal behaviour and cover all the cases with regular expressions, so your regex will inevitably fail on some special cases, and that could introduce serious security gaps in your system.
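
A hypothetical illustration of that (Python; the naive filter and the payloads are made up): a regex written to strip <script> blocks quietly lets through every variant it didn't anticipate, and the browser accepts them all.

    import re

    def strip_scripts(html):
        # Naive sanitizer: only matches a bare, lowercase <script>...</script> pair.
        return re.sub(r'<script>.*?</script>', '', html, flags=re.DOTALL)

    payloads = [
        '<SCRIPT>alert("xss")</SCRIPT>',                   # uppercase tag
        '<script >alert("xss")</script>',                  # stray space in the tag
        '<script src="//evil.example/x.js"></script>',     # attributes
    ]
    for p in payloads:
        print(strip_scripts(p))   # every payload survives untouched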

DrJokepu
Very true, the majority of HTML out there seems to be horrible. I don't understand how a failing regular expression can introduce serious security gaps. Can you give an example?
ntownsend
ntownsend: For instance, you think you have stripped all the script tags from the HTML but your regex fails to cover a special case (one that, let's say, only works on IE6): boom, you have an XSS vulnerability!
DrJokepu
This was a strictly hypothetical example since most real-world examples are too complicated to fit into these comments, but you could find a few with some quick googling on the subject.
DrJokepu
Helpful. Thanks!
ntownsend
+1 for mentioning the security angle. When you're interfacing with the entire internet you can't afford to write hacky "works most of the time" code.
j_random_hacker
+2  A: 

The problem is that most users who ask a question involving HTML and regex do so because they can't get a regex of their own to work. Then it is worth asking whether everything would be easier with a DOM or SAX parser or something similar, since those are optimized and built for the purpose of working with XML-like document structures.

Sure, there are problems that can be solved easily with regular expressions. But the emphasis is on easily.

If you just want to find all URLs that look like http://.../ you're fine with regexps. But if you want to find all URLs that are in an <a> element with the class 'mylink', you're probably better off using an appropriate parser.
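
A rough sketch of that contrast (Python; assumes the third-party Beautiful Soup library and made-up markup):

    import re
    from bs4 import BeautifulSoup   # pip install beautifulsoup4

    html = ('<p><a class="mylink" href="http://example.com/a">a</a>'
            '<a href="http://example.com/b">b</a></p>')

    # Easy with a regex: grab anything that merely looks like a URL.
    print(re.findall(r'http://[^\s"<>]+', html))
    # ['http://example.com/a', 'http://example.com/b']

    # Structural question ("links with class mylink"): let a parser answer it.
    soup = BeautifulSoup(html, 'html.parser')
    print([a['href'] for a in soup.find_all('a', class_='mylink')])
    # ['http://example.com/a']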

okoman
+16  A: 

For quick'n'dirty work a regexp will do fine. But the fundamental thing to know is that it is impossible to construct a regexp that will correctly parse HTML.

The reason is that regexps can't handle arbitrarily nested expressions. See Can regular expressions be used to match nested patterns?
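
A small demonstration of the nesting problem (Python; the markup is made up):

    import re

    html = '<div>outer <div>inner</div> tail</div>'

    # Non-greedy: stops at the *first* </div>, cutting the outer element short.
    print(re.findall(r'<div>(.*?)</div>', html))   # ['outer <div>inner']

    # Greedy: runs to the *last* </div>; neither variant tracks nesting depth.
    print(re.findall(r'<div>(.*)</div>', html))    # ['outer <div>inner</div> tail']

Whichever way you tune the quantifier, the regex is pairing tags by position, not by depth.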

kmkaplan
That's it in a nutshell. Wouldn't hurt to mention why though -- namely, because regexes can't support arbitrarily nested patterns.
j_random_hacker
Yes, can you please give some more details on why it is impossible?
ntownsend
@j_random_hacker: added referencing another Stackoverflow answer.
kmkaplan
@kmkaplan: Good stuff!
j_random_hacker
+35  A: 

Parsing HTML in its entirety is not possible with regular expressions, since it depends on matching each opening tag with its corresponding closing tag, which is not possible with regexps.

Regular expressions can only match regular languages, but HTML is a context-free language. The only thing you can do with regexps on HTML is heuristics, and those will not work in every case. It should be possible to present an HTML file that will be matched wrongly by any given regular expression.
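
One way to see it concretely (Python; a toy pattern, not a real HTML matcher): you can handcraft a regex for any *fixed* nesting depth, but there is always a deeper, perfectly valid document that it rejects.

    import re

    # Handcrafted to accept <b> elements nested at most two levels deep.
    depth2 = re.compile(r'<b>(?:[^<]|<b>[^<]*</b>)*</b>')

    print(bool(depth2.fullmatch('<b>x<b>y</b>z</b>')))          # True  (depth 2)
    print(bool(depth2.fullmatch('<b>x<b>y<b>z</b></b></b>')))   # False (depth 3)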

Johannes Weiß
Best answer so far. If regexes can only match regular grammars, then we would need an infinitely large regexp to parse a context-free grammar like HTML. I love it when these things have clear theoretical answers.
ntownsend
I assumed we were discussing Perl-type regexes where they aren't actually regular expressions.
Hank Gay
What is it that makes Perl-type regular expressions not actual regular expressions?
ntownsend
Excellent response. +1 for both you and the OP.
Alex Barrett
ntownsend: They can refer to previously-matched parts later in the regexp, among other things. I'm not entirely sure WHERE they end up in the automaton hierarchy, though.
Vatine
Actually, .NET regular expressions can match opening with closing tags, to some extent, using balancing groups and a carefully crafted expression. Containing _all_ of that in a regexp is still crazy of course; it would look like the great code Cthulhu and would probably summon the real one as well. And in the end it still won't work for all cases. They say that if you write a regular expression that can correctly parse any HTML, the universe will collapse onto itself.
Alex Paven
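
(For the curious, a rough analogue of that trick in Python's third-party regex module, using recursion rather than .NET balancing groups; a toy sketch, not something to run against real HTML:)

    import regex   # pip install regex; the stdlib re module cannot do this

    # (?R) recurses into the whole pattern, so nesting depth is tracked.
    nested_b = regex.compile(r'<b>(?:[^<]|(?R))*</b>')

    print(bool(nested_b.fullmatch('<b>a<b>b</b>c</b>')))   # True  -- balanced
    print(bool(nested_b.fullmatch('<b>a<b>b</b>')))        # False -- unbalanced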
A: 

Regular expressions are not powerful enough for a language like HTML. Sure, there are some cases where you can use regular expressions. But in general they are not appropriate for parsing.

Gumbo
+5  A: 

As far as parsing goes, regular expressions can be useful in the "lexical analysis" (lexer) stage, where the input is broken down into tokens. They're less useful in the actual "build a parse tree" stage.

For an HTML parser, I'd expect it to accept only well-formed HTML, and that requires capabilities outside what a regular expression can do (regexes cannot "count" and make sure that a given number of opening elements is balanced by the same number of closing elements).
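
A sketch of that split (Python; a toy tokenizer, not a full lexer):

    import re

    html = '<p>Hello <b>world</b></p>'

    # Lexing with a regex works fine: chop the stream into tag and text tokens.
    tokens = re.findall(r'<[^>]+>|[^<]+', html)
    print(tokens)   # ['<p>', 'Hello ', '<b>', 'world', '</b>', '</p>']

    # Pairing the tags back up needs a stack -- i.e. counting -- which is
    # exactly the capability a plain regular expression lacks.
    stack = []
    for tok in tokens:
        if tok.startswith('</'):
            print('close', tok, 'matches open', stack.pop())
        elif tok.startswith('<'):
            stack.append(tok)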

Vatine
+1  A: 

Regular expressions were not designed to handle a nested tag structure, and it is at best complicated (at worst, impossible) to handle all the possible edge cases you get with real HTML.

Peter Boughton
+4  A: 

I believe that the answer lies in computation theory. For a language to be parsed using regex it must, by definition, be "regular" (link). HTML is not a regular language, as it fails several of the criteria for one (largely because of the many levels of nesting inherent in HTML code). If you are interested in the theory of computation I would recommend this book.
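
A sketch of the standard argument, applied to a stripped-down model in which a document is just nested <b> elements (notation as in any computability text):

    Let $L = \{\, \texttt{<b>}^n \, \texttt{</b>}^n : n \ge 1 \,\}$, a tiny fragment of well-nested HTML.
    Suppose $L$ were regular with pumping length $p$, and take $s = \texttt{<b>}^p \, \texttt{</b>}^p$.
    The pumping lemma splits $s = xyz$ with $|xy| \le p$ and $|y| \ge 1$, so $y$ lies entirely
    within the opening tags. Pumping it, $xy^2z$ either has more opening than closing tags or
    breaks the tag syntax, so $xy^2z \notin L$ -- contradicting the lemma. Hence $L$ is not
    regular, and no regular expression can enforce that opening and closing tags balance.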

tarbot2009
I've actually read that book. It just didn't occur to me that HTML is a context-free language.
ntownsend
Thanks for the info though!
ntownsend
A: 

"It depends" though. It's true that regexes don't and can't parse HTML with true accuracy, for all the reasons given here. If, however, the consequences of getting it wrong (such as not handling nested tags) are minor, and if regexes are super-convenient in your environment (such as when you're hacking Perl), go ahead.

Suppose you're, oh, maybe parsing web pages that link to your site--perhaps you found them with a Google link search--and you want a quick way to get a general idea of the context surrounding your link. You're trying to run a little report that might alert you to link spam, something like that.

In that case, misparsing some of the documents isn't going to be a big deal. Nobody but you will see the mistakes, and if you're very lucky there will be few enough that you can follow up individually.
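
A rough sketch of that kind of quick report (Python; the domain, the context window, and the sample page are all made up):

    import re

    YOUR_SITE = 'example.com'   # hypothetical domain

    # Up to 60 characters of surrounding text on each side of any link to the site.
    pattern = re.compile(
        r'(.{0,60})<a\s[^>]*href="[^"]*' + re.escape(YOUR_SITE) +
        r'[^"]*"[^>]*>(.*?)</a>(.{0,60})',
        re.IGNORECASE | re.DOTALL)

    page = ('Buy cheap pills! <a href="http://example.com/post">great article</a>'
            ' says our sponsor.')
    for before, text, after in pattern.findall(page):
        print(repr(before), '|', repr(text), '|', repr(after))

A page with unusual markup just produces one noisy or missing row in a report nobody else sees--which is exactly the kind of failure you can live with here.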

I guess I'm saying it's a tradeoff. Sometimes implementing or using a correct parser--as easy as that may be--might not be worth the trouble if accuracy isn't critical.

Just be careful with your assumptions. I can think of a few ways the regexp shortcut can backfire if you're trying to parse something that will be shown in public, for example.

catfood