views: 2709 | answers: 10

It seems like every question on Stack Overflow where the asker is using regex to grab some information from HTML inevitably has an "answer" that says not to use regex to parse HTML.

Why not? I'm aware that there are "real" HTML parsers out there like Beautiful Soup, and I'm sure they're powerful and useful, but if you're just doing something simple, quick, or dirty, then why bother using something so complicated when a few regex statements will work just fine?

Moreover, is there just something fundamental that I don't understand about regex that makes them a bad choice for parsing in general?

+3  A: 

Two quick reasons:

  • writing a regex that can stand up to malicious input is hard; way harder than using a prebuilt tool
  • writing a regex that can work with the ridiculous markup that you will inevitably be stuck with is hard; way harder than using a prebuilt tool

Regarding the suitability of regexes for parsing in general: they aren't suitable. Have you ever seen the sorts of regexes you would need to parse most languages?
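
For example, a minimal sketch of the second bullet (Python; the naive pattern and the sloppy markup are made up):

    import re

    # Naive pattern: assumes lowercase tags, double quotes, href first, no extras.
    naive_href = re.compile(r'<a href="([^"]+)">')

    # The "ridiculous markup" you actually get: uppercase tag, single quotes,
    # extra attributes, stray whitespace. Browsers accept it; the regex doesn't.
    html = "<A CLASS=link HREF='/page?id=1' >click</A>"

    print(naive_href.findall(html))   # [] -- the link is silently missed

A prebuilt parser absorbs all of that variation for you; a regex has to anticipate every variant by hand.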

Hank Gay
+3  A: 

Because there are many ways to "screw up" HTML that browsers will still treat in a rather liberal way. It would take quite some effort to reproduce that liberal behaviour and cover all the cases with regular expressions, so your regex will inevitably fail on some special cases, and that could introduce serious security gaps in your system.
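
A hypothetical illustration of that (Python; the naive filter and the payloads are made up): a regex written to strip <script> blocks quietly lets through every variant it didn't anticipate, and the browser accepts them all.

    import re

    def strip_scripts(html):
        # Naive sanitizer: only matches a bare, lowercase <script>...</script> pair.
        return re.sub(r'<script>.*?</script>', '', html, flags=re.DOTALL)

    payloads = [
        '<SCRIPT>alert("xss")</SCRIPT>',                   # uppercase tag
        '<script >alert("xss")</script>',                  # stray space in the tag
        '<script src="//evil.example/x.js"></script>',     # attributes
    ]
    for p in payloads:
        print(strip_scripts(p))   # every payload survives untouched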

DrJokepu
Very true, the majority of HTML out there seems to be horrible. I don't understand how a failing regular expression can introduce serious security gaps. Can you give an example?
ntownsend
ntownsend: For instance, you think you have stripped all the script tags from the HTML but your regex fails to cover a special case (one that, let's say, only works on IE6): boom, you have an XSS vulnerability!
DrJokepu
This was a strictly hypothetical example since most real-world examples are too complicated to fit into these comments, but you could find a few with some quick googling on the subject.
DrJokepu
Helpful. Thanks!
ntownsend
+1 for mentioning the security angle. When you're interfacing with the entire internet you can't afford to write hacky "works most of the time" code.
j_random_hacker
+2  A: 

The problem is that most users who ask a question involving HTML and regex do so because they can't get a regex of their own to work. Then it is worth asking whether everything would be easier with a DOM or SAX parser or something similar, since those are optimized and built for the purpose of working with XML-like document structures.

Sure, there are problems that can be solved easily with regular expressions. But the emphasis is on easily.

If you just want to find all URLs that look like http://.../ you're fine with regexps. But if you want to find all URLs that are in an <a> element with the class 'mylink', you're probably better off using an appropriate parser.
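
A rough sketch of that contrast (Python; assumes the third-party Beautiful Soup library and made-up markup):

    import re
    from bs4 import BeautifulSoup   # pip install beautifulsoup4

    html = ('<p><a class="mylink" href="http://example.com/a">a</a>'
            '<a href="http://example.com/b">b</a></p>')

    # Easy with a regex: grab anything that merely looks like a URL.
    print(re.findall(r'http://[^\s"<>]+', html))
    # ['http://example.com/a', 'http://example.com/b']

    # Structural question ("links with class mylink"): let a parser answer it.
    soup = BeautifulSoup(html, 'html.parser')
    print([a['href'] for a in soup.find_all('a', class_='mylink')])
    # ['http://example.com/a']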

okoman
+16  A: 

For quick'n'dirty work a regexp will do fine. But the fundamental thing to know is that it is impossible to construct a regexp that will correctly parse HTML.

The reason is that regexps can't handle arbitrarily nested expressions. See Can regular expressions be used to match nested patterns?
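
A small demonstration of the nesting problem (Python; the markup is made up):

    import re

    html = '<div>outer <div>inner</div> tail</div>'

    # Non-greedy: stops at the *first* </div>, cutting the outer element short.
    print(re.findall(r'<div>(.*?)</div>', html))   # ['outer <div>inner']

    # Greedy: runs to the *last* </div>; neither variant tracks nesting depth.
    print(re.findall(r'<div>(.*)</div>', html))    # ['outer <div>inner</div> tail']

Whichever way you tune the quantifier, the regex is pairing tags by position, not by depth.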

kmkaplan
That's it in a nutshell. Wouldn't hurt to mention why though -- namely, because regexes can't support arbitrarily nested patterns.
j_random_hacker
Yes, can you please give some more details on why it is impossible?
ntownsend
@j_random_hacker: added referencing another Stackoverflow answer.
kmkaplan
@kmkaplan: Good stuff!
j_random_hacker
+35  A: 

Parsing HTML in its entirety is not possible with regular expressions, since it depends on matching each opening tag with its corresponding closing tag, which is not possible with regexps.

Regular expressions can only match regular languages, but HTML is a context-free language. The only thing you can do with regexps on HTML is heuristics, and those will not work in every case. It should be possible to present an HTML file that will be matched wrongly by any given regular expression.
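
One way to see it concretely (Python; a toy pattern, not a real HTML matcher): you can handcraft a regex for any *fixed* nesting depth, but there is always a deeper, perfectly valid document that it rejects.

    import re

    # Handcrafted to accept <b> elements nested at most two levels deep.
    depth2 = re.compile(r'<b>(?:[^<]|<b>[^<]*</b>)*</b>')

    print(bool(depth2.fullmatch('<b>x<b>y</b>z</b>')))          # True  (depth 2)
    print(bool(depth2.fullmatch('<b>x<b>y<b>z</b></b></b>')))   # False (depth 3)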

Johannes Weiß
Best answer so far. If regexes can only match regular grammars, then we would need an infinitely large regexp to parse a context-free grammar like HTML. I love it when these things have clear theoretical answers.
ntownsend
I assumed we were discussing Perl-type regexes where they aren't actually regular expressions.
Hank Gay
What is it that makes Perl-type regular expressions not actual regular expressions?
ntownsend
Excellent response. +1 for both you and the OP.
Alex Barrett
ntownsend: They can refer to previously-matched parts later in the regexp, among other things. I'm not entirely sure WHERE they end up in the automaton hierarchy, though.
Vatine
Actually, .NET regular expressions can match opening with closing tags, to some extent, using balancing groups and a carefully crafted expression. Containing _all_ of that in a regexp is still crazy of course; it would look like the great code Cthulhu and would probably summon the real one as well. And in the end it still won't work for all cases. They say that if you write a regular expression that can correctly parse any HTML, the universe will collapse onto itself.
Alex Paven
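
(For the curious, a rough analogue of that trick in Python's third-party regex module, using recursion rather than .NET balancing groups; a toy sketch, not something to run against real HTML:)

    import regex   # pip install regex; the stdlib re module cannot do this

    # (?R) recurses into the whole pattern, so nesting depth is tracked.
    nested_b = regex.compile(r'<b>(?:[^<]|(?R))*</b>')

    print(bool(nested_b.fullmatch('<b>a<b>b</b>c</b>')))   # True  -- balanced
    print(bool(nested_b.fullmatch('<b>a<b>b</b>')))        # False -- unbalanced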
A: 

Regular expressions are not powerful enough for a language like HTML. Sure, there are some cases where you can use regular expressions. But in general they are not appropriate for parsing.

Gumbo
+5  A: 

As far as parsing goes, regular expressions can be useful in the "lexical analysis" (lexer) stage, where the input is broken down into tokens. They're less useful in the actual "build a parse tree" stage.

For an HTML parser, I'd expect it to accept only well-formed HTML, and that requires capabilities outside what a regular expression can do (regexes cannot "count" and make sure that a given number of opening elements is balanced by the same number of closing elements).
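
A sketch of that split (Python; a toy tokenizer, not a full lexer):

    import re

    html = '<p>Hello <b>world</b></p>'

    # Lexing with a regex works fine: chop the stream into tag and text tokens.
    tokens = re.findall(r'<[^>]+>|[^<]+', html)
    print(tokens)   # ['<p>', 'Hello ', '<b>', 'world', '</b>', '</p>']

    # Pairing the tags back up needs a stack -- i.e. counting -- which is
    # exactly the capability a plain regular expression lacks.
    stack = []
    for tok in tokens:
        if tok.startswith('</'):
            print('close', tok, 'matches open', stack.pop())
        elif tok.startswith('<'):
            stack.append(tok)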

Vatine
+1  A: 

Regular expressions were not designed to handle a nested tag structure, and it is at best complicated (at worst, impossible) to handle all the possible edge cases you get with real HTML.

Peter Boughton
+4  A: 

I believe that the answer lies in computation theory. For a language to be parsed using regex it must, by definition, be "regular" (link). HTML is not a regular language, as it fails several of the criteria for one (largely because of the many levels of nesting inherent in HTML code). If you are interested in the theory of computation I would recommend this book.
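
A sketch of the standard argument, applied to a stripped-down model in which a document is just nested <b> elements (notation as in any computability text):

    Let $L = \{\, \texttt{<b>}^n \, \texttt{</b>}^n : n \ge 1 \,\}$, a tiny fragment of well-nested HTML.
    Suppose $L$ were regular with pumping length $p$, and take $s = \texttt{<b>}^p \, \texttt{</b>}^p$.
    The pumping lemma splits $s = xyz$ with $|xy| \le p$ and $|y| \ge 1$, so $y$ lies entirely
    within the opening tags. Pumping it, $xy^2z$ either has more opening than closing tags or
    breaks the tag syntax, so $xy^2z \notin L$ -- contradicting the lemma. Hence $L$ is not
    regular, and no regular expression can enforce that opening and closing tags balance.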

tarbot2009
I've actually read that book. It just didn't occur to me that HTML is a context-free language.
ntownsend
Thanks for the info though!
ntownsend
A: 

"It depends" though. It's true that regexes don't and can't parse HTML with true accuracy, for all the reasons given here. If, however, the consequences of getting it wrong (such as not handling nested tags) are minor, and if regexes are super-convenient in your environment (such as when you're hacking Perl), go ahead.

Suppose you're, oh, maybe parsing web pages that link to your site--perhaps you found them with a Google link search--and you want a quick way to get a general idea of the context surrounding your link. You're trying to run a little report that might alert you to link spam, something like that.

In that case, misparsing some of the documents isn't going to be a big deal. Nobody but you will see the mistakes, and if you're very lucky there will be few enough that you can follow up individually.
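
A rough sketch of that kind of quick report (Python; the domain, the context window, and the sample page are all made up):

    import re

    YOUR_SITE = 'example.com'   # hypothetical domain

    # Up to 60 characters of surrounding text on each side of any link to the site.
    pattern = re.compile(
        r'(.{0,60})<a\s[^>]*href="[^"]*' + re.escape(YOUR_SITE) +
        r'[^"]*"[^>]*>(.*?)</a>(.{0,60})',
        re.IGNORECASE | re.DOTALL)

    page = ('Buy cheap pills! <a href="http://example.com/post">great article</a>'
            ' says our sponsor.')
    for before, text, after in pattern.findall(page):
        print(repr(before), '|', repr(text), '|', repr(after))

A page with unusual markup just produces one noisy or missing row in a report nobody else sees--which is exactly the kind of failure you can live with here.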

I guess I'm saying it's a tradeoff. Sometimes implementing or using a correct parser--as easy as that may be--might not be worth the trouble if accuracy isn't critical.

Just be careful with your assumptions. I can think of a few ways the regexp shortcut can backfire if you're trying to parse something that will be shown in public, for example.

catfood