views:

415

answers:

10

While it's absolutely true that regexps are not the right tool to fully parse HTML documents, I see a lot of people blindly dismissing any question about regexps if they so much as see a single HTML tag in the proposed text.

Since we see a lot of examples of regexp not being the right tool, I ask your opinion on this: what are the cases where a simple pattern match is a better solution than using a full parsing engine?

+1  A: 

If you can guarantee that the pattern you need to match is within a single HTML tag, then maybe you could create a regular expression to match it.

In other words, not when you need an expression to find matching tag/endtags and not when the content you need to match might contain nested tags, comments, CDATA sections, etc.
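As a rough illustration of the "within a single HTML tag" case, here is a minimal Python sketch. The sample tag, the `ATTR_RE` pattern and the `tag_attributes` helper are all made up for this example, and the pattern assumes double-quoted attribute values with no `>` or `"` inside them.

```python
import re

# Hypothetical input: one self-contained tag, nothing nested inside it.
TAG = '<img src="logo.png" alt="Company logo" width="120">'

# Matches name="value" pairs; assumes every value is double-quoted.
ATTR_RE = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')

def tag_attributes(tag):
    """Return the double-quoted attributes of a single tag as a dict."""
    return {name.lower(): value for name, value in ATTR_RE.findall(tag)}

print(tag_attributes(TAG))
```

As soon as the match has to span an opening tag, its content and the closing tag, those assumptions stop holding.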

Bill Karwin
+10  A: 

If the set of HTML you're looking to parse with a regexp is known to conform to some sort of pattern, e.g. if you know there's no commented-out HTML, no complex scenarios, etc.

For example, I often preach that you shouldn't use regexps for HTML, but if I have a set of HTML that I'm familiar with, that is straightforward, and that I can check easily post-manipulation, then I have no qualms about using a regexp for it.

Brian Agnew
+4  A: 

I think the best answer here is: regular expressions are the right tool except for when they aren't.

I think if you can cleanly and effectively solve your problem using regex, then go for it. But I've seen far too many regex hacks that exist only because the programmer / web designer was just plain lazy.

Regex is powerful and one of the best tools a programmer can learn, but you also need to learn when to use it and when to use something different.

Robert Greiner
+1  A: 

If the information that you are using has a regular grammar, then regexes are great. HTML doesn't have a regular grammar, so things are more complex.

Regexes are suitable if you know with absolute certainty what sort of thing you are looking for. Replacing

<tag>Info</tag>

with

<tag>Dave</tag>

in a document that you have complete control of would make sense, but real-life HTML isn't like this.
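For a document you generate yourself, the replacement above could be sketched in Python like this. The `doc` string is a made-up stand-in, and the pattern assumes the tag never carries attributes or nested markup.

```python
import re

# Hypothetical document we produce ourselves, so the exact markup is known.
doc = "<tag>Info</tag>\n<other>Info</other>\n<tag>Info</tag>"

# Replace the text content of every attribute-free <tag> element.
# [^<]* refuses to cross another tag, so nested markup is left alone.
updated = re.sub(r"<tag>[^<]*</tag>", "<tag>Dave</tag>", doc)

print(updated)
```

The `<other>` element is left untouched because the tag name is spelled out literally in the pattern.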

Rich Bradshaw
There is "real life" HTML which is known and predictable too, you know. If it's something that someone else is arbitrarily editing, then it won't be known. However, if there is a program which always outputs in a particular format, that's still "real life" and regexes would work (...until the program changes).
nickf
+2  A: 

Obviously, in the simplest cases like

<a>Test</a>

you might get along with a regex. But even then, a perfectly valid HTML tag could come in so many different varieties:

< A > Test</a>                // match
< a href="test">   Test</a>   // match
< A TEST="test"/>             // no match
< a href="test<">Test</A>     // invalid input - catch that with a regex!

that the regex to catch them reliably gets HUGE. A DOM-based parser will parse it, give you a proper error message if it fails, and provide stable results.
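To make the comparison concrete, here is a small sketch using Python's standard-library `html.parser`, which is an event-based parser rather than a full DOM, so it only approximates the point. The `AnchorText` class, the `anchor_texts` helper and the sample string are hypothetical; the samples are tidied versions of the varieties above.

```python
from html.parser import HTMLParser

class AnchorText(HTMLParser):
    """Collect the text found inside every <a> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <a> elements are currently open
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":      # the parser lowercases tag names for us
            self.depth += 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "a" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data

def anchor_texts(html):
    parser = AnchorText()
    parser.feed(html)
    parser.close()
    # Drop anchors with no text, such as the self-closing variant.
    return [t.strip() for t in parser.texts if t.strip()]

samples = '<A>Test</a> <a href="test">   Test</a> <A TEST="test"/>'
print(anchor_texts(samples))
```

Case differences, attributes and stray whitespace are handled by the parser itself, which is the part a hand-written regex would have to spell out.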

Pekka
+1  A: 

When you know what you're doing!

; )

Bart Kiers
Isn't the general tautology that when you know what you're doing, you'll know it's unwise to do?
eyelidlessness
+3  A: 

Jeff Atwood discusses it extensively in his blog posts entitled "Programming Is Hard, Let's Go Shopping!" and "Parsing Html The Cthulhu Way".

"So, yes, generally speaking, it is a bad idea to use regular expressions when parsing HTML. We should be teaching neophyte developers that, absolutely. Even though it's an apparently neverending job. But we should also be teaching them the very real difference between parsing HTML and the simple expedience of processing a few strings. And how to tell which is the right approach for the task at hand."

Find more details in the posts mentioned above.

Gregory Pakosz
+1  A: 

One thing worth keeping in mind is that there are two main sources of objection to processing HTML with regular expressions. One source has to do with the probability of junk HTML that is unpredictably malformed. This is in itself a legitimate reason to be skeptical about processing HTML with regex, and it rules out a lot of use cases from the start. The problem is that this objection is often used to "throw out the baby with the bathwater", and it is also often conflated with the second main objection (usually with both left unstated), even though the two are completely unrelated.

The other main source of objection has to do with HTML language complexity exceeding some idealized, theoretical conception of "regular expression" that is too general to apply to many use cases, but is usually applied across the board. The objection goes something like this:

  1. Truism: Regular expressions process regular grammars.
  2. Truism: HTML is not a regular grammar.
  3. HTML cannot be processed with regular expressions.

I think a lot of people really just take these truisms at face value without considering what's meant by them. Bill Karwin, in another answer here, mentioned some cases where HTML is not a regular grammar, but this argument falls apart when the context is a "regex" engine that has non-regular features (like back references, or even recursion). These features solve many of the "not a regular grammar" objections, but may still fail on malformed documents.
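As a small illustration of one such non-regular feature, the sketch below uses a back reference in Python's `re` module to require that a closing tag repeats the name of its opening tag. The `PAIRED` pattern and the sample string are invented for this example. Note that the standard `re` module offers back references but not recursion, so identically named nested tags would still defeat this pattern.

```python
import re

# \1 must repeat whatever tag name group 1 captured. A strictly
# "regular" expression has no way to express that constraint.
PAIRED = re.compile(r"<(\w+)[^>]*>(.*?)</\1>", re.DOTALL | re.IGNORECASE)

# Hypothetical input with one mismatched pair (<i> closed by </b>).
html = "<b>bold</b> <i>italic</b> <em>fine</em>"

pairs = [(m.group(1), m.group(2)) for m in PAIRED.finditer(html)]
print(pairs)
```

The mismatched `<i>...</b>` pair is skipped rather than reported, which echoes the caveat about malformed documents.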

This distinction is rarely drawn, and it's rarely pointed out that most modern "regular" expression libraries have capabilities far beyond regular language processing. I think these are important things to consider whenever evaluating "regular" expressions as the appropriate tool to process some HTML.

eyelidlessness
A: 

You can use a regexp when you either parse HTML you have control over or are writing a parser for one specific HTML page. You should not use a regexp when trying to build a universal parser.

serg
A: 

I just found an example of a regexp beating an HTML parser. I needed to extract some information from a long page (8231 lines, 400 KB) and I first tried using simple_html_dom. Since I got stuck due to the problem reported in this question, I went for the alternative approach and realized that I actually only needed the information contained in the first 416 lines of that file (~4% of the total), so loading the whole DOM into memory looked like a huge waste of resources.

Now I still don't know why simple_html_dom is failing on that, so I can't really compare the performance of the two solutions, but the regexp version only loads as many lines as needed (up to the end of the <ul> I'm interested in and no more) and is very quick.
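The original code isn't shown here, so the following is only a guess at the general shape of that approach, written in Python rather than PHP. The `ITEM_RE` pattern, the `items_until_list_end` helper and the sample page are hypothetical, and the sketch assumes each `<li>` sits on a single line.

```python
import io
import re

# Assumes every list item opens and closes on the same line.
ITEM_RE = re.compile(r"<li>(.*?)</li>")

def items_until_list_end(lines):
    """Yield <li> contents line by line, stopping at the first </ul>."""
    for line in lines:
        yield from ITEM_RE.findall(line)
        if "</ul>" in line:
            break  # the rest of the input is never read

# Hypothetical stand-in for the long page; a real file object works the same way.
page = io.StringIO(
    "<ul>\n<li>one</li>\n<li>two</li>\n</ul>\n<p>thousands more lines</p>\n"
)
found = list(items_until_list_end(page))
print(found)
```

Because the input is consumed lazily, only the lines up to the closing `</ul>` are ever read.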

kemp