I'm using .NET Regular Expressions to strip HTML code.

Using something like:

<title>(?<Title>[\w\W]+?)</title>[\w\W]+?<div class="article">(?<Text>[\w\W]+?)</div>

This works 99% of the time, but sometimes, when calling...

Regex.IsMatch(HTML, Pattern)

The call just blocks: execution stays on this line of code for several minutes, or indefinitely.

What's going on?

+3  A: 

With some effort, you can make regex work on HTML. However, have you looked at the HTML Agility Pack? It makes it much easier to work with HTML as a DOM, with support for XPath-type queries (e.g. "//div[@class='article']").
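As a rough sketch of what that could look like for the two pieces the question extracts (the title and the article div): this assumes the HtmlAgilityPack NuGet package is referenced, and the sample HTML string is made up for illustration.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

class Program
{
    static void Main()
    {
        // Hypothetical input; in the question this would be the downloaded page.
        const string html =
            "<html><head><title>Hello</title></head>" +
            "<body><div class=\"article\">Some text</div></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches, so a missing
        // tag is a cheap null check rather than a long-running match attempt.
        HtmlNode title = doc.DocumentNode.SelectSingleNode("//title");
        HtmlNode article = doc.DocumentNode.SelectSingleNode("//div[@class='article']");

        Console.WriteLine(title != null ? title.InnerText : "(no title)");
        Console.WriteLine(article != null ? article.InnerHtml : "(no article div)");
    }
}
```

Note that the XPath test @class='article' is an exact string comparison, so a div with several classes would need a contains() check instead.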

Marc Gravell
+1  A: 

You're asking your regex to do a lot there. After every character, it has to look ahead to see if the next bit of text can be matched with the next part of the pattern.

Regex is a pattern matching tool. Whilst you can use it for simple parsing, you'd be better off using a specific parser (such as the HTML Agility Pack, as mentioned by Marc).

David Kemp
+1 for recommending a parser.
converter42
+6  A: 

Your regex will work just fine when your HTML string actually contains HTML that fits the pattern. But when your HTML does not fit the pattern, e.g. if the last tag is missing, your regex will exhibit what I call "catastrophic backtracking". Click that link and scroll down to the "Quickly Matching a Complete HTML File" section. It describes your problem exactly. Also, [\w\W]+? is a complicated way of saying .+? with RegexOptions.Singleline.
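For illustration, here is a minimal sketch of the same pattern written with .+? and RegexOptions.Singleline. The match timeout is an addition that is not part of the original question or of this answer: it assumes .NET Framework 4.5 or later, where the Regex constructor accepts a TimeSpan, and it only bounds how long a non-matching input can block the call; it does not remove the backtracking itself. The sample HTML and the two-second limit are arbitrary.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // The pattern from the question, with [\w\W]+? replaced by .+?
        // Singleline makes '.' match newlines too, so the meaning is unchanged.
        const string pattern =
            @"<title>(?<Title>.+?)</title>.+?<div class=""article"">(?<Text>.+?)</div>";

        // Hypothetical input for illustration.
        const string html =
            "<html><head><title>Hello</title></head>\n" +
            "<body><div class=\"article\">Some\ntext</div></body></html>";

        // Assumption: .NET 4.5+ for the matchTimeout parameter.
        var regex = new Regex(pattern, RegexOptions.Singleline, TimeSpan.FromSeconds(2));

        try
        {
            Match m = regex.Match(html);
            if (m.Success)
            {
                Console.WriteLine(m.Groups["Title"].Value);
                Console.WriteLine(m.Groups["Text"].Value);
            }
            else
            {
                Console.WriteLine("No match.");
            }
        }
        catch (RegexMatchTimeoutException)
        {
            // Thrown when the engine exceeds the timeout, e.g. while
            // backtracking over a large page that lacks a closing tag.
            Console.WriteLine("Gave up: the match took longer than 2 seconds.");
        }
    }
}
```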

Jan Goyvaerts