ansaurus

Question

.NET Regular Expressions in Infinite Cycle

Answer 1

+3 A:

With some effort, you can make regex work on HTML. However, have you looked at the HTML Agility Pack? It makes it much easier to work with HTML as a DOM, with support for XPath-type queries (e.g. "//div[@class='article']").
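As a rough illustration of that XPath query, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is referenced; the input string is made up for the example and is not the asker's HTML.

```csharp
using System;
using HtmlAgilityPack; // third-party package, assumed to be installed via NuGet

class Program
{
    static void Main()
    {
        // Hypothetical input, for illustration only.
        const string html =
            "<html><body>" +
            "<div class='article'><p>First article</p></div>" +
            "<div class='sidebar'>ignored</div>" +
            "</body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//div[@class='article']");
        if (nodes == null)
        {
            Console.WriteLine("No matching nodes.");
            return;
        }

        foreach (var node in nodes)
        {
            // InnerText strips the markup and keeps only the text content.
            Console.WriteLine(node.InnerText.Trim());
        }
    }
}
```

Because the parser builds a tree even from malformed markup, a missing closing tag leads to an empty or different node set rather than a long-running match.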

Marc Gravell 2008-11-27 15:08:48

Answer 2

+1 A:

You're asking your regex to do a lot there. After every character, it has to look ahead to see if the next bit of text can be matched with the next part of the pattern.

Regex is a pattern matching tool. Whilst you can use it for simple parsing, you'd be better off using a specific parser (such as the HTML Agility Pack, as mentioned by Marc).

David Kemp 2008-11-27 15:10:13

+1 for recommending a parser.

converter42 2008-11-28 20:39:51

Answer 3

+6 A:

Your regex will work just fine when your HTML string actually contains HTML that fits the pattern. But when your HTML does not fit the pattern, e.g. if the last tag is missing, your regex will exhibit what I call "catastrophic backtracking". The "Quickly Matching a Complete HTML File" section of my article on that topic describes your problem exactly. Also, [\w\W]+? is a complicated way of saying .+? with RegexOptions.Singleline.
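To illustrate the last point, here is a small sketch comparing the two spellings. The pattern and input are hypothetical stand-ins, since the original regex is not shown here.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Hypothetical input with line breaks between the tags.
        const string html = "<div>\nsome text\n</div>";

        // [\w\W] is "word character or non-word character", i.e. any character,
        // newlines included.
        Match a = Regex.Match(html, @"<div>([\w\W]+?)</div>");

        // With RegexOptions.Singleline, the dot also matches newlines.
        Match b = Regex.Match(html, @"<div>(.+?)</div>", RegexOptions.Singleline);

        // Both patterns succeed and capture the same text.
        Console.WriteLine(a.Success && b.Success
            && a.Groups[1].Value == b.Groups[1].Value); // prints True
    }
}
```

Switching to the shorter form only makes the pattern easier to read. It does not by itself prevent the backtracking problem when the input fails to match.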

Jan Goyvaerts 2008-11-27 17:52:00
