ansaurus

Question

Answer 1

+1 A:

Regular expressions are meant to parse regular languages - those that can be recognized by finite automata. HTML is not a regular language, because its elements can nest to arbitrary depth. Parsing HTML with regular expressions is the Cthulhu way: http://www.codinghorror.com/blog/archives/001311.html.
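To make the nesting problem concrete, here is a small sketch in Python. The helper name first_div_body and the sample strings are made up for illustration. Neither the greedy nor the non-greedy pattern pairs tags correctly in both cases, since a plain regex has no way to count nesting depth.

```python
import re


def first_div_body(html, greedy):
    """Return what a naive regex captures as the body of the first <div>.

    Hypothetical helper: it only illustrates the failure mode.
    """
    pattern = r"<div>(.*)</div>" if greedy else r"<div>(.*?)</div>"
    match = re.search(pattern, html, re.DOTALL)
    return match.group(1) if match else None


nested = "<div>outer <div>inner</div> tail</div>"
siblings = "<div>a</div><div>b</div>"

# Non-greedy stops at the first closing tag, which belongs to the inner div.
print(first_div_body(nested, greedy=False))   # outer <div>inner

# Greedy runs to the last closing tag and swallows the sibling div.
print(first_div_body(siblings, greedy=True))  # a</div><div>b
```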

Alex 2009-12-02 20:46:01

Answer 2

+1 A:

It is really simple: extract only the text with an HTML parser, then use regular expressions on that.
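A minimal sketch of that approach, assuming Python and its standard-library html.parser module (the names TextExtractor and find_outside_tags are made up for illustration): the parser hands over only the text between tags, so the regex never sees tag names or attribute values.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of a document, ignoring tags and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; tag names and attribute values never arrive here.
        self.chunks.append(data)


def find_outside_tags(html, pattern):
    """Return all regex matches found in the text content only (hypothetical helper)."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    text = "".join(extractor.chunks)
    return re.findall(pattern, text)


sample = '<a href="foo.html" title="foo">foo bar</a> and <b>foo</b>'
print(find_outside_tags(sample, r"foo"))  # ['foo', 'foo']
```

The two occurrences of "foo" inside the href and title attributes are not matched. Note that this sketch joins all text nodes into one string, so a match can span element boundaries, and the contents of script and style elements are treated as text too; filter those out if that matters for your input.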

Svante 2009-12-02 20:46:46

Answer 3

A:

HTML should not be parsed with regex because it's not a regular language. You might be able to do it with well-formed XHTML, but I wouldn't recommend it. See the most upvoted answer on SO.

Malfist 2009-12-02 21:05:24

regex - match not in tag