ansaurus

Question

regex: match string only if not part of a tag

Answer 1

+4 A:

I really wouldn't use regexps to match HTML, since HTML isn't regular and there are a load of edge cases to trip you up. For all but the simplest cases I'd use an HTML parser (e.g. this one for PHP).

Brian Agnew 2009-09-02 08:48:44

You are right. I never found a good HTML parser, but the one you linked to works like a charm! Thanks. Thank you also to nicerobot and Vinzz for the regexes.

Casper 2009-09-02 16:15:59

Answer 2

A:

Brian has a point. Still, if you want to use a regex, this one suits your inputs:

.*>[^<]*abc[^<]*<.*

Vinzz 2009-09-02 08:54:56

Misses "abc <p>" and "<p> abc".

nicerobot 2009-09-02 11:14:55

but... but it wasn't in the inputs! It's totally unfair ;o)

Vinzz 2009-09-02 13:28:47
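The behaviour discussed in this answer can be checked with a short Python sketch. The sample strings are illustrative assumptions (the question's original inputs are not shown here), and Python's `re` module stands in for whatever regex engine the asker uses.

```python
import re

# Vinzz's pattern: "abc" must sit between a '>' and a later '<',
# with no '<' in between, so it cannot be inside a tag.
PATTERN = re.compile(r".*>[^<]*abc[^<]*<.*")

def matches(text: str) -> bool:
    """Return True if the whole (single-line) string matches the pattern."""
    return PATTERN.fullmatch(text) is not None

# Hypothetical sample inputs, chosen to show one hit and three misses.
samples = ["<p>abc</p>", "<abc>text</abc>", "abc <p>", "<p> abc"]
for s in samples:
    print(f"{s!r}: {matches(s)}")
```

The first sample matches and the second is correctly rejected, but the last two (the cases nicerobot points out) are rejected as well, because the pattern requires a tag on both sides of the text.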

Answer 3

A:

I'm quite convinced that any regex is going to break on some CDATA sections.

MSalters 2009-09-02 09:10:19

Answer 4

A:

While I too agree with Brian's comment, I often do quick-and-dirty parsing with regular expressions, and for your case I'd use something like this:

"serialize" the data

s/[\r\n]//g
s/<!\[CDATA\[.*?\]\]>//g
s/</\n</g
s/>/>\n/g

Then simply filter out all lines that begin with <:

s/^<.*//

What you're left with is just the text (and possibly a lot of white-space). Though this is less about regular expressions and more about search and replace.

nicerobot 2009-09-02 13:26:39
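The substitution steps in this answer could be sketched in Python roughly as follows. The function names `text_chunks` and `contains_outside_tags` are hypothetical, and the sketch inherits the approach's limits: a `>` inside an attribute value, a comment, or a script block will confuse it.

```python
import re

def text_chunks(html: str) -> list[str]:
    """Split markup so every tag is on its own line, then drop the tag lines."""
    s = re.sub(r"[\r\n]", "", html)            # flatten to a single line
    s = re.sub(r"<!\[CDATA\[.*?\]\]>", "", s)  # drop CDATA sections
    s = s.replace("<", "\n<")                  # a newline before every tag
    s = s.replace(">", ">\n")                  # a newline after every tag
    # Keep only non-blank lines that do not start with '<' (i.e. not tags).
    return [line for line in s.split("\n")
            if line.strip() and not line.startswith("<")]

def contains_outside_tags(html: str, needle: str) -> bool:
    """True if needle occurs in the text between tags, not inside a tag."""
    return any(needle in chunk for chunk in text_chunks(html))

print(contains_outside_tags('<p class="abc">hello</p>', "abc"))
print(contains_outside_tags("abc <p>", "abc"))
```

Unlike a single regex, this handles text at the very start or end of the input, since no surrounding tags are required.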

Answer 5

A:

What you're looking for is a DOM parser. It will strip out all the HTML and give you the plain text of the page you're examining, which you can then match on. I'm not sure what your use case is, but I'm assuming you're not manipulating the DOM, or else you'd be using JavaScript.

If you're just extracting information, parse the page using something like The Simple HTML DOM Parser, and then match against the plain text you can get from the parsed object.

Robert Elwell 2009-09-02 13:33:12
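As a rough illustration of the parse-then-match idea, here is a minimal Python sketch. It does not use the Simple HTML DOM Parser mentioned above (that is a PHP library); it swaps in the event-based `html.parser` module from Python's standard library, which is not a full DOM parser but is enough to collect the text between tags. The names `TextCollector` and `plain_text` are hypothetical.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collects character data only; tag names and attributes are ignored."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        # Called for text between tags, never for the tags themselves.
        self.chunks.append(data)

def plain_text(html: str) -> str:
    """Return the text content of html with all markup removed."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)

print("abc" in plain_text('<a href="abc.html">link</a>'))
print("abc" in plain_text("<p>some abc text</p>"))
```

Once the plain text is extracted, an ordinary substring test or regex is enough, because no tag or attribute can appear in the string being searched.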
