Hi,

I am trying to match a string only if it is not part of an HTML tag.

For example, when searching for the string "abc":

<a href="foo.html">abc def</a> should match

<p> foo bar foo abc foo bar</p> should match

but <a href="abc.html">foo</a> should not match.

Thanks for the help!

+4  A: 

I really wouldn't use regexps to match HTML, since HTML isn't regular and there are a load of edge cases to trip you up. For all but the simplest cases I'd use an HTML parser (e.g. this one for PHP).

Brian Agnew
You are right. I had never found a good HTML parser, but the one you linked to works like a charm! Thanks. Thank you also to nicerobot and Vinzz for the regexes.
Casper
A: 

Brian has a point. Anyway, if you wish to use a regex, this one suits your inputs:

.*>[^<]*abc[^<]*<.*
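
To see what this pattern accepts, here is a small Python check. It is only a sketch: the `samples` list and the `matches` helper are made up for illustration, and the last two samples are extra inputs with text on just one side of a tag, which the question did not list.

```python
import re

# The pattern from this answer, unchanged.
PATTERN = re.compile(r".*>[^<]*abc[^<]*<.*")

def matches(text):
    """Return True if the pattern matches anywhere in text."""
    return PATTERN.search(text) is not None

# The three inputs from the question, plus two one-sided cases.
samples = [
    '<a href="foo.html">abc def</a>',
    '<p> foo bar foo abc foo bar</p>',
    '<a href="abc.html">foo</a>',
    'abc <p>',
    '<p> abc',
]

for s in samples:
    print(matches(s), s)
```

The pattern requires a ">" somewhere before "abc" and a "<" somewhere after it, with no "<" in between, so text that has a tag on only one side is not matched.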
Vinzz
Misses "abc <p>" and "<p> abc".
nicerobot
but... but it wasn't in the inputs! It's totally unfair ;o)
Vinzz
A: 

I'm quite convinced that any regex is going to break on some CDATA sections.

MSalters
A: 

While I agree with Brian's comment, I often do quick and dirty parsing with regular expressions, and for your case I'd use something like this:

  • "serialize" the data
s/[\r\n]//g
s/<!\[CDATA\[.*?]]>//g
s/</\n</g
s/>/>\n/g
  • then simply filter all lines that begin with <
s/^<.*//mg

What you're left with is just the text (and possibly a lot of white-space). Though this is less about regular expressions and more about search and replace.
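
The same steps can be sketched in Python with `re.sub`. This is a rough port rather than a robust solution, and the function names `visible_text` and `contains_outside_tags` are made up for the example. It assumes no attribute value contains a literal ">" or "<".

```python
import re

def visible_text(html):
    """Quick and dirty: return html with every tag line blanked out."""
    # "Serialize": remove line breaks so tags can be isolated afterwards.
    s = re.sub(r"[\r\n]", "", html)
    # Drop CDATA sections entirely (non-greedy, so each section ends at its own "]]>").
    s = re.sub(r"<!\[CDATA\[.*?\]\]>", "", s)
    # Put every tag on a line of its own.
    s = re.sub(r"<", "\n<", s)
    s = re.sub(r">", ">\n", s)
    # Filter: blank out every line that begins with "<".
    return re.sub(r"^<.*", "", s, flags=re.MULTILINE)

def contains_outside_tags(html, needle):
    """True if needle occurs in the text left after the tags are removed."""
    return needle in visible_text(html)

print(contains_outside_tags('<a href="foo.html">abc def</a>', "abc"))
print(contains_outside_tags('<a href="abc.html">foo</a>', "abc"))
```

Note that removing the line breaks in the first step also joins words that were separated only by a newline, and that comments and script contents are not treated specially.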

nicerobot
A: 

What you're looking for is a DOM parser. That will strip out all the HTML and provide you with the plain text of the page you're examining, which you can then match on. Not sure what your use case is, but I'm assuming you're not manipulating the DOM, or else you'd be using JavaScript.

If you're just extracting information, parse the page using something like The Simple HTML DOM Parser, and then match against the plain text you can get from the parsed object.
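
As an illustration of the parse-then-match idea, here is a minimal sketch using Python's standard `html.parser` module instead of the PHP library named above (a swapped-in tool, not that library's API). The names `TextCollector`, `text_of` and `matches_outside_tags` are hypothetical.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collects only the text nodes; tag names and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags, never for attribute values.
        self.chunks.append(data)

def text_of(html):
    """Return the concatenated text content of an HTML fragment."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)

def matches_outside_tags(html, needle):
    """True if needle occurs in the text content rather than inside a tag."""
    return needle in text_of(html)

print(matches_outside_tags('<a href="foo.html">abc def</a>', "abc"))
print(matches_outside_tags('<a href="abc.html">foo</a>', "abc"))
```

One caveat: `html.parser` also passes the contents of script and style elements to `handle_data`, so those would need to be skipped separately if they should not count as page text.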

Robert Elwell