I want to match a keyword only where it is not linked. In the following example, I want to match only the "google" that is neither between <a> and </a> nor inside a tag's attributes, i.e. only the last "google":

<a href="http://www.google.com" title="google">google</a> is linked, google is not linked.

A: 

Provided you can be sure that your HTML is well-behaved (and valid), and in particular does not contain comments or nested a tags, you can try

google(?!((?!<a[\s>]).)*</a>)

That matches any "google" that is not followed by a closing a tag before the next opening a tag. But you might be better off using an HTML parser instead.
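As a quick check, here is a minimal JavaScript sketch that runs the pattern against the sample line from the question. The variable names are made up for illustration, and the only change to the pattern is escaping the "/" so it fits in a regex literal. Note that "." does not match line breaks, so this sketch assumes the link and the keyword sit on one line.

```javascript
// Sample input taken from the question.
const html = '<a href="http://www.google.com" title="google">google</a> is linked, google is not linked.';

// Same pattern as above, with "/" escaped for a JavaScript regex literal.
const unlinked = /google(?!((?!<a[\s>]).)*<\/a>)/g;

// Collect the start index of every match (matchAll requires the g flag).
const positions = Array.from(html.matchAll(unlinked), (m) => m.index);

console.log(positions.length);          // 1
console.log(html.slice(positions[0]));  // "google is not linked."
```

The occurrences in href, in title and in the link text are all followed by </a> with no opening a tag in between, so the negative lookahead rejects them.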

Jens
@Jens, `(\s|>)` would be better written as a character class: `[\s>]`. A character class is much, much more efficient than an equivalent alternation. It probably doesn't matter in this case, but see this recent question for a demonstration: http://stackoverflow.com/questions/3176825/unicode-regular-expressions-fails-at-343-characters
Alan Moore
@Alan: Thanks for the hint!
Jens
-1 for parsing HTML with a regular expression; this regex can mismatch with XHTML CDATA or HTML comments.
Borealid
@Borealid: That's why I said that the HTML should not contain comments. I agree that this is not the way the problem SHOULD be solved, but I don't think the standard "regex is evil" answer is going to help the OP with his problem in any way.
Jens
This pattern also matches the keyword (google) in the HTML attributes, such as <a http="http://www.google.com">XXX</a>, which I do not want to be matched. Thanks all the same!
James Tang
@James: Does it? It didn't do this in my tests....
Jens
A: 

This works for me (JavaScript):

var matches = str.match(/(?:<a[^>]*>[^<]*<\/a>[\s\S]*)*(google)/);

See it in action
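Because the overall match starts at the first link, the keyword's position has to be derived from the capture group. A minimal sketch of one way to do that follows; the helper name `unlinkedKeywordIndex` is hypothetical and not part of the answer above.

```javascript
// Hypothetical helper wrapping the pattern above; the name is illustrative only.
function unlinkedKeywordIndex(str) {
  const m = str.match(/(?:<a[^>]*>[^<]*<\/a>[\s\S]*)*(google)/);
  if (m === null) return -1;
  // The captured keyword is always the last thing in the overall match,
  // so its start is the end of the match minus the keyword length.
  return m.index + m[0].length - m[1].length;
}

const sample = '<a href="http://www.google.com" title="google">google</a> is linked, google is not linked.';
console.log(unlinkedKeywordIndex(sample) === sample.lastIndexOf('google')); // true
```

One caveat, as far as I can tell: when the input has no unlinked keyword at all, for example `<a href="#">google</a> only`, the pattern falls back to the keyword inside the link, so the result should be checked if that case can occur.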

galambalazs