I would like to write a regular expression in javascript to match specific text, only when it is not part of an html link, i.e.

match <a href="/link/page1">match text</a>

would not be matched, but

match text

or

<p>match text</p>

would be matched.

(The "match text" will change each time the search is run - I will use something like

var tmpStr = new RegExp("\\bmatch text\\b","g");

where the value of "match text" is read from a database.)
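Since the search text comes from a database, here is a rough sketch of how the pattern might be built safely. The helper names `escapeRegExp` and `buildPattern` are made up for illustration and are not part of the original code. Two things it tries to show: inside a string literal the word-boundary token has to be written `"\\b"` (a single `"\b"` is a backspace character), and any regex metacharacters in the stored text should probably be escaped first.

```javascript
// Hypothetical helper: escape regex metacharacters in text read from a database.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: build a global, whole-word pattern for the search phrase.
function buildPattern(phrase) {
  // "\\b" in a string literal yields the regex token \b (word boundary);
  // a single "\b" would be a backspace character instead.
  return new RegExp("\\b" + escapeRegExp(phrase) + "\\b", "g");
}

const pattern = buildPattern("match text");
console.log("<p>match text</p>".match(pattern)); // [ 'match text' ]
```

Note that `\b` only behaves as expected when the phrase starts and ends with a word character (letter, digit or underscore).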

So far my best effort at a regular expression is

\bmatch text\b(?!</a>)

This deals with the closing </a>, but not the initial <a href=...>. This will probably work fine for my purposes, but it does not seem ideal. I'd appreciate any help with refining the regular expression.

+2  A: 

See this previous SO question.

Amber
+2  A: 

You can use a negative look-behind to get the opening <a href=...:

var tmpStr = new RegExp('(?<!<a.*?>)match text(?!</a>)');

Hope that works for you.

Eric Wendelin
Did you mean "(?!<a.*?>)match text(?!</a>)"? This is exactly what I was looking for, thank you very much.
Tomba
Note that this won't avoid matching, say, the `match text` inside of `<a href="...">test match text foo</a>`.
Amber
@Dav: Right, sorry, didn't take it that far. Though it sounds like it is hard/impossible to handle every case ;)
Eric Wendelin
@Tomba, I believe `(?<!<a.*?>)` *is* what Eric intended to write. You were using a negative lookbehind, weren't you Eric? Trouble is, JavaScript doesn't support lookbehinds. But even if it did, regexes would be useless for this task unless you could simplify the problem somehow, as Dav suggested above.
Alan Moore
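To illustrate the points made in these comments, here is a small sketch of how the lookbehind version behaves. It assumes an engine that does support lookbehind assertions (they were added to the language in ES2018, after this thread was written); on older engines the `RegExp` constructor will throw instead. The sample strings are invented for the demonstration.

```javascript
// Assumes an engine with lookbehind support (ES2018 or later).
const lookbehind = new RegExp("(?<!<a.*?>)match text(?!</a>)");

// The text is the only content of the link: rejected, as intended.
console.log(lookbehind.test('<a href="/link/page1">match text</a>')); // false

// The text is inside a link but surrounded by other words: still matched.
console.log(lookbehind.test('<a href="/x">test match text foo</a>')); // true

// The text is outside any link, but an earlier "<a" on the line plus the ">"
// of "<p>" satisfies <a.*?>, so the negative lookbehind rejects it.
console.log(lookbehind.test('<a href="/x">foo</a> <p>match text</p>')); // false
```

So even with lookbehind available, this pattern gives both false positives and false negatives once the markup is anything other than the simplest case.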
+2  A: 

Thanks for the very quick and helpful answers. Just to clarify, the regular expression I ended up using was

(?!<a.*?>)\bmatch text\b(?!</a>)
Tomba
You realize that the above expression will match `<a href="test.html">match text </a>`, correct? In fact, it will match anything where there's a space or other text before the `</a>`, because the `(?!<a.*?>)` is literally doing nothing. The regex you've posted above is *exactly identical* in function to the 'best effort' posted in your OP: `\bmatch text\b(?!</a>)`. Why? Because `(?!<a.*?>)\b` followed by `match text` is identical to `\b` followed by `match text`: the lookahead only rejects positions where the text begins with `<a`, and at any position where `match text` can start, the next character is never `<`, so the lookahead can never fail there.
Amber
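A quick way to check the claim above is to run both expressions over a few sample strings and compare the results. This is only a sketch; the variable names and the sample strings are made up for the demonstration.

```javascript
// The 'best effort' regex from the question and the revised one from this answer.
const original = /\bmatch text\b(?!<\/a>)/;
const revised = /(?!<a.*?>)\bmatch text\b(?!<\/a>)/;

const samples = [
  '<a href="/link/page1">match text</a>', // false for both
  '<a href="test.html">match text </a>',  // true for both (space before </a>)
  "<p>match text</p>",                    // true for both
  "match text",                           // true for both
];

// Print the result of each regex next to the sample it was tested on.
for (const s of samples) {
  console.log(original.test(s), revised.test(s), s);
}
```

Both columns come out the same for every sample, which is consistent with the leading lookahead having no effect.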
Essentially, there are two cases here: either you need to match `match text` anywhere except where it is the *only thing* inside the link (i.e. `<a href="...">match text</a>`, no spaces, no other tags, nothing) - in which case your regex in the OP already would have worked fine without modification; or you need to match the text but only if it's not inside a link, even if wrapped in other text (i.e. `<a href="..."><strong>match text</strong></a>` *shouldn't* be matched), in which case the regex above won't work. Either way, you don't actually gain anything from adding `(?!<a.*?>)` to the front.
Amber
@Dav - Thanks for explaining that. Though it's not obvious from the question, I would ideally like to avoid matching "match text" wherever it is enclosed in <a ...> </a> tags, whether or not there are spaces (basically what I want to do is find a certain string, then convert it into a link, unless it's already a link). However, I can be 99% sure that the match will be the only thing inside the link, so the original (OP) regular expression will probably work fine in practice.
Tomba
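For the goal described here (turn a string into a link unless it is already inside one), one common workaround is to let the regex consume whole existing links as one alternative and capture the phrase as the other, then decide in a replacement callback. The sketch below is not from the thread; `escapeRegExp`, `linkify` and the `href` argument are hypothetical names, and it assumes links are not nested and that the phrase does not appear inside tag attributes.

```javascript
// Hypothetical helper: escape regex metacharacters in the search phrase.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: wrap each whole-word occurrence of `phrase` in a link
// to `href`, leaving existing <a>...</a> elements untouched.
function linkify(html, phrase, href) {
  // First alternative consumes an entire existing link;
  // second alternative captures the phrase outside of links.
  const pattern = new RegExp(
    "<a\\b[^>]*>[\\s\\S]*?</a>|\\b(" + escapeRegExp(phrase) + ")\\b",
    "g"
  );
  return html.replace(pattern, (whole, found) =>
    // `found` is undefined when the existing-link alternative matched.
    found === undefined ? whole : '<a href="' + href + '">' + found + "</a>"
  );
}

const input = '<p>match text</p> and <a href="/x"><strong>match text</strong></a>';
console.log(linkify(input, "match text", "/link/page1"));
// <p><a href="/link/page1">match text</a></p> and <a href="/x"><strong>match text</strong></a>
```

This still treats HTML as plain text, so it would also rewrite the phrase inside something like an `alt` attribute; walking the text nodes of a parsed DOM is likely the more robust route if the markup is not under your control.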
Thanks to Dav and others for pointing out that my answer does not make sense! I have not deleted the answer because of the valuable comments below.
Tomba