I have a fairly long and complex HTML document, and I need to find all occurrences of a given string, e.g. "foobar", unless it's between <a> and </a> anchor tags.

The trouble is: it could be inside some text between the anchor tags, e.g.

<a>this is a foobar test</a>

and even in this case, I should not find the match.

How can I do that with a regex? I would have no trouble finding <a>foobar</a> and so on, but finding every "foobar" except when it's between the anchor tags, possibly surrounded by a lot of other text, seems a bit tricky.

Any ideas?

ANSWER:
We ended up using this Regex to solve this problem - just in case anyone is a) curious, or b) in the same place :-)

(?<!\<A.*(?=\<\/A))Test(?!\<\/A.*(?=\<A))
+1  A: 

You should be able to do this with negative lookahead and lookbehind patterns. Here is a good tutorial:

http://www.regular-expressions.info/lookaround.html

James Conigliaro
@marc_s: which one is that?
SilentGhost
OK, got it to work quite nicely with the regex `(?<!<a.*)foobar` - in C# / .NET the `.` does not match line breaks by default, so the lookbehind only looks back to the start of the current line, and for my purposes that's just fine. Thanks!
marc_s
+2  A: 
'foobar(?![^<]*</a>)'

works for me in the simplest case. It's obviously not resistant to other tags nested inside the <a> element.

SilentGhost
The problem with this is that it doesn't take into account something like: <a> asdf <b>foobar</b> </a>
Chris
Yes, that works, but only if the a-tags don't have other tags in them. It fails for: '<a> this is a foobar <b> foobar </b> test</a>'.
Bart Kiers
I'd say it works for the vast majority of cases.
SilentGhost
If the OP is content with the majority of cases, then all is fine: your solution is a heck of a lot easier to read than the more complex solutions (which cover even more than the vast majority of cases :)).
Bart Kiers
Works nicely - thanks!
marc_s
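For anyone who wants to try the lookahead pattern from this answer and see the nested-tag limitation raised in the comments, here is a minimal Python sketch. It is not from the original answer: the helper name `find_outside_anchors` and the two sample strings are only illustrative, and Python's `re` module stands in for the .NET regex engine the OP is using.

```python
import re

# Pattern from the answer: match "foobar" unless the next tag after it is </a>.
PATTERN = re.compile(r"foobar(?![^<]*</a>)")

def find_outside_anchors(html):
    """Return the start offsets of all matches the lookahead does not reject."""
    return [m.start() for m in PATTERN.finditer(html)]

simple = "foobar <a>this is a foobar test</a> foobar"
nested = "<a> asdf <b>foobar</b> </a>"

# Simple case: only the two occurrences outside the anchor are reported.
print(find_outside_anchors(simple))   # [0, 36]

# Nested case: the next tag after "foobar" is </b>, not </a>, so the
# occurrence inside the anchor is (wrongly) reported as well.
print(find_outside_anchors(nested))   # [12]
```

The second call shows the false positive: the lookahead only inspects the text up to the next `<`, so any tag nested inside the anchor hides the closing `</a>` from it.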
A: 

Try this:

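$str = 'foobar <a>this is a foobar <span>foobar</span> test</a> foobar';

// Matches a whole A element: opening tag (with quoted attribute values), content, closing tag.
$pattern = '<a(?:[^"\'>]+|"[^"]*"|\'[^\']*\')*>(?:[^<]+|(?!<\/a\s*>)<)*<\/a\s*>';
// Split at the A elements and keep them in the result as captured delimiters.
$parts = preg_split('/('.$pattern.')/', $str, -1, PREG_SPLIT_DELIM_CAPTURE);
// Parts alternate between plain text and A elements; check which kind comes first.
$isLink = (bool) preg_match('/^'.$pattern.'$/', $parts[0]);
foreach ($parts as &$part) {
    if (!$isLink) {
        // Replace only in the parts that are not A elements.
        $part = str_replace('foobar', '!!!found!!!', $part);
    }
    $isLink = !$isLink;
}
unset($part); // break the reference left over from the foreach
$str = implode('', $parts);

echo htmlspecialchars($str);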
Gumbo
sorry, silly question: what language/script is this?
marc_s
@marc_s: That’s PHP. Sorry, I somehow assumed you asked for a PHP solution. ;-)
Gumbo
thanks! No, I'm dealing with C# / .NET / jQuery here - but thanks anyway - I'll translate and see if I can make sense of it :-)
marc_s
Well, the algorithm is just to split the string at the A elements you want to avoid, iterate over the parts, and replace *foobar* only in those parts that are not A elements. Since the result of `preg_split` is always an array in which two consecutive items are never of the same type (an A element alternates with anything that is not an A element), the flag `$isLink` tracks the type of the current part and is toggled on each iteration.
Gumbo
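For readers not using PHP, the split-and-toggle idea described above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function name `replace_outside_anchors` is hypothetical, and the `ANCHOR` pattern is a simplified stand-in for the PHP pattern (it does not handle `>` inside quoted attribute values or nested A elements).

```python
import re

# Simplified anchor matcher (an assumption, looser than the PHP pattern):
# an opening <a ...> tag, the shortest possible content, then the closing </a>.
# The single capturing group makes re.split keep the anchors in its result.
ANCHOR = re.compile(r"(<a\b[^>]*>.*?</a\s*>)", re.IGNORECASE | re.DOTALL)

def replace_outside_anchors(html, needle, replacement):
    # With one capturing group, re.split returns plain-text pieces at even
    # indices and the captured anchor elements at odd indices.
    parts = ANCHOR.split(html)
    for i in range(0, len(parts), 2):
        parts[i] = parts[i].replace(needle, replacement)
    return "".join(parts)

text = "foobar <a>this is a foobar <span>foobar</span> test</a> foobar"
print(replace_outside_anchors(text, "foobar", "!!!found!!!"))
# !!!found!!! <a>this is a foobar <span>foobar</span> test</a> !!!found!!!
```

Because the first piece returned by `re.split` is always the text before the first anchor (possibly an empty string), stepping through the even indices plays the same role as toggling `$isLink` in the PHP version.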