ansaurus

Question

Regex challenge - find "foobar" in HTML document

Answer 1

+1 A:

You should be able to do this with negative lookahead and lookbehind patterns. Here is a good tutorial:

http://www.regular-expressions.info/lookaround.html

James Conigliaro 2009-10-02 14:27:46

@marc_s: which one is that?

SilentGhost 2009-10-02 15:01:25

OK, got it to work quite nicely with the regex expression `(?<!<a.*)foobar` - in C# / .NET this seems to apply to each line and for my purposes, that's just fine. Thanks!
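For readers outside .NET: Python's `re` module rejects variable-width lookbehind, so `(?<!<a.*)foobar` cannot be used there as written. The following is only a rough sketch of the same per-line behaviour, done with a plain substring check instead of a lookbehind; the function name and sample text are made up for illustration.

```python
import re

def find_outside_anchor_per_line(text, word="foobar"):
    """Return (line_number, column) for each `word` with no '<a' earlier on its line."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for m in re.finditer(re.escape(word), line):
            # Same idea as (?<!<a.*): reject the match if "<a" occurs
            # anywhere before it on the same line.
            if "<a" not in line[:m.start()]:
                hits.append((line_no, m.start()))
    return hits

sample = 'foobar <a href="#">foobar</a> foobar\nplain foobar'
print(find_outside_anchor_per_line(sample))  # [(1, 0), (2, 6)]
```

Note that the third `foobar` on the first line is skipped even though it sits after the closing `</a>`: anything following an opening `<a` on the same line is ignored, which is the per-line behaviour described above.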

marc_s 2009-10-04 14:22:41

Answer 2

+2 A:

'foobar(?![^<]*</a>)'

Works for me in the simplest case. It's obviously not resistant to other tags nested within the a-tag.
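A quick way to see both the simple case and the nested-tag limitation is to run the pattern over two sample strings. This is a minimal sketch in Python; the helper name, the `FOUND` marker and the sample strings are assumptions chosen for illustration.

```python
import re

# "foobar" not followed by non-"<" characters and then a closing </a>.
PATTERN = re.compile(r"foobar(?![^<]*</a>)")

def replace_outside_links(html, replacement="FOUND"):
    """Replace every 'foobar' that the lookahead considers to be outside a link."""
    return PATTERN.sub(replacement, html)

simple = 'foobar <a href="#">a foobar link</a> foobar'
nested = '<a>this is a <b>foobar</b> test</a>'

print(replace_outside_links(simple))  # FOUND <a href="#">a foobar link</a> FOUND
print(replace_outside_links(nested))  # <a>this is a <b>FOUND</b> test</a>
```

In the nested sample the next tag after `foobar` is `</b>`, not `</a>`, so the lookahead does not see the link and the word is replaced anyway.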

SilentGhost 2009-10-02 14:30:27

The problem with this is that it doesn't account for something like: <a> asdf <b>foobar</b> </a>

Chris 2009-10-02 14:32:50

Yes, that works, but only if the a-tags don't have other tags in them. It fails on: '<a> this is a foobar <b> foobar </b> test</a>'.

Bart Kiers 2009-10-02 14:34:20

I'd say it works for the vast majority of cases.

SilentGhost 2009-10-02 14:37:48

If the OP is content with the majority of cases, then all is fine: your solution is a heck of a lot easier to read than the more complex solutions (which cover even more than the vast majority of cases :)).

Bart Kiers 2009-10-02 14:39:29

Works nicely - thanks!

marc_s 2009-10-04 14:22:02

Answer 3

A:

Try this:

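$str = 'foobar <a>this is a foobar <span>foobar</span> test</a> foobar';

// Matches a whole A element: opening tag (with quoted attribute values), content, closing tag.
$pattern = '<a(?:[^"\'>]+|"[^"]*"|\'[^\']*\')*>(?:[^<]+|(?!<\/a\s*>)<)*<\/a\s*>';
// Split at the A elements but keep them, so the parts alternate between plain text and links.
$parts = preg_split('/('.$pattern.')/', $str, -1, PREG_SPLIT_DELIM_CAPTURE);
$isLink = (bool) preg_match('/^'.$pattern.'$/', $parts[0]);
foreach ($parts as &$part) {
    if (!$isLink) {
        $part = str_replace('foobar', '!!!found!!!', $part);
    }
    $isLink = !$isLink;
}
unset($part); // drop the reference left over from the foreach
$str = implode('', $parts);

echo htmlspecialchars($str);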

Gumbo 2009-10-02 14:35:44

sorry, silly question: what language/script is this?

marc_s 2009-10-02 18:13:51

@marc_s: That’s PHP. Sorry, I somehow assumed you asked for a PHP solution. ;-)

Gumbo 2009-10-03 05:59:46

thanks! No, I'm dealing with C# / .NET / jQuery here - but thanks anyway - I'll translate and see if I can make sense of it :-)

marc_s 2009-10-03 09:56:04

Well, the algorithm is just to split the string at the A elements you want to avoid, iterate over the parts and replace *foobar* only in those parts that are not A elements. Since the result of `preg_split` is always an array where two consecutive items are never of the same type (A element – anything except an A element), the flag `$isLink` tracks the type of the current part and is toggled on each iteration.
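The same split-and-alternate idea can be sketched outside PHP. Below is a rough Python port, not a definitive implementation: it assumes `re.split` with a single capturing group, which returns the parts in alternating order starting with a non-link part, so the even indexes take the place of the `$isLink` flag. The function name is hypothetical.

```python
import re

# One capturing group around the whole A element, so re.split keeps the links
# in the result: [text, link, text, link, ..., text].
LINK = re.compile(
    r"""(<a(?:[^"'>]+|"[^"]*"|'[^']*')*>(?:[^<]+|(?!</a\s*>)<)*</a\s*>)"""
)

def replace_outside_a_elements(html, needle="foobar", replacement="!!!found!!!"):
    """Replace `needle` only in the parts of `html` that are not A elements."""
    parts = LINK.split(html)
    # Even indexes are the non-link parts; odd indexes are the captured links.
    for i in range(0, len(parts), 2):
        parts[i] = parts[i].replace(needle, replacement)
    return "".join(parts)

s = 'foobar <a>this is a foobar <span>foobar</span> test</a> foobar'
print(replace_outside_a_elements(s))
# !!!found!!! <a>this is a foobar <span>foobar</span> test</a> !!!found!!!
```

Unlike the lookahead approach, this leaves the nested `<span>foobar</span>` inside the link untouched, because the whole A element is carved out before any replacing happens.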

Gumbo 2009-10-03 10:14:36
