ansaurus

Question

Regex challenge: Match phrase only if outside of an <a href> tag

Answer 1

A:

You can't solve it because it can't be done, at least not with 100% reliability. HTML is not a "regular" language in the regular expression sense. Like the saying goes, when you have a hammer, everything starts to look like a nail. There are some things regular expressions aren't good at. This is one of them.

Most languages have some form of HTML parsing library as standard or easily obtained. Use those. That's what they were designed for.
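As a rough illustration of the parser route (not part of the original answer, and written in Python rather than ASP), here is a minimal sketch using the standard library's `html.parser`. The class name `OutsideAnchorFinder` and the helper `find_outside_anchors` are made up for this example; the idea is just to track whether the parser is currently inside an `<a>` element and to search only the text seen outside one.

```python
import re
from html.parser import HTMLParser


class OutsideAnchorFinder(HTMLParser):
    """Collect phrase matches from text that is not inside an <a> element."""

    def __init__(self, phrases):
        super().__init__(convert_charrefs=True)
        # Escape each phrase so characters like "/" or "." are taken literally.
        self.pattern = re.compile("|".join(re.escape(p) for p in phrases))
        self.anchor_depth = 0
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        # Only search text nodes seen while no <a> element is open.
        if self.anchor_depth == 0:
            self.matches.extend(m.group(0) for m in self.pattern.finditer(data))


def find_outside_anchors(html, phrases):
    """Hypothetical helper: return the phrases found outside <a> elements."""
    finder = OutsideAnchorFinder(phrases)
    finder.feed(html)
    finder.close()
    return finder.matches
```

Attribute values such as `href="ROI.htm"` never reach `handle_data`, so they are ignored without any special handling.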

cletus 2009-10-13 01:35:54

Answer 2

A:

In general, you can't use a regular expression to recognize arbitrarily nested constructs (such as bracket-delimited HTML tags). If you had solved this problem, there'd be a lot of mathematicians lining up to hear about it. :)

Having said that, .NET does indeed offer an extension to regular expressions that permits what I just said was impossible, and--even better!--the sample chapter for the great "Mastering Regular Expressions" available here happens to cover that feature.

Jonathan Feinberg 2009-10-13 01:37:22

This question is about ASP classic

Rex M 2009-10-13 01:39:13

Answer 3

+1 A:

This problem is, as they say, "non-trivial" in its current state. However, if you could modify your system to output more semantic markup, it would make things much easier:

<a href="ROI.htm">undesired tag match</a>
This is <span class="tag">a tag</span>

In this case, you can simply search:

(?<=<span class=\"tag\">)(phrase1|phrase2|phrase3)(?=</span>)

Or something a little more robust

(?<=<span class=\"tag\">).+?(?=</span>)

This way you can easily focus your searches to data within a specific <span>, and leave everything else aside.
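A small sketch of the second pattern in action, using Python's `re` module purely for illustration (the function name `tagged_phrases` is an assumption of this example). One caveat worth checking before relying on it: as far as I know, the VBScript RegExp object available in classic ASP does not support lookbehind, so `(?<=...)` would need a different formulation there, such as a capturing group.

```python
import re

# Same idea as the pattern above: text between <span class="tag"> and </span>.
TAGGED = re.compile(r'(?<=<span class="tag">).+?(?=</span>)')


def tagged_phrases(html):
    """Hypothetical helper: return the contents of every <span class="tag">."""
    return TAGGED.findall(html)
```

Run against the two sample lines above, this picks up only the span contents and ignores the anchor text.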

Rex M 2009-10-13 01:41:45

It's my database so I have full control over the markup. I can ensure that every <a></a> is written as you've written it above. With that in place, is it possible to do this with negative lookaheads or lookbehinds?

kgaebler 2009-10-13 01:47:44

Rex, if I understand what you are saying, you are saying that if I would just pre-markup all my glossary words with tags, then it would be easy to find them. While I appreciate the feedback, it's not what I am trying to do. I'm trying to modify the article text just prior to serving it such that the glossary entries, as defined in a Glossary Database, show up as links. If I was willing to pre-process the articles and add span tags as you suggest, then I might as well just hard-code the links to the glossary entries.

kgaebler 2009-10-13 02:27:15

@kgaebler indeed, you might as well! You're storing a low-fidelity copy of your data as the master and attempting to reconstruct a higher-fidelity version at extraction time. That's a losing game, as you'll see from the other answers here.

Rex M 2009-10-13 02:37:03

Fair enough. I think you are saying that having articles stored in a database with HTML markup instead of some other semantic markup has me doomed from the start. I don't think I have the smarts to change the DB to a new semantic markup language other than HTML. So, I guess in my loop, I will check for e.g. <a *href *=.*(accounts receivable|A/R).*</a> and if I get a match then I will just skip that glossary entry and go to the next. As such, I will always create glossary links unless the glossary phrase is inside an <a></a> tag. It's a hack, but I think it may work. Thanks for the help and brain food.

kgaebler 2009-10-13 02:46:29

Answer 4

A:

(accounts receivable|A/R)(?!((?!</?a\b).)*</a)

(phrase1|phrase2|phrase3)(?!((?!</?a\b).)*</a)

The above approach seems to work, at least in my RegexBuddy software. I didn't figure it out on my own. Had some help from a guru. Time to test it in my ASP code. Thanks to all who provided input. I'm sure I didn't describe what I needed well enough for you to come up with the above solution. Mea culpa.
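For anyone who wants to see how the lookahead behaves, here is a minimal sketch in Python (chosen only for illustration; the function name `phrases_outside_anchors` is made up). The reading of the pattern is: match the phrase, then refuse the match if a `</a` can be reached from there without first crossing another `<a` or `</a` tag, since that would mean the phrase sits inside an open anchor.

```python
import re

# The first pattern from above, unchanged.
OUTSIDE_ANCHOR = re.compile(r"(accounts receivable|A/R)(?!((?!</?a\b).)*</a)")


def phrases_outside_anchors(html):
    """Hypothetical helper: return phrase matches not enclosed in an <a> element."""
    return [m.group(1) for m in OUTSIDE_ANCHOR.finditer(html)]
```

One thing to watch: in Python, `.` does not match a newline by default, so an anchor whose closing tag is on a later line than the phrase would slip through unless the flag for dot-matches-newline is set or `.` is replaced with `[\s\S]`.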

kgaebler 2009-10-13 02:56:28

Answer 5

A:

Try this regex:

<a\b[^<>]*>[\s\S]*?</a>|(ROI|return on investment|investment return)

This matches an HTML anchor, or any of the terms you're looking for. The terms are captured into group number 1. So in your VBScript code, check if the first capturing group matched anything, and you've got one of your keywords outside an <a> tag.

This regex indeed won't work correctly if you have nested <a> tags. That shouldn't be a problem, as anchors are normally not nested inside each other. If it is a problem, you can't solve it with VBScript/JavaScript regular expressions. The regex also won't work correctly if you have <a> tags that are missing their closing tags. If you want to take that into account, try this regex:

<a\b[^<>]*>(?:(?:(?!<a\b)[\s\S])*?</a>)?|(ROI|return on investment|investment return)
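To make the "check the capturing group" step concrete, here is a sketch in Python rather than VBScript, using the first regex from this answer. The function name `link_terms` and the `glossary.htm` URL are placeholders invented for the example. Each match is either a whole anchor (group 1 is empty, so it is returned untouched) or a bare term (group 1 is set, so it gets wrapped in a link).

```python
import re

# Whole anchors are matched by the left alternative and never captured;
# only terms outside an anchor land in group 1.
ANCHOR_OR_TERM = re.compile(
    r"<a\b[^<>]*>[\s\S]*?</a>|(ROI|return on investment|investment return)"
)


def link_terms(html, url="glossary.htm"):
    """Hypothetical helper: wrap glossary terms found outside <a> tags in a link."""

    def replace(match):
        term = match.group(1)
        if term is None:
            # The match was an existing anchor: leave it exactly as it was.
            return match.group(0)
        return '<a href="{}">{}</a>'.format(url, term)

    return ANCHOR_OR_TERM.sub(replace, html)
```

Because the regex consumes each existing anchor in one match, the terms inside it are skipped over rather than tested and rejected, which is what keeps the pattern free of lookaround.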

Jan Goyvaerts 2009-10-13 08:22:56

OMG. Jan Goyvaerts for real? That is so cool. Dude, RegexBuddy rocks. Thanks tons for the help. I will give it a try.

kgaebler 2009-10-13 17:15:36

I really like the technique. So you put the stuff you want to find in parens on one side of the | and you put the stuff you want to ignore on the other side without the parens and just check if you got a match in the capturing group for a given match. Excellent. Just connected a couple of synapses over here. Thanks.

kgaebler 2009-10-13 17:23:43

That's how it works.

Jan Goyvaerts 2009-10-17 08:53:10
