I am working on improving our glossary functionality in a custom CMS that is running with classic ASP (ASP 3.0) on IIS with VBScript code. I am stumped on a regex challenge I cannot solve.

Here is the current code:

    ' Only add glossary links to articles that contain no "href" at all
    If InStr(ART_ArticleBody, "href") = 0 Then
        sql = "SELECT URL, Term, RegX FROM GLOSSARYDB;"
        Set rsGlossary = Server.CreateObject("ADODB.Recordset")
        rsGlossary.Open sql, strSQLConn
        Set RegExObject = New RegExp
        While Not rsGlossary.EOF
            URL = rsGlossary("URL")
            Phrase = rsGlossary("RegX")
            With RegExObject
                .Pattern = Phrase
                .IgnoreCase = True
                .Global = False
            End With
            Set expressionmatch = RegExObject.Execute(ART_ArticleBody)
            If expressionmatch.Count > 0 Then
                For Each expressionmatched In expressionmatch
                    RegExObject.Pattern = Phrase
                    ' Wrap the matched text in a link to the glossary page
                    URL = "<a href=" & URL & ">" & expressionmatched.Value & "</a>"
                    ART_ArticleBody = RegExObject.Replace(ART_ArticleBody, URL)
                Next
            End If
            rsGlossary.MoveNext
        Wend
        rsGlossary.MoveFirst
        Set RegExObject = Nothing
    End If

Instead of skipping putting glossary links in any article that has an href in it, as the above code does, I would like to change the code to process every article but have the RegEx pattern avoid matching on a glossary entry if the match is inside of an a tag.

For example, below is a test passage for this regex entry in my DB: ROI|return on investment|investment return

Here is a link that uses the glossary term: <a href="ROI.htm">Info on return on investment</a>. Now, here is the glossary term in plain text, not inside of a link: return on investment. We want to find the third instance of a match but not find the first two because they are both inside of an HTML link.

In the above text, if I were processing the article for the glossary entry "ROI|return on investment|investment return", I do not want to match the first or second occurrence because they are inside an a tag. I need the regex pattern to skip over those and match only occurrences that are not inside of an a tag.

Any help on this would be greatly appreciated.

A: 

You can't solve it because it can't be done, at least not with 100% reliability. HTML is not a "regular" language in the regular expression sense. Like the saying goes, when you have a hammer, everything starts to look like a nail. There are some things regular expressions aren't good at. This is one of them.

Most languages have some form of HTML parsing library as standard or easily obtained. Use those. That's what they were designed for.

cletus
A: 

In general, you can't use a regular expression to recognize arbitrarily nested constructs (such as bracket-delimited HTML tags). If you had solved this problem, there'd be a lot of mathematicians lining up to hear about it. :)

Having said that, .NET does indeed offer an extension to regular expressions that permits what I just said was impossible, and--even better!--the sample chapter for the great "Mastering Regular Expressions" available here happens to cover that feature.

Jonathan Feinberg
This question is about ASP classic
Rex M
+1  A: 

This problem is, as they say, "non-trivial" in its current state. However, if you could modify your system to output more semantic markup, it would make things much easier:

<a href="ROI.htm">undesired tag match</a>
This is <span class="tag">a tag</span>

In this case, you can simply search:

(?<=<span class=\"tag\">)(phrase1|phrase2|phrase3)(?=</span>)

Or something a little more robust

(?<=<span class=\"tag\">).+?(?=</span>)

This way you can easily focus your searches to data within a specific <span>, and leave everything else aside.

Rex M
It's my database so I have full control over the markup. I can ensure that every <a></a> is written as you've written it above. With that in place, is it possible to do this with negative lookaheads or lookbehinds?
kgaebler
Rex, if I understand what you are saying, you are saying that if I would just pre-markup all my glossary words with tags, then it would be easy to find them. While I appreciate the feedback, it's not what I am trying to do. I'm trying to modify the article text just prior to serving it such that the glossary entries, as defined in a Glossary Database, show up as links. If I was willing to pre-process the articles and add span tags as you suggest, then I might as well just hard-code the links to the glossary entries.
kgaebler
@kgaebler indeed, you might as well! You're storing a low-fidelity copy of your data as the master and attempting to reconstruct a higher-fidelity version at extraction time. That's a losing game, as you'll see from the other answers here.
Rex M
Fair enough. I think you are saying that having articles stored in a database with HTML markup instead of some other semantic markup has me doomed from the start. I don't think I have the smarts to change the DB to a new semantic markup language other than HTML. So, I guess in my loop, I will check for e.g. <a *href *=.*(accounts receivable|A/R).*</a> and if I get a match then I will just skip that glossary entry and go to the next. As such, I will always create glossary links unless the glossary phrase is inside an <a></a> tag. It's a hack, but I think it may work. Thanks for the help and brain food.
kgaebler
A: 
(accounts receivable|A/R)(?!((?!</?a\b).)*</a)

(phrase1|phrase2|phrase3)(?!((?!</?a\b).)*</a)

The above approach seems to work, at least in my RegexBuddy software. I didn't figure it out on my own. Had some help from a guru. Time to test it in my ASP code. Thanks to all who provided input. I'm sure I didn't describe what I needed well enough for you to come up with the above solution. Mea culpa.
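
A minimal sketch of how that lookahead pattern might be used from VBScript, offered as an illustration rather than the code that was actually deployed. The function name LinkGlossaryTerms and its parameters are hypothetical; terms is assumed to be the alternation stored in the RegX column and url the glossary page address.

```vbscript
' Hypothetical helper (not from the thread). Links every occurrence of the
' alternation in `terms` unless the next anchor tag after it is a closing
' </a>, which would mean the occurrence sits inside an existing link.
Function LinkGlossaryTerms(body, terms, url)
    Dim re
    Set re = New RegExp
    re.Pattern = "(" & terms & ")(?!((?!</?a\b).)*</a)"
    re.IgnoreCase = True
    re.Global = True
    ' $1 is the text that matched, so the article's own casing is kept.
    ' Assumes url contains no "$" or double-quote characters.
    LinkGlossaryTerms = re.Replace(body, "<a href=""" & url & """>$1</a>")
End Function
```

With the test passage from the question, this should leave the existing link untouched and wrap only the plain-text occurrence. Because "." does not match line breaks in VBScript, a term separated from its closing </a> by a newline would be treated as outside the link, so this is best seen as a starting point.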

kgaebler
A: 

Try this regex:

<a\b[^<>]*>[\s\S]*?</a>|(ROI|return on investment|investment return)

This matches an HTML anchor, or any of the terms you're looking for. The terms are captured into group number 1. So in your VBScript code, check if the first capturing group matched anything, and you've got one of your keywords outside an <a> tag.

This regex indeed won't work correctly if you have nested <a> tags. That shouldn't be a problem, as anchors are normally not nested inside each other. If it is a problem, you can't solve it with VBScript/JavaScript regular expressions. The regex also won't work correctly if you have <a> tags that are missing their closing tags. If you want to take that into account, try this regex:

<a\b[^<>]*>(?:(?:(?!<a\b)[\s\S])*?</a>)?|(ROI|return on investment|investment return)
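
A minimal sketch of the capture-group check in VBScript, using the first pattern above. The function name LinkTermsOutsideAnchors and its parameters are hypothetical, and url is assumed to be the glossary page address.

```vbscript
' Hypothetical helper (not from the thread). Links only the terms that are
' found outside <a>...</a>, by checking whether capturing group 1 took part
' in each match.
Function LinkTermsOutsideAnchors(body, terms, url)
    Dim re, matches, i, m, result
    Set re = New RegExp
    re.Pattern = "<a\b[^<>]*>[\s\S]*?</a>|(" & terms & ")"
    re.IgnoreCase = True
    re.Global = True
    Set matches = re.Execute(body)
    result = body
    ' Walk the matches from last to first so earlier offsets stay valid
    For i = matches.Count - 1 To 0 Step -1
        Set m = matches(i)
        ' A whole anchor matches the left alternative and leaves group 1 empty
        If Len(m.SubMatches(0)) > 0 Then
            ' FirstIndex is zero-based, Mid is one-based
            result = Left(result, m.FirstIndex) & _
                     "<a href=""" & url & """>" & m.Value & "</a>" & _
                     Mid(result, m.FirstIndex + m.Length + 1)
        End If
    Next
    LinkTermsOutsideAnchors = result
End Function
```

Splicing by position instead of calling Replace avoids having to decide what to substitute for the anchors that matched the left alternative; they are simply left as they are.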
Jan Goyvaerts
OMG. Jan Goyvaerts for real? That is so cool. Dude, RegexBuddy rocks. Thanks tons for the help. I will give it a try.
kgaebler
I really like the technique. So you put the stuff you want to find in parens on one side of the | and you put the stuff you want to ignore on the other side without the parens and just check if you got a match in the capturing group for a given match. Excellent. Just connected a couple of synapses over here. Thanks.
kgaebler
That's how it works.
Jan Goyvaerts