I'm trying to find a single regular expression that I can use to parse a block of HTML to find some specific text, but only if that text is not part of an existing hyperlink. I want to turn the non-links into links, which is easy, but identifying the non-linked ones with a single expression seems more troublesome. In the following example:

  This problem is a result of BugID 12.
  If you want more information, refer to <a href="/bug.aspx?id=12">BugID 12</a>.

I want a single expression that finds the first "BugID 12" so I can link it, but does not match the second one, because it is already linked.

In case it matters, I'm using .NET's regular expressions.

+2  A: 

Don't do it! See Jeff Atwood's "Parsing Html The Cthulhu Way"!
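For illustration, here is a rough sketch of the parser-based route, written in Python with the standard library's html.parser rather than .NET, so treat it as an outline of the idea and not a drop-in solution. The names BugLinker and link_bugs are made up for this example, and the link format is assumed from the URL in the question. Comments, doctypes and processing instructions are not re-emitted by this sketch.

```python
import re
from html.parser import HTMLParser

# Assumption: bug references look like "BugID <number>", as in the question.
BUG_RE = re.compile(r'BugID (\d+)')


class BugLinker(HTMLParser):
    """Re-emits the HTML, linking bug references found outside <a> elements."""

    def __init__(self):
        # Keep entity references as-is so they can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.anchor_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.anchor_depth += 1
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tag such as <br/>: emit it once, with no end tag.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == 'a' and self.anchor_depth:
            self.anchor_depth -= 1
        self.out.append(f'</{tag}>')

    def handle_data(self, data):
        # Only text outside any anchor is rewritten.
        if self.anchor_depth == 0:
            data = BUG_RE.sub(r'<a href="/bug.aspx?id=\1">\g<0></a>', data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f'&{name};')

    def handle_charref(self, name):
        self.out.append(f'&#{name};')


def link_bugs(html):
    parser = BugLinker()
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)


sample = ('This problem is a result of BugID 12.\n'
          'If you want more information, refer to '
          '<a href="/bug.aspx?id=12">BugID 12</a>.')
print(link_bugs(sample))
```

Because the parser tracks whether it is inside an anchor, markup nested in the link (a span, for instance) does not fool it the way it can fool a lookahead.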

David Pfeffer
And here I thought all things Cthulhu were good and holy. How wrong I was!
Jason D
I re-read this article and am considering its implications for my decision. Adding a new framework to deal with HTML isn't something I really wanted to do in my application, but I understand Jeff's point.
mk
+1  A: 

.NET supports negative lookaheads, so you can use:

(BugID 12)(?!</a>)  // match BugID 12 if it is not followed by a closing anchor tag.
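A quick way to see what this pattern does is to run it over the text from the question. The sketch below uses Python's re module instead of .NET; the negative lookahead syntax is the same in both. The replacement URL is assumed from the link in the question.

```python
import re

# Sample text from the question.
html = (
    'This problem is a result of BugID 12.\n'
    'If you want more information, refer to '
    '<a href="/bug.aspx?id=12">BugID 12</a>.'
)

# Match "BugID 12" only when it is not immediately followed by "</a>".
pattern = re.compile(r'(BugID 12)(?!</a>)')

# Wrap each match in a link; the already-linked occurrence is skipped.
linked = pattern.sub(r'<a href="/bug.aspx?id=12">\1</a>', html)
print(linked)
```

Only the first occurrence is rewritten, because the second one is followed directly by the closing anchor tag.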

However, there is still the danger that BugID 12 will be inside an anchor along with other text, like this:

<a href="...">Something BugID 12 Something</a>

But you can mostly overcome this with:

(BugID 12)(?!(?:\s*\w*)*</a>)  // (?:\s*\w*)* matches any word characters or spaces between the string and the end tag.
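Here is a small check of that idea in Python (same lookahead syntax as .NET). Note one swap: the sketch writes the filler as [\s\w]* instead of (?:\s*\w*)*. Both accept the same runs of whitespace and word characters, but the character class avoids nested quantifiers, which can backtrack badly on long runs of plain text.

```python
import re

# [\s\w]* stands in for the answer's (?:\s*\w*)*: same accepted text,
# but without a quantifier nested inside another quantifier.
pattern = re.compile(r'(BugID 12)(?![\s\w]*</a>)')

plain = 'This problem is a result of BugID 12.'
wrapped = '<a href="...">Something BugID 12 Something</a>'

found_plain = pattern.search(plain) is not None
found_wrapped = pattern.search(wrapped) is not None
print(found_plain, found_wrapped)  # prints: True False
```

The plain-text occurrence is found, while the one surrounded by other words inside the anchor is left alone.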

Disclaimer: Parsing HTML with regex is not reliable and should only be done as a last resort, or in the simplest of cases. I'm sure there are plenty of instances where the above expression does not perform as desired (for example: BugID 12</span></a>).
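To make that failure case concrete, the following Python sketch runs the expression above against a link whose text is wrapped in a span. The sample markup is invented for the demonstration.

```python
import re

# The expression from this answer, unchanged.
pattern = re.compile(r'(BugID 12)(?!(?:\s*\w*)*</a>)')

# The text is already linked, but a </span> sits between it and </a>.
nested = '<a href="/bug.aspx?id=12"><span>BugID 12</span></a>'

# The lookahead only tolerates whitespace and word characters before </a>,
# so the "<" of </span> stops it and the pattern matches anyway.
false_positive = pattern.search(nested) is not None
print(false_positive)  # prints: True
```

Linking this match would put an anchor inside an existing anchor, which is exactly the situation the expression was meant to prevent.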

Joel Potter
Thank you; this has given me enough to go on.
mk