I need to do a regex replacement that wraps a hyperlink around a string, but only if that string isn't already inside a hyperlink. How would I do this?

So, for example, let's take the text:

The quick brown fox.

I want to make "quick brown" a link, like this:

The <a href="http://www.stackoverflow.com/">quick brown</a> fox.

But if I find the text:

The <a href="http://www.stackoverflow.com/">quick brown</a> fox.

I want to be sure that I don't wrap "quick brown" in another hyperlink.

How would I do this?

A: 

It seems as if you are parsing rendered HTML. If that is the case, why not parse the raw HTML? Then the problem becomes trivial.

ennuikiller
I don't see how it becomes trivial. I don't understand the difference between raw and rendered HTML. HTML is a format; the browser renders that format into an interface. The documents I'm running the regex against are HTML documents, so there's no way to remove the HTML.
Laran Evans
+1  A: 

Lookarounds could get you somewhere. Though far from perfect, here is a quick regex that checks whether your text has already been wrapped in anchor tags.

(?<=>)quick brown(?=</a>)

Note: lookbehind assertions need to be fixed length (at least in PCRE).
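
As a rough illustration of this check, here is a minimal sketch in Python. The choice of Python's `re` module and the name `ALREADY_LINKED` are assumptions; the question does not name a language. The sample strings are the ones from the question.

```python
import re

# The phrase must be directly preceded by ">" and directly followed by "</a>".
# The lookbehind (?<=>) is one character wide, so it is fixed length.
ALREADY_LINKED = re.compile(r"(?<=>)quick brown(?=</a>)")

plain = "The quick brown fox."
linked = 'The <a href="http://www.stackoverflow.com/">quick brown</a> fox.'

print(bool(ALREADY_LINKED.search(plain)))   # False
print(bool(ALREADY_LINKED.search(linked)))  # True
```

This only recognizes the phrase when it fills the whole link text. A link such as <a>the quick brown fox</a> would not be detected, which is one of the ways the check is imperfect.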

Geert
+1  A: 

If the string you want to wrap a link around is YOUR_STRING, first identify all places where YOUR_STRING is surrounded by a link tag.

regex = <a[^>]*>[^<]*(YOUR_STRING)[^<]*</a>

Starts with <a

Followed by a sequence of length zero or more that doesn't contain >.

Followed by >

Followed by a sequence of length zero or more that doesn't contain <.

Followed by YOUR_STRING. This is a capturing group.

Followed by a sequence of length zero or more that doesn't contain <.

Followed by </a>

Now you can identify the character offsets of the places where the captured group YOUR_STRING is surrounded by a link tag.
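
A minimal sketch of collecting those offsets, assuming Python's `re` module (the question does not name a language). The helper name `linked_spans` is hypothetical, and `re.escape` is added so that a phrase containing regex metacharacters is still matched literally.

```python
import re

def linked_spans(html, phrase):
    """Return (start, end) offsets of `phrase` where it sits inside <a>...</a>."""
    pattern = re.compile(r"<a[^>]*>[^<]*(" + re.escape(phrase) + r")[^<]*</a>")
    # span(1) gives the offsets of the capturing group, not of the whole tag.
    return [m.span(1) for m in pattern.finditer(html)]

html = 'The <a href="http://www.stackoverflow.com/">quick brown</a> fox.'
spans = linked_spans(html, "quick brown")
for start, end in spans:
    print(start, end, html[start:end])
```

Two caveats with this sketch: because `[^<]*` is greedy, only the last occurrence of the phrase inside a single link is captured, and `<a[^>]*>` also accepts other tags whose names begin with "a".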

In every other place where YOUR_STRING occurs literally, wrap the link tag around it.

Bonus point: Note that inserting text into a string changes the character offsets of everything after it, and depending on the library you are using, your regex may throw a ConcurrentModificationException or otherwise not allow you to insert text while the match is in progress. The best way to handle this is to create a separate StringBuffer and append text to it as you analyze your original string.
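
The whole procedure could look roughly like the following Python sketch, which builds the output in a separate list of pieces rather than editing the input. The function name `wrap_unlinked` and its parameters are hypothetical. As a small deviation from the description above, it protects the whole matched link rather than only the captured group, so that several occurrences inside one link are all left alone.

```python
import re

def wrap_unlinked(html, phrase, url):
    """Wrap `phrase` in a link unless it already sits inside <a>...</a>."""
    escaped = re.escape(phrase)
    anchor = re.compile(r"<a[^>]*>[^<]*" + escaped + r"[^<]*</a>")
    # Offsets of whole links that already contain the phrase.
    protected = [m.span() for m in anchor.finditer(html)]

    pieces = []  # output is built separately; the input is never modified
    last = 0
    for m in re.finditer(escaped, html):
        start, end = m.span()
        if any(a <= start and end <= b for a, b in protected):
            continue  # already linked, leave it untouched
        pieces.append(html[last:start])
        pieces.append('<a href="' + url + '">' + phrase + "</a>")
        last = end
    pieces.append(html[last:])
    return "".join(pieces)

url = "http://www.stackoverflow.com/"
print(wrap_unlinked("The quick brown fox.", "quick brown", url))
print(wrap_unlinked('The <a href="' + url + '">quick brown</a> fox.', "quick brown", url))
```

Both calls should print the same linked sentence from the question. The sketch does not escape `url`, and it would still wrap the phrase if it appeared inside an attribute value, so treat it as a starting point only.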

Also note: The regex that identifies the hyperlink tag could be made stricter (for correct HTML), but this version should work for bad HTML too, e.g. a link with a missing href attribute such as <a>quick brown fox</a>. If the HTML you expect can be imperfect in other ways and you would like to handle those cases, modify the regex accordingly.

Hope it works.

hashable