I have a wysiwyg text area in a Java webapp. Users can input text and style it or paste some already HTML-formatted text.

What I am trying to do is linkify the text, i.e. convert every URL in the text into its "working counterpart" by wrapping it in <a href="...">...</a>.

This solution works when all I have is plain text:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String r = "http(s)?://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";
Pattern pattern = Pattern.compile(r, Pattern.DOTALL | Pattern.UNIX_LINES | Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(comment);
comment = matcher.replaceAll("<a href=\"$0\">$0</a>"); // group 0 is the whole match

The problem arises when the text is already formatted, i.e. when it already contains <a href="...">...</a> tags.
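To make the failure concrete, here is a small sketch that runs the pattern above on plain text and on text that already contains a link. The class name NaiveLinkifyDemo, the linkify helper, and the sample strings are made up for illustration; the pattern and flags are the ones from the question.

```java
import java.util.regex.Pattern;

class NaiveLinkifyDemo {
    // URL pattern and flags from the question, only split across two lines.
    static final Pattern URL = Pattern.compile(
        "http(s)?://([\\w+?\\.\\w+])+"
        + "([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?",
        Pattern.DOTALL | Pattern.UNIX_LINES | Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: wraps every match in an anchor tag.
    static String linkify(String comment) {
        return URL.matcher(comment).replaceAll("<a href=\"$0\">$0</a>");
    }

    public static void main(String[] args) {
        // Plain text: the URL is wrapped as intended.
        System.out.println(linkify("see http://example.com/page now"));
        // prints: see <a href="http://example.com/page">http://example.com/page</a> now

        // Already linked text: the URL inside href is wrapped again, breaking the markup.
        System.out.println(linkify("<a href=\"http://example.com\">Example</a>"));
        // prints: <a href="<a href="http://example.com">http://example.com</a>">Example</a>
    }
}
```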

So I am looking for a way to keep the pattern from matching when the URL sits inside an existing <a> tag. I have read that this can be achieved with lookahead or lookbehind, but I still can't make it work: the regex still matches, so I must be doing something wrong. And yes, I have been playing around with and debugging the groups, changing $0 to $1, etc.

Any ideas?

A: 

Perhaps HTML parsing would be more appropriate for you (with htmlparser, for example). You would then have HTML nodes and could linkify only the URLs in the text, not those in the attributes.
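As a rough illustration of the idea, the sketch below uses only the standard library rather than htmlparser itself: it splits the input into tags and text, tracks whether the current text sits inside an <a> element, and linkifies only text outside of one. The class name TagAwareLinkifier and the simplified URL pattern are assumptions for this example, and the tag splitting is deliberately naive (it does not handle a ">" inside attribute values, comments, or script blocks), so a real parser remains the more robust choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagAwareLinkifier {
    // Simplified URL pattern (an assumption for this sketch, not the question's pattern).
    private static final Pattern URL =
        Pattern.compile("https?://[^\\s<>\"]+", Pattern.CASE_INSENSITIVE);
    // Naive tag matcher: anything between '<' and the next '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern OPEN_A =
        Pattern.compile("<a(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);
    private static final Pattern CLOSE_A =
        Pattern.compile("</a\\s*>", Pattern.CASE_INSENSITIVE);

    static String linkify(String html) {
        StringBuilder out = new StringBuilder();
        Matcher tag = TAG.matcher(html);
        int pos = 0;          // end of the previously consumed tag
        int anchorDepth = 0;  // > 0 while we are inside an <a> element
        while (tag.find()) {
            appendText(out, html.substring(pos, tag.start()), anchorDepth > 0);
            String t = tag.group();
            if (OPEN_A.matcher(t).matches()) {
                anchorDepth++;
            } else if (CLOSE_A.matcher(t).matches()) {
                anchorDepth = Math.max(0, anchorDepth - 1);
            }
            out.append(t);    // tags (and their attributes) are copied untouched
            pos = tag.end();
        }
        appendText(out, html.substring(pos), anchorDepth > 0);
        return out.toString();
    }

    // Text inside an existing link is copied as-is; other text gets its URLs wrapped.
    private static void appendText(StringBuilder out, String text, boolean insideAnchor) {
        if (insideAnchor) {
            out.append(text);
        } else {
            out.append(URL.matcher(text).replaceAll("<a href=\"$0\">$0</a>"));
        }
    }
}
```

With this approach, neither the href value nor the text of an existing link is touched, while URLs in ordinary text nodes are still wrapped.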

kgiannakakis
A: 

If you have to roll your own, at least look at the algorithms/patterns used in an Open Source implementation of Markdown, e.g., MarkdownJ.

Hank Gay
+1  A: 

If you want to use regex, (though I think parsing to XML/HTML first is more robust) I think look-ahead or -behind makes sense. A first stab might be to add this at the end of your regex:

(?!</a>)

Meaning: don't match if there's a closing a tag just afterwards. (This could be tweaked forever, of course.) This doesn't work well, though, because given the string

<a href="...">http://example.com/</a>

This regex will try to match "http://example.com/", fail due to the lookahead (as we hope), and then backtrack the greedy quantifier at the end and match "http://example.com" instead, which doesn't have a </a> directly after it.

You can fix the latter problem by making your +, * and ? operators possessive quantifiers: just add a + after each of them. This prevents them from backtracking, which is probably good for performance as well.

This works for me (note the three extra +'s):

String r = "http(s)?://([\\w+?\\.\\w+])++([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*+)?+(?!</a>)";
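A small sketch comparing the two variants on the example string from this answer. The class name PossessiveLookaheadDemo, the TAIL_CLASS constant, and the linkify helper are made up for illustration; the patterns are the question's pattern with the lookahead appended, once with greedy and once with possessive quantifiers.

```java
import java.util.regex.Pattern;

class PossessiveLookaheadDemo {
    // Trailing character class from the question, factored out for readability.
    static final String TAIL_CLASS =
        "[a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]";
    static final int FLAGS =
        Pattern.DOTALL | Pattern.UNIX_LINES | Pattern.CASE_INSENSITIVE;

    // Greedy quantifiers plus the lookahead: backtracking can shorten the match
    // until the lookahead no longer sees "</a>".
    static final Pattern GREEDY = Pattern.compile(
        "http(s)?://([\\w+?\\.\\w+])+(" + TAIL_CLASS + "*)?(?!</a>)", FLAGS);

    // Possessive quantifiers (++, *+, ?+): nothing is given back once matched.
    static final Pattern POSSESSIVE = Pattern.compile(
        "http(s)?://([\\w+?\\.\\w+])++(" + TAIL_CLASS + "*+)?+(?!</a>)", FLAGS);

    static String linkify(Pattern p, String s) {
        return p.matcher(s).replaceAll("<a href=\"$0\">$0</a>");
    }

    public static void main(String[] args) {
        String linked = "<a href=\"...\">http://example.com/</a>";
        System.out.println(linkify(GREEDY, linked));
        // prints: <a href="..."><a href="http://example.com">http://example.com</a>/</a>
        System.out.println(linkify(POSSESSIVE, linked));
        // prints: <a href="...">http://example.com/</a>
        System.out.println(linkify(POSSESSIVE, "see http://example.com/ now"));
        // prints: see <a href="http://example.com/">http://example.com/</a> now
    }
}
```

Note that the lookahead only guards a URL that is directly followed by </a>. A real URL inside href="..." is followed by a quote instead, so this pattern on its own would still match it.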
Jesse Rusak
Thanks, I will consider this improvement.
frank06
+7  A: 

You are close. You can use a "negative lookbehind" like so:

(?<!href=")http:// etc

All matches immediately preceded by href=" will be ignored.
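A minimal sketch of this, assuming the lookbehind is simply prepended to the pattern from the question. The class name HrefLookbehindDemo, the linkify helper, and the sample strings are hypothetical.

```java
import java.util.regex.Pattern;

class HrefLookbehindDemo {
    // Question's pattern with a negative lookbehind for href=" in front.
    static final Pattern URL = Pattern.compile(
        "(?<!href=\")http(s)?://([\\w+?\\.\\w+])+"
        + "([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?",
        Pattern.DOTALL | Pattern.UNIX_LINES | Pattern.CASE_INSENSITIVE);

    static String linkify(String comment) {
        return URL.matcher(comment).replaceAll("<a href=\"$0\">$0</a>");
    }

    public static void main(String[] args) {
        // The URL inside href is skipped; the bare URL is wrapped.
        System.out.println(linkify(
            "Docs: <a href=\"http://example.com/docs\">the docs</a> and http://example.org/"));
        // prints: Docs: <a href="http://example.com/docs">the docs</a> and <a href="http://example.org/">http://example.org/</a>

        // Caveat: link text that is itself a URL is not preceded by href=",
        // so it is still matched and wrapped a second time.
        System.out.println(linkify("<a href=\"http://example.com\">http://example.com</a>"));
        // prints: <a href="http://example.com"><a href="http://example.com">http://example.com</a></a>
    }
}
```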

Kees de Kooter
Thanks, this was exactly what I needed... I was very close indeed!
frank06
I always carry the "Regular Expression Pocket Reference" with me ;-)
Kees de Kooter
+1  A: 

If you really want to do it with a regex, then:

String r = "(?<![=\"\\/>])http(s)?://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";

i.e. check that the URL is not immediately preceded by any of =, ", / or > (as it would be after href=" or directly after a tag).
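A short sketch of how this pattern behaves on a few inputs. The class name CharClassLookbehindDemo, the linkify helper, and the sample strings are made up for illustration; the flags are assumed to be the same as in the question.

```java
import java.util.regex.Pattern;

class CharClassLookbehindDemo {
    // Pattern from this answer: the lookbehind rejects a URL preceded by =, ", / or >.
    static final Pattern URL = Pattern.compile(
        "(?<![=\"\\/>])http(s)?://([\\w+?\\.\\w+])+"
        + "([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?",
        Pattern.DOTALL | Pattern.UNIX_LINES | Pattern.CASE_INSENSITIVE);

    static String linkify(String comment) {
        return URL.matcher(comment).replaceAll("<a href=\"$0\">$0</a>");
    }

    public static void main(String[] args) {
        // Both the href value (preceded by ") and the link text (preceded by >) are skipped.
        System.out.println(linkify("<a href=\"http://example.com\">http://example.com</a>"));
        // prints: <a href="http://example.com">http://example.com</a>

        // A bare URL preceded by a space is wrapped.
        System.out.println(linkify("see http://example.com"));
        // prints: see <a href="http://example.com">http://example.com</a>

        // Trade-off: a URL directly after any other tag is skipped as well.
        System.out.println(linkify("<p>http://example.com</p>"));
        // prints: <p>http://example.com</p>
    }
}
```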

siddhadev