Hi all,

I'm a relative newb when it comes to regexes, but I'm starting to get the hang of it. I started writing a method in Java to "linkify" a string, that is, scan it for any URLs (e.g., "http://...") or strings that look like web addresses ("www.example.com...") and turn them into links.

So, for example, if I had a string that looked like this:

My favorite site is http://www.example.com.  What is yours?

After running it through the method, you'd get a string back that said:

My favorite site is <a href="http://www.example.com">http://www.example.com</a>.  What is yours?

After scouring the web for a while, I was finally able to piece together parts of different expressions that do what I'm looking for. (Some of the examples I found include the trailing period in the URL itself, some re-encode URLs that are already inside anchor tags, etc.)

Here is what I have so far:

public static String toLinkifiedString(String s, IAnchorBuilder anchorBuilder)
{
 if (IsNullOrEmpty(s))
 {
  return Empty;
 }

 String r = "(?<![=\"\"\\/>])(www\\.|(http|https|ftp|news|file)(s)?://)([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?([^.|'|# |!])";

 Pattern pattern = Pattern.compile(r, Pattern.DOTALL | Pattern.UNIX_LINES | Pattern.CASE_INSENSITIVE);
 Matcher matcher = pattern.matcher(s);
 if (anchorBuilder != null)
 {
  return matcher.replaceAll(anchorBuilder.createAnchorFromUrl("$0"));
 }
 return matcher.replaceAll("<a href=\"$0\">$0</a>"); // group 0 is the whole expression
}

public interface IAnchorBuilder
{
 public String createAnchorFromUrl(String url);
}

There is also a simple version of toLinkifiedString that takes only the string s; it just calls toLinkifiedString(s, null).

So like I said, this pattern is catching everything I need it to catch, and the replaceAll is working great for every case, except for when a link begins with www. If the match begins with "www" instead of a protocol, like "http" or "ftp", I want to conditionally prepend "http://" in front of the resultant link. That is:

MyClass.toLinkifiedString("go to www.example.org")

should return

go to <a href="http://www.example.org">www.example.org</a>

The matching groups are as follows:

  • $0 - the actual url that gets found: http://www.example.org or www.example.net
  • $1 - the protocol match ("http://" or "www" for links w/o protocols)

I suppose what I want to be able to do, in pseudocode is something like:

matcher.replaceAll("<a href=\"" + (if protocol == "www." then "http://" + url, otherwise url) + "\">url</a>")

Is this possible? Or should I just be happy with being able to only create anchors from links that begin with "http://..." :)

Thanks for any help anyone can offer

+3  A: 

Looks like you are in need of a callback function that returns a dynamic result you can use instead of the fixed string you currently have in replaceAll().

I guess you can make something out of the accepted answer to this question: Java equivalent to PHP's preg_replace_callback.
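A minimal sketch of that callback idea, using only Matcher.find(), appendReplacement() and appendTail() from java.util.regex. The class name LinkifySketch and the method name linkify are made up for illustration, and the pattern is the cleaned-up one from the other answer in this thread rather than the original; treat it as one possible shape, not the definitive fix.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkifySketch {
    // Cleaned-up pattern from this thread; group 1 is "www." or a protocol plus "://".
    private static final Pattern URL = Pattern.compile(
            "(?<![=\"/>])(www\\.|(http|https|ftp|news|file)://)(\\w+?\\.\\w+)+"
            + "([a-zA-Z0-9~!@#$%^&*()_\\-=+\\/?.:;',]*)?([^.'# !])",
            Pattern.CASE_INSENSITIVE);

    static String linkify(String s) {
        Matcher m = URL.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String url = m.group();
            // The "callback": decide per match whether a protocol must be prepended.
            String href = m.group(1).equalsIgnoreCase("www.") ? "http://" + url : url;
            String anchor = "<a href=\"" + href + "\">" + url + "</a>";
            // quoteReplacement keeps any '$' or '\' in the URL from being read as a group reference.
            m.appendReplacement(out, Matcher.quoteReplacement(anchor));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(linkify("go to www.example.org"));
        System.out.println(linkify("My favorite site is http://www.example.com.  What is yours?"));
    }
}
```

The same loop could hand m.group() to anchorBuilder.createAnchorFromUrl(...) instead of building the anchor inline, so the builder sees the real URL rather than the literal "$0".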

Tomalak
Here's another one: http://elliotth.blogspot.com/2004/07/java-implementation-of-rubys-gsub.html
Alan Moore
+2  A: 

For your specific problem, definitely go with a callback function as Tomalak says.

For the problem of all those slashes, and the assorted other oddities...

Here is your current Java regex split across lines:

(?<![=\"\"\\/>])
(www\\.|(http|https|ftp|news|file)(s)?://)
([\\w+?\\.\\w+])+
([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?
([^.|'|# |!])

And the same thing as a non-Java regex (no Java string escapes):

(?<![=""\/>])
(www\.|(http|https|ftp|news|file)(s)?://)
([\w+?\.\w+])+
([a-zA-Z0-9\~\!\@\#\$\%\^\&\*\(\)_\-\=\+\\\/\?\.\:\;\'\,]*)?
([^.|'|# |!])
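The relationship between the two listings can be checked directly: the doubled backslashes only exist in the Java source, and printing the string shows what the regex engine actually receives. A small sketch using line two of the pattern (the class name EscapeLayers is made up for illustration):

```java
class EscapeLayers {
    // What the Java compiler reads: every regex backslash is doubled.
    static final String JAVA_LITERAL = "(www\\.|(http|https|ftp|news|file)(s)?://)";

    public static void main(String[] args) {
        // What the regex engine reads: the string holds a single backslash before the dot.
        System.out.println(JAVA_LITERAL); // (www\.|(http|https|ftp|news|file)(s)?://)
    }
}
```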


And here's a description of what's wrong with it... :)

Line one - you're duplicating " in the character class, and don't need to escape /

Line two - ok, except I'm not sure what you're after with the (s)? part, since you have https within the previous group anyway.

Line three - you are aware that you've got a character class there? Quantifiers don't work inside one; the + and ? are just literal characters. You probably want (\w+?\.\w+)+ instead. (That's (\\w+?\\.\\w+)+ in a Java string.)

Line four - wow, what a lot of escaping!! Almost all unnecessary. Give this a go: ([a-zA-Z0-9~!@#$%^&*()_\-=+\/?.:;',]*)? (and again: ([a-zA-Z0-9~!@#$%^&*()_\\-=+\\/?.:;',]*)? )

Line five - alternation doesn't do anything inside a character class; the | is just a literal pipe. This'll do: [^.'# !] , and add a single | if you actually want to prevent the pipe char from being there.
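The points about lines three and five can be checked with a few one-off matches. This is only an illustrative sketch; the class name CharacterClassPitfalls and the helper matches are made up, and the patterns are copied from the lines discussed above.

```java
import java.util.regex.Pattern;

class CharacterClassPitfalls {
    // Line three as written: a character class matches exactly ONE character,
    // here a word character, '+', '?' or '.'.
    static final Pattern LINE3_CLASS = Pattern.compile("[\\w+?\\.\\w+]");
    // Line three as suggested: a group that matches "word.word", one or more times.
    static final Pattern LINE3_GROUP = Pattern.compile("(\\w+?\\.\\w+)+");

    // Line five as written: '|' is a literal pipe, so pipes are excluded as well.
    static final Pattern LINE5_ORIGINAL = Pattern.compile("[^.|'|# |!]");
    // Line five cleaned up: the same exclusions minus the pipe.
    static final Pattern LINE5_CLEANED = Pattern.compile("[^.'# !]");

    // True if the whole input matches the pattern.
    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(LINE3_CLASS, "?"));           // true: '?' is a member of the class
        System.out.println(matches(LINE3_CLASS, "example.org")); // false: more than one character
        System.out.println(matches(LINE3_GROUP, "example.org")); // true
        System.out.println(matches(LINE5_ORIGINAL, "|"));        // false: pipe is excluded
        System.out.println(matches(LINE5_CLEANED, "|"));         // true: pipe is allowed again
    }
}
```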

Putting all those comments together provides this regex:

(?<![="/>])
(www\.|(http|https|ftp|news|file)://)
(\w+?\.\w+)+
([a-zA-Z0-9~!@#$%^&*()_\-=+\/?.:;',]*)?
([^.'# !])

Or, yet again, with escaping for Java:

(?<![=\"/>])
(www\\.|(http|https|ftp|news|file)://)
(\\w+?\\.\\w+)+
([a-zA-Z0-9~!@#$%^&*()_\\-=+\\/?.:;',]*)?
([^.'# !])

Note how much simpler that is!

Going back on a single line for that gives:

(?<![="/>])(www\.|(http|https|ftp|news|file)://)(\w+?\.\w+)+([a-zA-Z0-9~!@#$%^&*()_\-=+\/?.:;',]*)?([^.'# !])

or

(?<![=\"/>])(www\\.|(http|https|ftp|news|file)://)(\\w+?\\.\\w+)+([a-zA-Z0-9~!@#$%^&*()_\\-=+\\/?.:;',]*)?([^.'# !])

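But I'd stick to the multiline one - just plonk (?x) at the very start and it is a valid regex that ignores the whitespace, and you can use #s for commenting - always a good thing with regexes as long as this! (One Java catch: there (?x) also applies inside character classes, so the # and the space in the last two lines need a backslash in front of them.)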
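A hedged sketch of what the commented form might look like in Java. It assumes Java's comments mode skips whitespace and # even inside character classes, which is why both are escaped below. The class name CommentedRegex and the helper firstMatch are made up for illustration, and the single-line pattern is included only for comparison.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentedRegex {
    // Single-line form, compiled without comments mode.
    static final Pattern ONE_LINE = Pattern.compile(
            "(?<![=\"/>])(www\\.|(http|https|ftp|news|file)://)(\\w+?\\.\\w+)+"
            + "([a-zA-Z0-9~!@#$%^&*()_\\-=+\\/?.:;',]*)?([^.'# !])");

    // Multiline form with (?x). The '#' and the space inside the character
    // classes are escaped so that comments mode treats them as literals.
    static final Pattern COMMENTED = Pattern.compile(
            "(?x)\n"
            + "(?<![=\"/>])                              # not directly after an attribute or tag\n"
            + "(www\\.|(http|https|ftp|news|file)://)    # group 1: www. or a protocol\n"
            + "(\\w+?\\.\\w+)+                           # host name parts\n"
            + "([a-zA-Z0-9~!@\\#$%^&*()_\\-=+/?.:;',]*)? # rest of the URL\n"
            + "([^.'\\#\\ !])                            # last character, no trailing punctuation\n");

    // Returns the first match in s, or null if there is none.
    static String firstMatch(Pattern p, String s) {
        Matcher m = p.matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "see http://www.example.com. ok";
        System.out.println(firstMatch(ONE_LINE, text));
        System.out.println(firstMatch(COMMENTED, text));
    }
}
```

Both patterns are meant to find the same URL in the sample text, with the trailing period left out of the match.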

Peter Boughton
+1 for taking the time! :-)
Tomalak
Though I probably would have left off the escaping of the backslashes and quotes, since that is a Java String requirement, not a regex requirement. Much of the uncertainty comes from people constantly confusing which escaping is required by which system: the experienced because they know, the inexperienced because they don't, ironically.
Tomalak
Hmmm, good point. I've gone and added examples without escaping to the answer. Hopefully I've not made it too confusing having both though... maybe I should completely remove the Java ones and just have a quick line or two about escaping?
Peter Boughton
Thanks for taking the time to *thoroughly* explain :) The reason for the escaping is actually more IntelliJ than me: it automatically escapes strings when you paste them in, a behavior that can grow quite annoying in some cases.
mjd79