Hi all,
I'm relatively new to regexes, but I'm starting to get the hang of them. I started writing a method in Java to "linkify" a string: that is, scan it for URLs (e.g., "http://...") or strings that look like web addresses ("www.example.com...") and wrap each one in an anchor tag.
So, for example, if I had a string that looked like this:
My favorite site is http://www.example.com. What is yours?
After running it through the method, you'd get a string back that said:
My favorite site is <a href="http://www.example.com">http://www.example.com</a>. What is yours?
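To make the goal concrete, here is a minimal sketch of that before/after using a single replaceAll. The pattern is a deliberately simplified stand-in (http/https only) and the class name LinkifyBasics is made up for illustration; neither is the real pattern or method shown further down.

```java
import java.util.regex.Pattern;

class LinkifyBasics {
    // Simplified stand-in pattern (an assumption, not the full pattern from this post):
    // a scheme, then non-space characters, ending on a letter, digit, or slash
    // so that a trailing period is left outside the match.
    static final Pattern URL = Pattern.compile("https?://\\S*[A-Za-z0-9/]");

    static String linkify(String s) {
        // $0 in the replacement string refers to the whole match
        return URL.matcher(s).replaceAll("<a href=\"$0\">$0</a>");
    }

    public static void main(String[] args) {
        System.out.println(linkify("My favorite site is http://www.example.com. What is yours?"));
    }
}
```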
After scouring the web for a while, I was finally able to piece together parts of different expressions that do what I'm looking for. (Some of the examples I found include the trailing period at the end of a URL in the URL itself, some re-encode URLs that are already inside anchor tags, etc.)
Here is what I have so far:
public static String toLinkifiedString(String s, IAnchorBuilder anchorBuilder)
{
    if (IsNullOrEmpty(s))
    {
        return Empty;
    }
    String r = "(?<![=\"\"\\/>])(www\\.|(http|https|ftp|news|file)(s)?://)([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?([^.|'|# |!])";
    Pattern pattern = Pattern.compile(r, Pattern.DOTALL | Pattern.UNIX_LINES | Pattern.CASE_INSENSITIVE);
    Matcher matcher = pattern.matcher(s);
    if (anchorBuilder != null)
    {
        return matcher.replaceAll(anchorBuilder.createAnchorFromUrl("$0"));
    }
    return matcher.replaceAll("<a href=\"$0\">$0</a>"); // group 0 is the whole expression
}

public interface IAnchorBuilder
{
    public String createAnchorFromUrl(String url);
}
There is also a simple version of toLinkifiedString that takes only the string s; it just calls toLinkifiedString(s, null).
As I said, this pattern catches everything I need it to, and the replaceAll works for every case except when a link begins with "www". If the match begins with "www" instead of a protocol such as "http" or "ftp", I want to conditionally prepend "http://" to the href of the resulting link. That is:
MyClass.toLinkifiedString("go to www.example.org")
should return
go to <a href="http://www.example.org">www.example.org</a>
The matching groups are as follows:
- $0 - the actual url that gets found: http://www.example.org or www.example.net
- $1 - the protocol match ("http://", or "www." for links without a protocol)
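A quick way to check what the groups hold is to loop with find() and read group(0) and group(1) for each match. The sketch below assumes a simplified stand-in pattern whose group 1 has the same shape as in my pattern ("www." or a scheme plus "://"); the class name GroupDump and the method dump are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDump {
    // Simplified stand-in (assumption): group 1 is either "www." or a scheme followed by "://"
    static final Pattern URL =
            Pattern.compile("(www\\.|(?:https?|ftp)://)\\S*[A-Za-z0-9/]", Pattern.CASE_INSENSITIVE);

    // Returns one "group0 | group1" entry per match found in s
    static List<String> dump(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = URL.matcher(s);
        while (m.find()) {
            out.add(m.group(0) + " | " + m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints:
        //   http://www.example.org | http://
        //   www.example.net | www.
        for (String line : dump("see http://www.example.org or www.example.net")) {
            System.out.println(line);
        }
    }
}
```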
I suppose what I want to be able to do, in pseudocode, is something like:
matcher.replaceAll("<a href=\"" + (if $1 is "www.", then "http://" + url, otherwise url) + "\">url</a>")
Is this possible? Or should I just be happy with being able to only create anchors from links that begin with "http://..." :)
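The closest thing I can picture is swapping the single replaceAll call for a manual loop over the matches, using Matcher.appendReplacement and Matcher.appendTail to build the result. A rough sketch under that assumption, with a simplified stand-in pattern and a hypothetical class name rather than my full method above:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConditionalLinkify {
    // Simplified stand-in (assumption): group 1 is either "www." or a scheme followed by "://"
    static final Pattern URL =
            Pattern.compile("(www\\.|(?:https?|ftp)://)\\S*[A-Za-z0-9/]", Pattern.CASE_INSENSITIVE);

    static String linkify(String s) {
        Matcher m = URL.matcher(s);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String url = m.group(0);
            // Prepend a scheme only when the match started with "www."
            String href = m.group(1).equalsIgnoreCase("www.") ? "http://" + url : url;
            String anchor = "<a href=\"" + href + "\">" + url + "</a>";
            // quoteReplacement keeps any '$' or '\' in the URL from being read as a group reference
            m.appendReplacement(sb, Matcher.quoteReplacement(anchor));
        }
        m.appendTail(sb); // copy whatever follows the last match
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints: go to <a href="http://www.example.org">www.example.org</a>
        System.out.println(linkify("go to www.example.org"));
    }
}
```

Since the replacement text is built in ordinary Java inside the loop, the same spot could presumably hand href to an IAnchorBuilder instead of concatenating the anchor by hand.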
Thanks for any help anyone can offer.