I have a load of user-submitted content. It is HTML, and may contain URLs. Some of them will be <a>'s already (if the user is good) but sometimes users are lazy and just type www.something.com or at best http://www.something.com.

I can't find a decent regex to capture URLs but ignore ones that are immediately to the right of either a double quote or '>'. Anyone got one?

+2  A: 

See The Problem With URLs for the solution Jeff used on this site.

Sam Hasler
+6  A: 

Jan Goyvaerts, creator of RegexBuddy, has written a response to Jeff Atwood's blog that addresses the issues Jeff had and provides a nice solution.

\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]

In order to ignore matches that occur right next to a " or >, you could add (?<![">]) to the start of the regex, so you get

(?<![">])\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]

This will match full addresses (http://...) and addresses that start with www. or ftp. - you're out of luck with addresses like ars.userfriendly.org...
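
As a rough illustration of the lookbehind variant, here is a minimal Python sketch. The sample string and the names URL_RE, sample and bare_urls are made up for this example; the pattern is the one above, applied case-insensitively because its character classes only list A-Z.

```python
import re

# Pattern from the answer, with the (?<![">]) lookbehind prepended.
# The classes only list A-Z, so IGNORECASE is needed for lowercase URLs.
URL_RE = re.compile(
    r'(?<![">])\b(?:(?:https?|ftp|file)://|www\.|ftp\.)'
    r'[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]',
    re.IGNORECASE,
)

# Hypothetical input: one existing link followed by two bare URLs.
sample = (
    '<p>See <a href="http://example.com/a">my page</a>, '
    'www.example.org/b, and http://example.net/c?x=1.</p>'
)

bare_urls = URL_RE.findall(sample)
print(bare_urls)  # ['www.example.org/b', 'http://example.net/c?x=1']
```

The quoted href value is skipped, and the trailing comma and full stop are left out of the matches. One case this sketch does not cover: in an href such as "http://www.example.com", the www. part is preceded by / rather than " or >, so it would still be matched on its own.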

Tim Pietzcker
A: 

Shameless plug: You can look here (regular expression replace a word by a link) for inspiration.

The question asked to replace some word with a certain link, unless there already was a link. So the problem you have is more or less the same thing.

All you need is a regex that matches a URL (in place of the word). The simplest assumption would be this: a URL (optionally) starts with "http://", "ftp://" or "mailto:" and lasts as long as there are no white-space characters, line breaks, tag brackets or quotes.

Beware, long regex ahead. Apply case-insensitively.

(href\s*=\s*['"]?)?((?:http://|ftp://|mailto:)?[^.,<>"'\s\r\n\t]+(?:\.(?![.<>"'\s\r\n])[^.,!<>"'\s\r\n\t]+)+)

Be warned - this will also match URLs that are technically invalid, and it will recognize things.formatted.like.this as a URL. Whether that is too permissive depends on your data. I can fine-tune the regex if you have examples where it returns false positives.

The regex will produce two match groups. Group 2 will contain the matched thing, which is most likely a URL. Group 1 will contain either an empty string or an 'href="'. You can use it as an indicator that this match occurred inside the href attribute of an existing link, so you don't have to touch that one.

Once you confirm that this does the right thing for you most of the time (with user-supplied data, you can never be sure), you can do the rest in two steps, as I proposed in the other question:

  1. Make a link around every URL there is (unless there is something in match group 1!) This will produce double nested <a> tags for things that have a link already.
  2. Scan for incorrectly nested <a> tags, removing the innermost one
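
A minimal Python sketch of step 1 only, assuming the pattern from this answer with literal < and > in its character classes. The names CANDIDATE_RE and link_bare_urls, the sample string, and the http:// prefix added to scheme-less matches are illustrative assumptions, not part of the answer.

```python
import re

# Pattern from the answer, applied case-insensitively as suggested.
CANDIDATE_RE = re.compile(
    r'''(href\s*=\s*['"]?)?'''
    r'''((?:http://|ftp://|mailto:)?[^.,<>"'\s\r\n\t]+'''
    r'''(?:\.(?![.<>"'\s\r\n])[^.,!<>"'\s\r\n\t]+)+)''',
    re.IGNORECASE,
)

def link_bare_urls(html: str) -> str:
    """Step 1: wrap every candidate that is not already an href value."""
    def wrap(m: re.Match) -> str:
        if m.group(1):  # group 1 matched 'href=': existing link, leave as is
            return m.group(0)
        url = m.group(2)
        # Assumption for this sketch: scheme-less matches get http:// in the href.
        has_scheme = url.lower().startswith(('http://', 'ftp://', 'mailto:'))
        href = url if has_scheme else 'http://' + url
        return f'<a href="{href}">{url}</a>'
    return CANDIDATE_RE.sub(wrap, html)

sample = 'Visit www.example.com or <a href="http://foo.org">foo</a>'
linked = link_bare_urls(sample)
print(linked)
```

Step 2 (removing the inner tag of any nested pair) is left out here. It is still needed whenever the visible text of an existing link is itself a URL, because that text is not preceded by href= and gets wrapped again.
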
Tomalak
A: 

To skip existing ones just use a look-behind - add (?<!href=") to the beginning of your regular expression, so it would look something like this:

/(?<!href=")http:\/\/\S*/

Obviously this isn't a complete solution for finding all types of URLs, but this should solve your problem of messing with existing ones.
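
For what it's worth, the same look-behind can be tried out in Python, which needs no / delimiters. This is only a sketch: BARE_HTTP_RE, sample and found are made-up names, and the input is invented.

```python
import re

# href=" has a fixed width, which Python's re module requires
# for look-behind assertions.
BARE_HTTP_RE = re.compile(r'(?<!href=")http://\S*')

# Hypothetical input: one existing link and one bare URL.
sample = '<a href="http://example.com/">home</a> and http://example.org/page'

found = BARE_HTTP_RE.findall(sample)
print(found)  # ['http://example.org/page']
```

Because \S* runs up to the next whitespace, a bare URL followed directly by markup or punctuation would drag those characters along.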

Renesis
A: 

I made a slight modification to the Regex contained in the original answer:

(?<![.*">])\b(?:(?:https?|ftp|file)://|[a-z]\.)[-A-Z0-9+&#/%=~_|$?!:,.]*[A-Z0-9+&#/%=~_|$]

which allows for more subdomains and also runs a fuller check on tags. To apply this with PHP's preg_replace, you can use:

$convertedText = preg_replace( '@(?<![.*">])\b(?:(?:https?|ftp|file)://|[a-z]\.)[-A-Z0-9+&#/%=~_|$?!:,.]*[A-Z0-9+&#/%=~_|$]@i', '<a href="\0" target="_blank">\0</a>', $originalText );

Note that I removed @ from the character classes in the regex so that it can serve as the delimiter for preg_replace. It's pretty rare that @ would be used in a URL anyway.

Obviously, you can modify the replacement text, and remove target="_blank", or add rel="nofollow" etc.
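
For readers not on PHP, a rough Python analogue of that call might look like the sketch below. The names HODGE_RE, convert and extra_attrs are invented for this example; \g<0> plays the role of PHP's \0, and the trailing i flag becomes re.IGNORECASE.

```python
import re

# Same pattern as in the preg_replace call, without the @ delimiters.
HODGE_RE = re.compile(
    r'(?<![.*">])\b(?:(?:https?|ftp|file)://|[a-z]\.)'
    r'[-A-Z0-9+&#/%=~_|$?!:,.]*[A-Z0-9+&#/%=~_|$]',
    re.IGNORECASE,
)

def convert(text: str, extra_attrs: str = ' target="_blank"') -> str:
    # \g<0> inserts the whole match, like \0 in the PHP replacement string.
    return HODGE_RE.sub(rf'<a href="\g<0>"{extra_attrs}>\g<0></a>', text)

# Hypothetical input: one bare URL and one existing link.
sample = 'Docs: http://example.com/guide and <a href="http://example.com/x">x</a>'
converted = convert(sample)
nofollow = convert(sample, ' rel="nofollow"')
print(converted)
print(nofollow)
```

The sample sticks to a URL with an explicit scheme. As written, the [a-z]\. alternative only applies when the first label is a single letter (a.example.com, say), so a bare www.example.com would not be matched by this pattern.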

Hope that helps.

Hodge
I've added an = to the (?<![.*">]) at the start to not break <a href=http://url/>link</a> (non-quoted anchor tags). Nice regex btw :)
Joel
A: 

Does anyone have any of those regexes in JavaScript? I've been unable to translate the ones provided.

Thanks in advance