Hello,

First of all, I created my own regex to find all URLs in a text, because:

  1. When I searched SO and Google, I only found regexes for specific kinds of URLs, such as image links.
  2. I found a fairly complete regex in the PHP manual itself (see the "splattermania at freenet dot de 01-Oct-2009 12:01" post at http://php.net/manual/en/function.preg-match.php) that can find almost anything that resembles a URL, even something as short as "bit.ly".
  3. That pattern has a few errors and limitations, so I'm fixing and extending it.

Now the pattern structure seems right, but I'm not sure all valid characters are covered. Please post sample URLs to test my pattern against. It might be laziness, but I don't want to read pages and pages of references to find them all; I need to focus on development. If you have a summary of the valid characters for username, password, path, query and anchor that you can share, that would be very helpful.

Best Regards!

A: 

The pattern you linked to does indeed match a lot of URLs, both valid and invalid. It's not really a surprise since nearly everything in that regex is optional; as you wrote yourself, it even matches bit.ly, so it's easy to see how it would match lots of non-URL stuff.

It doesn't take new Unicode domain names into account, for one (e.g., http://www.müller.de).

It doesn't match valid URLs like

http://msdn.microsoft.com/en-us/library/aa752574(VS.85).aspx

It doesn't match relative paths (might not be necessary, though) like /cgi-bin/version.pl.

It doesn't match mailto: links.

It doesn't match URLs like http://1.2.3.4. Don't even ask about IPv6 :)
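The IP-address case shows the limit of shape-based matching: a pattern can recognise four dot-separated digit groups, but not whether they form a real address. A minimal sketch, assuming Python and its standard `ipaddress` module; `classify_host` is a hypothetical helper that expects an already-extracted host string, not a whole URL.

```python
import ipaddress


def classify_host(host: str) -> str:
    """Say whether a host string is an IPv4 literal, an IPv6 literal, or neither."""
    # URLs wrap IPv6 literals in square brackets; strip them before parsing.
    if host.startswith("[") and host.endswith("]"):
        host = host[1:-1]
    try:
        version = ipaddress.ip_address(host).version
    except ValueError:
        # Covers host names as well as out-of-range "addresses".
        return "not an IP literal"
    return f"IPv{version}"


for sample in ("1.2.3.4", "999.999.999.999", "[2001:db8::1]", "bit.ly"):
    print(sample, "->", classify_host(sample))
```

The range check happens in the parser rather than in the pattern, which keeps the pattern itself simple.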

All in all, regular expressions are NOT the right tool to reliably match or validate URLs. This is a job for a parser. If you can live with many false positive and false negative matches, then regexes are fine.
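As a rough illustration of the parser approach, the sketch below splits a text on whitespace and lets Python's standard `urllib.parse.urlsplit` decide whether each token has a scheme and a host. The helper names (`looks_like_url`, `find_urls`), the set of allowed schemes and the trailing-punctuation stripping are assumptions made for this example. Because it requires a scheme, it deliberately skips bare hosts such as bit.ly.

```python
from urllib.parse import urlsplit

# Assumption: only these schemes count as URLs worth detecting.
ALLOWED_SCHEMES = {"http", "https", "ftp"}


def looks_like_url(token: str) -> bool:
    """Return True if the token parses with an allowed scheme and a host."""
    try:
        parts = urlsplit(token)
    except ValueError:
        # urlsplit rejects some malformed inputs, e.g. unbalanced brackets.
        return False
    return parts.scheme in ALLOWED_SCHEMES and bool(parts.hostname)


def find_urls(text: str) -> list:
    """Collect whitespace-separated tokens that parse as URLs."""
    found = []
    for token in text.split():
        # Drop sentence punctuation that tends to trail a URL in running text.
        token = token.rstrip(".,;:!?")
        if looks_like_url(token):
            found.append(token)
    return found


sample = (
    "See http://msdn.microsoft.com/en-us/library/aa752574(VS.85).aspx "
    "and bit.ly, or http://1.2.3.4."
)
print(find_urls(sample))
```

This trades recall for precision: it will not auto-detect scheme-less hosts, so it suits validation better than the "auto-correct" style of detection.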

Please read Jan Goyvaerts' excellent essay on this subject: Detecting URLs in a block of text.

Tim Pietzcker
First of all, thanks for the examples. I know the regex is not perfect; indeed, I had to fix some of the conditions. On the Unicode domains, which characters are valid? The whole character table? On the MSDN URL, the regex was missing parentheses; that was exactly the kind of URL I wanted to test. I'm not sure I want to find mailto: links, but thanks for pointing it out. With my fix it does find 0.0.0.0 to 999.999.999.999; of course that will always give false positives (think phone numbers). Should I support IPv6 yet? Not sure. Regards!
cronocr
One more thing: I checked the link to Jan's essay, but I'd rather stick with my regex. First, I don't know whether the samples are unescaped for Perl/PHP syntax, but some of the regexes didn't work for me when testing with http://www.spaweditor.com/scripts/regex/index.php. So I just looked at their structure, and it seems they will find a wider range of strings. However, they rely on the scheme always being present in the URL, and I want the regex to detect "bit.ly" and the like. I'm mostly interested in usefulness for the user. Think of automatic correction in Word.
cronocr
I've been looking into Unicode and found this: "The main thing here is that there are a number of characters in Unicode, known as homographs, that visually look the same, e.g. an ASCII 'C' looks like the Cyrillic 'C' for instance, so the attack still works even without resorting to devious encodings." So I'd better drop Unicode to gain security for the user, which would mean more "usefulness".
cronocr
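The homograph concern in the last comment can be made visible by converting a host name to its ASCII (Punycode) form, where look-alike characters no longer look alike. A minimal sketch, assuming Python and its built-in `idna` codec (which implements the older IDNA 2003 rules); `to_ascii_host` is a hypothetical helper, and the second sample host uses a Cyrillic "е" (U+0435) in place of the Latin "e".

```python
def to_ascii_host(host: str) -> str:
    """Return the ASCII (Punycode) form of a host name."""
    # The built-in "idna" codec encodes each dot-separated label separately.
    return host.encode("idna").decode("ascii")


samples = [
    "www.müller.de",      # legitimate internationalized domain
    "exampl\u0435.com",   # Cyrillic "е" posing as Latin "e"
    "bit.ly",             # plain ASCII, returned unchanged
]
for host in samples:
    flag = "ASCII" if host.isascii() else "non-ASCII"
    print(f"{host!r} ({flag}) -> {to_ascii_host(host)}")
```

Showing or logging the ASCII form is one alternative to dropping Unicode hosts entirely, since the spoofed name no longer resembles the real one.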