ansaurus

Question

A URL that contains all valid characters to test my regex pattern?

Answer 1

The pattern you linked to does indeed match a lot of URLs, both valid and invalid. It's not really a surprise since nearly everything in that regex is optional; as you wrote yourself, it even matches bit.ly, so it's easy to see how it would match lots of non-URL stuff.

It doesn't take new Unicode domain names into account, for one (e.g., http://www.müller.de).

It doesn't match valid URLs like

http://msdn.microsoft.com/en-us/library/aa752574(VS.85).aspx

It doesn't match relative paths (might not be necessary, though) like /cgi-bin/version.pl.

It doesn't match mailto: links.

It doesn't match URLs like http://1.2.3.4. Don't even ask about IPv6 :)
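The host-related cases above (Unicode domain names, IPv4, IPv6) can be handled with Python's standard library instead of a pattern. This is only a minimal sketch: `classify_host` and `to_ascii` are hypothetical helper names, not part of the pattern under discussion, and the built-in `idna` codec implements the older IDNA 2003 rules.

```python
import ipaddress


def classify_host(host: str) -> str:
    """Return 'ipv4', 'ipv6', or 'name' for a bare host string (hypothetical helper)."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not a syntactically valid IP literal, so treat it as a host name.
        return "name"
    return "ipv4" if ip.version == 4 else "ipv6"


def to_ascii(host: str) -> str:
    """Convert a Unicode host name to its ASCII (Punycode) form via the idna codec."""
    return host.encode("idna").decode("ascii")


if __name__ == "__main__":
    for host in ["1.2.3.4", "999.999.999.999", "::1", "www.müller.de"]:
        print(host, "->", classify_host(host))
    print(to_ascii("www.müller.de"))
```

Unlike a digit-counting pattern, `ipaddress.ip_address` rejects out-of-range octets such as 999.999.999.999, so that string falls through to the host-name branch.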

All in all, regular expressions are NOT the right tool to reliably match or validate URLs. This is a job for a parser. If you can live with many false positive and false negative matches, then regexes are fine.
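As a rough illustration of the parser approach, here is a sketch using Python's `urllib.parse.urlsplit` on the examples from this answer. The `describe` helper and the mailto address are made up for illustration. Note that `urlsplit` splits a string into components leniently; it does not by itself decide whether the string is a valid URL.

```python
from urllib.parse import urlsplit

# Sample strings taken from the answer above; the mailto address is a placeholder.
samples = [
    "http://msdn.microsoft.com/en-us/library/aa752574(VS.85).aspx",
    "/cgi-bin/version.pl",
    "mailto:someone@example.com",
    "http://1.2.3.4",
    "bit.ly",
]


def describe(url: str) -> tuple:
    """Return (scheme, hostname, path) for a URL string (hypothetical helper)."""
    parts = urlsplit(url)
    return parts.scheme, parts.hostname, parts.path


if __name__ == "__main__":
    for url in samples:
        print(url, "->", describe(url))
```

A scheme-less string such as bit.ly comes back with an empty scheme and no hostname, with the whole string in the path, so detecting such shorthand in free text still needs a separate heuristic.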

Please read Jan Goyvaerts' excellent essay on this subject: Detecting URLs in a block of text.

Tim Pietzcker 2010-09-21 06:19:46

First of all, thanks for the examples. I know the regex is not perfect; indeed, I had to fix some of the conditions. On the Unicode domains, which characters are valid? The whole character table? On the MSDN URL, the regex was missing parentheses; that was the kind of URL I wanted to test. I'm not sure I want to find mailto:, but thanks for pointing it out. With my fix it does find 0.0.0.0 to 999.999.999.999; of course, that will always give false positives (think phone numbers). Should I support IPv6 yet? Not sure. Regards!

cronocr 2010-09-21 15:06:10

One more thing: I checked the link to Jan's essay, but I'd better stick with my regex. First, some of the regexes didn't work for me when testing with http://www.spaweditor.com/scripts/regex/index.php; I don't know whether that is because the samples are not escaped for Perl/PHP syntax. So I just looked at their structure, and it seems they will find a wider range of strings. Of course, they rely on the scheme always being present, but I want the regex to detect "bit.ly" and the like. I'm mostly interested in usefulness for the user. Think of automatic correction in Word.

cronocr 2010-09-21 15:25:39

I'm looking into Unicode and found this: "The main thing here is that there are a number of characters in Unicode, known as homographs, that visually look the same, e.g. an ASCII 'C' looks like the Cyrillic 'C' for instance, so the attack still works even without resorting to devious encodings." So I'd better drop Unicode to gain security for the user; that would mean more "usefulness".

cronocr 2010-09-21 15:43:35
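The homograph issue quoted in the last comment can be shown with Python's `unicodedata` module. This is only a sketch: `char_names` and `is_ascii_only` are hypothetical helpers, and rejecting every non-ASCII host is the blunt policy the comment proposes, not a general recommendation.

```python
import unicodedata

latin_c = "C"           # U+0043
cyrillic_es = "\u0421"  # U+0421, renders almost identically to the Latin C


def char_names(text: str) -> list:
    """Return the official Unicode name of each character in text."""
    return [unicodedata.name(ch) for ch in text]


def is_ascii_only(host: str) -> bool:
    """True if the host contains only ASCII characters (requires Python 3.7+)."""
    return host.isascii()


if __name__ == "__main__":
    print(char_names(latin_c + cyrillic_es))
    print(is_ascii_only("bit.ly"))
    print(is_ascii_only(cyrillic_es + "isco.com"))
```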
