Hi,

I'm currently using the following regular expression to validate URLs:

^(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/|~\/|\/)?(?#Username:Password)(?:\w+:\w+@)?  (?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevel Domains)(?:com|org|net|gov|mil|biz|edu|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?#Port)(?::[\d]{1,5})?(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?$

I borrowed this from somewhere on the web (don't remember where) to improve upon this:

^((https?|file|ftp|gopher|news|nntp):\/\/)([a-z]([a-z0-9\-]*\.)+([a-z]{2}|aero|arpa|biz|com|coop|edu|gov|info|int|jobs|mil|museum|name|nato|net|org|pro|travel)|(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]))(\/[a-z0-9_\-\.~]+)*(\/([a-z0-9_\-\.]*)(\?[a-z0-9+_\-\.%=&]*)?)?(#[a-z][a-z0-9_]*)?$

However, neither of these is capable of validating this URL (which should be valid):

http://somedomain.com/users/1234/images/Staff%20Photos%202008/FirstName%20LastName_1%20(Small).jpg

The problem is the %20 and the round brackets (). Try as I might, I couldn't get either of the regexes above to validate this URL without breaking something else. I'm not experienced with writing fancy regular expressions, so that doesn't help either. All the other regexes I've found on the web fail on silly things such as accepting this:

http://www.test..com

Help would be appreciated.

+2  A: 

You're validating two things with the same regular expression:

  • Well formed -- Is it syntactically correct?
  • Plausible -- Are the protocol and top-level domain plausible?

Separating these validations may be fruitful. You can use this regular expression to check that the URI is well-formed. It's from RFC 3986, Uniform Resource Identifiers (URI): Generic Syntax, appendix B (p. 50):

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

If the URI matches this regular expression, it's well formed. The match groups give you the various pieces, which are:

scheme    = $2
authority = $4
path      = $5
query     = $7
fragment  = $9

Let's see what comes out of the sample URI you gave:

2 (scheme)   : "http"
4 (authority): "somedomain.com"
5 (path)     : "/users/1234/images/Staff%20Photos%202008/FirstName%20LastName_1%20(Small).jpg"
7 (query)    : nil
9 (fragment) : nil
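
For anyone who wants to try this, here is a minimal Python sketch of the same split. The pattern is the RFC 3986 appendix B regex quoted above; the names `URI_RE` and `split_uri` are my own and only illustrative. One caveat: every group in the pattern is optional, so it will match any string you give it. It is better thought of as a splitter than as a strict validator, which is why the per-piece checks still matter.

```python
import re

# The regex from RFC 3986, appendix B, unchanged.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def split_uri(uri):
    """Split a URI into its five generic components (hypothetical helper).

    Components that are absent come back as None.
    """
    m = URI_RE.match(uri)  # always matches, since every group is optional
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

parts = split_uri(
    "http://somedomain.com/users/1234/images/"
    "Staff%20Photos%202008/FirstName%20LastName_1%20(Small).jpg"
)
for name, value in parts.items():
    print(f"{name:10}: {value!r}")
```

The %20 escapes and the round brackets simply land in the path group, because the path is defined as "anything up to `?` or `#`".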

Now that you've got the individual pieces, you can check each one for plausibility. For example, to get the TLD from the authority, apply this regular expression to the authority:

\.([^.]+)$

Group 1 gives you the TLD (com, org, etc.), which you can then check against your list.
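
A rough Python sketch of that plausibility check follows. It uses the pattern `\.([^.]+)$` (note the `+`, so that TLDs longer than one character are captured). The allow-list `KNOWN_TLDS` and the function names are hypothetical; substitute your own list. The empty-label test is my own addition to catch hosts like `www.test..com`, and the sketch assumes the authority is a bare host name with no `user:password@` prefix or `:port` suffix.

```python
import re

# Captures everything after the last dot in the authority.
TLD_RE = re.compile(r"\.([^.]+)$")

# Hypothetical allow-list; replace with the TLDs you actually accept.
KNOWN_TLDS = {"com", "org", "net", "edu", "gov"}

def tld_of(authority):
    """Return the lower-cased TLD, or None if the authority has no dot."""
    m = TLD_RE.search(authority)
    return m.group(1).lower() if m else None

def is_plausible_authority(authority):
    """Plausibility check on a bare host name (assumption: no userinfo or port)."""
    # An empty label means two dots in a row, or a leading or trailing dot.
    if "" in authority.split("."):
        return False
    return tld_of(authority) in KNOWN_TLDS

print(is_plausible_authority("somedomain.com"))
print(is_plausible_authority("www.test..com"))
```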

Wayne Conrad
I'd actually never heard of NOT using a single regex to test for both form and plausibility. This idea is good, but requires a fair bit more work. Do you have a recommended regex for the (path)?
Chris F
I don't think you need an additional regex for the path. For the authority, use the regex I gave just above to extract the TLD and check it against your list (com, org, etc.). Check the scheme against your list (http, ftp, etc.). I wouldn't check too much--just knowing that the URI is well formed has already gotten you most of the benefit; more checking will yield incrementally less benefit at the cost of causing you to reject good URIs, either now or in the future when new TLDs and protocols are introduced.
Wayne Conrad