ansaurus

Question

Need a regex to validate a URL that supports %20 and ()

Answer 1

+2 A:

You're validating two things with the same regular expression:

Well formed -- Is it syntactically correct?
Plausible -- Are the protocol and top-level domain plausible?

Separating these validations may be fruitful. You can use this regular expression to check that the URI is well-formed. It's from RFC 3986, Uniform Resource Identifier (URI): Generic Syntax, appendix B (p. 50):

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

If the URI matches this regular expression, it's well formed. The match groups give you the various pieces, which are:

scheme    = $2
authority = $4
path      = $5
query     = $7
fragment  = $9

Let's see what comes out of the sample URI you gave:

2 (scheme)   : "http"
4 (authority): "somedomain.com"
5 (path)     : "/users/1234/images/Staff%20Photos%202008/FirstName%20LastName_1%20(Small).jpg"
7 (query)    : nil
9 (fragment) : nil
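Here is a minimal Python sketch of that split, in case it helps to see it run. The function name `split_uri` is made up for illustration, and the sample URI is reassembled from the pieces listed above, so treat it as an assumption. Groups that don't participate in the match come back as `None`, which corresponds to the `nil` shown above.

```python
import re

# Regular expression from RFC 3986, appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def split_uri(uri):
    """Return the five URI components; absent ones are None."""
    m = URI_RE.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

# Sample URI rebuilt from the components listed above (an assumption).
sample = ("http://somedomain.com/users/1234/images/"
          "Staff%20Photos%202008/FirstName%20LastName_1%20(Small).jpg")

parts = split_uri(sample)
for name, value in parts.items():
    print(f"{name:9}: {value!r}")
```

Note that every group in the expression is optional, so it will split almost any string into pieces. The pieces are best treated as input to the per-component checks rather than as a final verdict.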

Now that you've got the individual pieces, you can check each one for plausibility. For example, to get the TLD from the authority, apply this regular expression to the authority:

\.([^.]+)$

Group 1 gives you the TLD (com, org, etc.), which you can then check against your list.
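As a rough Python sketch of that check: the pattern here uses `+` so that the whole final label is captured, and `KNOWN_TLDS`, `tld_of`, and `has_known_tld` are hypothetical names with a placeholder list, not anything from the RFC.

```python
import re

# Capture everything after the last dot in the authority.
TLD_RE = re.compile(r"\.([^.]+)$")

# Hypothetical allow-list; substitute your own.
KNOWN_TLDS = {"com", "org", "net"}

def tld_of(authority):
    """Return the lowercased final label of the authority, or None."""
    m = TLD_RE.search(authority)
    return m.group(1).lower() if m else None

def has_known_tld(authority):
    """True if the authority ends in a TLD from the allow-list."""
    return tld_of(authority) in KNOWN_TLDS

print(tld_of("somedomain.com"))
print(has_known_tld("somedomain.com"))
```

This assumes the authority is a bare host name. If it can carry a port or user info (for example `somedomain.com:8080`), you would need to strip those before looking for the TLD.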

Wayne Conrad 2010-01-18 01:57:42

I'd actually never heard of NOT using a single regex to test for both form and plausibility. This idea is good, but requires a fair bit more work. Do you have a recommended regex for the (path)?

Chris F 2010-01-21 22:45:39

I don't think you need an additional regex for the path. For the authority, use the regex I gave just above to extract the TLD and check it against your list (com, org, etc.). Check the scheme against your list (http, ftp, etc.). I wouldn't check too much--just knowing that it's well formed has already gotten you most of the benefit; more checking will yield incrementally less benefit at the cost of causing you to reject good URIs, either now or in the future when new TLDs and protocols are introduced.

Wayne Conrad 2010-01-21 23:51:42
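Putting the two checks from the comment above together might look something like the following Python sketch. `SCHEMES`, `TLDS`, and `is_plausible` are hypothetical names, the lists are placeholders, and the TLD pattern uses `+` to capture the whole final label. The path is deliberately left unchecked.

```python
import re

# RFC 3986, appendix B: split a URI into its components.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")
# Everything after the last dot in the authority.
TLD_RE = re.compile(r"\.([^.]+)$")

# Hypothetical allow-lists; keep them short and easy to extend.
SCHEMES = {"http", "https", "ftp"}
TLDS = {"com", "org", "net"}

def is_plausible(uri):
    """Check only the scheme and the TLD; leave the path alone."""
    m = URI_RE.match(uri)
    scheme, authority = m.group(2), m.group(4)
    if scheme is None or scheme.lower() not in SCHEMES:
        return False
    if not authority:
        return False
    tld = TLD_RE.search(authority)
    return tld is not None and tld.group(1).lower() in TLDS

print(is_plausible("http://somedomain.com/a%20b/(c).jpg"))
```

Keeping the lists as plain sets means a new TLD or protocol is a one-line change, which limits the cost of rejecting good URIs later.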
