Hello

I have basic URL validation in my application. Right now I'm using the following code.

//validates whether the given value is 
//a valid URL
function validateUrl(value)
{
    var regexp = /(ftp|http|https):\/\/(\w+:{0,1}\w*@)?(\S+)(:[0-9]+)?(\/|\/([\w#!:.?+=&%@!\-\/]))?/
    return regexp.test(value);
}

But right now it does not accept URLs without the protocol. For example, if I provide www.google.com, it is rejected. How can I modify the regex to make it accept URLs without a protocol?

+2  A: 

Make the protocol optional with (...)?

/(((ftp|http|https):\/\/)|(\/\/))?(\w+:{0,1}\w*@)?(\S+)(:[0-9]+)?(\/|\/([\w#!:.?+=&%@!\-\/]))?/
hsz
This moves the ftp/http/https to group 2, and doesn't accept `//server` URLs.
Peter Boughton
Look at my edit - now it accepts `protocol://` or `//` or none of them.
hsz
You can also use `(?:...)` to exclude a group from the results.
hsz
That's over-complicating things still, and doesn't work with `http:google.com` either (hence why in my answer I simply used two optional groups). Also the parens wrapping the two sides of the alternation are redundant and just make things messier.
Peter Boughton
+2  A: 

Change the regex to:

/((ftp|http|https):\/\/)?(\w+:{0,1}\w*@)?(\S+)(:[0-9]+)?(\/|\/([\w#!:.?+=&%@!\-\/]))?/
tdammers
As with hsz's answer, this moves the ftp/http/https to group 2, and doesn't accept `//server` URLs.
Peter Boughton
+1  A: 

I am not a regex expert, but wrapping the protocol in another pair of parentheses and putting a question mark after it should make it optional:

function validateUrl(value)
{
    var regexp = /((ftp|http|https):\/\/)?(\w+:{0,1}\w*@)?(\S+)(:[0-9]+)?(\/|\/([\w#!:.?+=&%@!\-\/]))?/
    return regexp.test(value);
} 
Kau-Boy
Again, if this regex was used to capture URL parts, it's creating unnecessary groups, and it's incorrectly combining the `//` with the protocol, which excludes valid URLs.
Peter Boughton
Although //google.com works, it is not a valid URL, and I don't think most people know that it works, so it can be very useful to exclude such URLs from the validation. Just because something is possible does not mean it has to be valid in every form. The double slashes are only a separator, just as the dots are between subdomain, domain, and TLD.
Kau-Boy
The double slashes are the prefix to the path, whilst the colon is the separator with the protocol - they are two distinct parts that just happen to occur together. (This is detailed in "3. URI Syntactic Components" of RFC 2396.) Using //google.com is a valid relative URL (again, see appendix "C.1 Normal Examples" of RFC 2396) and it does occur "in the wild".
Peter Boughton
Your Regex is accepting '@@##$$' as a valid URL. Any ideas?
NLV
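The '@@##$$' result follows from two properties of the regex above: it is not anchored, and `(\S+)` matches any run of non-whitespace characters, so `test` succeeds on almost any non-empty input. Below is a minimal sketch of a stricter variant, not taken from any answer here. The name `validateUrlStrict` and the host pattern (dot-separated labels of letters, digits, and hyphens) are assumptions for illustration, and the sketch deliberately rejects single-label hosts such as `localhost`.

```javascript
// Hypothetical stricter variant of the validator.
// Assumptions: the whole string must match (^...$), and the host is
// limited to at least two dot-separated labels of letters, digits, hyphens.
function validateUrlStrict(value) {
    var regexp = /^(?:(ftp|http|https):\/\/)?(?:\w+(?::\w*)?@)?([a-z0-9-]+(?:\.[a-z0-9-]+)+)(?::[0-9]+)?(?:\/\S*)?$/i;
    return regexp.test(value);
}

// The regex from the answer above, unchanged, for comparison.
function validateUrlLoose(value) {
    var regexp = /((ftp|http|https):\/\/)?(\w+:{0,1}\w*@)?(\S+)(:[0-9]+)?(\/|\/([\w#!:.?+=&%@!\-\/]))?/;
    return regexp.test(value);
}
```

With these definitions, `validateUrlLoose('@@##$$')` returns true while `validateUrlStrict('@@##$$')` returns false, and both accept `www.google.com`.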
A: 

Change the first part to:

(?:(ftp|http|https):)?(?:\/\/)?

The (?:...) will group content without using capturing groups (so the actual protocol remains in first group).

Note how the protocol: and // parts are individually optional - since //www.google.com is a valid (relative) URL.
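A minimal sketch of how this could look once dropped into the question's regex, assuming the remaining parts of that regex are kept unchanged. The helper name `protocolOf` is hypothetical and only serves to show that the protocol stays in group 1.

```javascript
// Sketch: the question's regex with only the first part replaced.
// Assumption: everything after the optional "//" is kept as in the question.
var regexp = /(?:(ftp|http|https):)?(?:\/\/)?(\w+:{0,1}\w*@)?(\S+)(:[0-9]+)?(\/|\/([\w#!:.?+=&%@!\-\/]))?/;

// Hypothetical helper: returns the captured protocol (group 1),
// undefined when no protocol was given, or null when nothing matched.
function protocolOf(value) {
    var match = regexp.exec(value);
    return match ? match[1] : null;
}
```

Here `protocolOf('https://www.google.com')` gives `'https'`, while `protocolOf('//www.google.com')` and `protocolOf('www.google.com')` both give `undefined`, since the match succeeds without a protocol.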

Peter Boughton
The colon does not belong to the protocol: http://tools.ietf.org/html/rfc2396
Kau-Boy
Not clear what you're saying there, and that's a long document - can you refer to the specific section you're referring to? I tried (for example) `://google.com` in Chrome and IE and it doesn't work, although it looks like Firefox accepts it.
Peter Boughton
The scheme section includes only the name of the protocol (like 'http', 'ftp') but not the colon. So even your regex doesn't split up all groups correctly. But as NLV only wanted a validation regex for valid and common (and not merely working) URLs, there is no need to use a group around the slashes.
Kau-Boy
The inner group captures the value of `http` or `ftp` or whatever, the outer group (where the colon is) is non-capturing, and is necessary to make the whole thing optional. Similarly, the non-capturing group around the slashes is required to make the whole thing optional (it could use `\/{0,2}` but that would allow `/google.com` which might not be desired).
Peter Boughton
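The difference between `(?:\/\/)?` and `\/{0,2}` described in the comment above can be checked with a small sketch. The two patterns below are simplified, anchored stand-ins written for this comparison (the host part `[a-z0-9.-]+` is an assumption), not the full validation regex.

```javascript
// Simplified, anchored stand-ins for comparison only.
// Assumption: the host is reduced to letters, digits, dots and hyphens.
var pairOnly = /^(?:\/\/)?[a-z0-9.-]+$/i;   // exactly zero or two slashes
var upToTwo = /^\/{0,2}[a-z0-9.-]+$/i;      // zero, one, or two slashes

function compareSlashHandling(value) {
    return { pairOnly: pairOnly.test(value), upToTwo: upToTwo.test(value) };
}
```

Both patterns accept `google.com` and `//google.com`, but only the `\/{0,2}` form accepts `/google.com`.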
+2  A: 

Here's a big long regex for matching a URL:

(?i)\b((?:(?:[a-z][\w-]+:)?(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))

The expanded version of that (to help make it understandable):

(?xi)
\b
(                           # Capture 1: entire matched URL
  (?:
    (?:[a-z][\w-]+:)?                # URL protocol and colon
    (?:
      /{1,3}                        # 1-3 slashes
      |                             #   or
      [a-z0-9%]                     # Single letter or digit or '%'
                                    # (Trying not to match e.g. "URI::Escape")
    )
    |                           #   or
    www\d{0,3}[.]               # "www.", "www1.", "www2." … "www999."
    |                           #   or
    [a-z0-9.\-]+[.][a-z]{2,4}/  # looks like domain name followed by a slash
  )
  (?:                           # One or more:
    [^\s()<>]+                      # Run of non-space, non-()<>
    |                               #   or
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
  )+
  (?:                           # End with:
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
    |                                   #   or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]        # not a space or one of these punct chars
  )
)

These both come from this page, but are modified slightly to make the protocol properly optional. You should read that page to help understand what it's doing; it also has a variant which only matches web-based URLs, which you may want to take a look at too.

Peter Boughton
Thanks for your effort. Let me do a check on it.
NLV