How can I write a regular expression that validates URLs without the scheme?

Pass:

  • www.example.com
  • example.com

Fail:

A: 

My guess is

/^[\p{Alnum}-]+(\.[\p{Alnum}-]+)+$/

In more primitive RE syntax that would be

/^[0-9A-Za-z-]+(\.[0-9A-Za-z-]+)+$/

Or even more primitive still:

/^[0-9A-Za-z-][0-9A-Za-z-]*\.[0-9A-Za-z-][0-9A-Za-z-]*(\.[0-9A-Za-z-][0-9A-Za-z-]*)*$/
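A quick way to sanity-check the ASCII version is to run it against the examples from the question. This is only a sketch in Python; the names HOST_RE and looks_like_host, and the two failing inputs, are my own assumptions and not part of the question.

```python
import re

# ASCII-only pattern from above: two or more dot-separated labels
# made of letters, digits and hyphens, and nothing else.
HOST_RE = re.compile(r'^[0-9A-Za-z-]+(\.[0-9A-Za-z-]+)+$')

def looks_like_host(text):
    """Return True if text consists only of dot-separated labels."""
    return HOST_RE.match(text) is not None

# The first two come from the question; the last two are assumed failures.
for candidate in ['www.example.com', 'example.com',
                  'http://www.example.com', 'example']:
    print(candidate, looks_like_host(candidate))
```

Note that in Python `$` also matches just before a trailing newline, so `re.fullmatch` may be the safer call if that matters for your input.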
Axeman
A: 

URL syntax is quite complex, so you need to narrow it down a bit. If matching anything.ext is enough, you can use:

^[a-zA-Z0-9.]+\.[a-zA-Z]{2,4}$
soulmerge
This doesn't work for the example input, which has more than one dot.
brian d foy
+4  A: 
^[A-Za-z0-9][A-Za-z0-9.-]+(:\d+)?(/.*)?$
  • string must start with an ASCII letter or number
  • ASCII letters, numbers, dots and dashes follow (no slashes or colons allowed)
  • optional: a port is allowed (":8080")
  • optional: anything after a slash may follow (since you said "URL")
  • then the end of the string

Thoughts:

  • no line breaks allowed
  • no validity or sanity checking
  • no support for "internationalized domain names" (IDNs)
  • leave off the "optional:" parts if you like, but be sure to include the final "$"

If your regex flavor supports it, you can shorten the above to:

^[A-Za-z\d][\w.-]+(:\d+)?(/.*)?$

Be aware that \w may include Unicode characters in some regex flavors. Also, \w includes the underscore, which is invalid in host names. An explicit approach like the first one would be safer.
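To illustrate the underscore and Unicode caveats, here is a small Python sketch that runs both patterns side by side. The names EXPLICIT and SHORT and the sample inputs are assumptions for illustration only; Python 3 happens to be one of the flavors where \w matches Unicode letters by default.

```python
import re

# The explicit pattern and the shortened \w variant from above.
EXPLICIT = re.compile(r'^[A-Za-z0-9][A-Za-z0-9.-]+(:\d+)?(/.*)?$')
SHORT = re.compile(r'^[A-Za-z\d][\w.-]+(:\d+)?(/.*)?$')

samples = [
    'www.example.com',            # plain host
    'example.com:8080/path?x=1',  # host with port and path
    'my_host.example.com',        # underscore: only \w lets it through
    'exämple.com',                # non-ASCII letter: only \w lets it through
    'http://www.example.com',     # scheme present: neither pattern matches
]

for s in samples:
    print(s, bool(EXPLICIT.match(s)), bool(SHORT.match(s)))
```

Passing `re.ASCII` to the shortened pattern would rule out the non-ASCII case, but the underscore would still be accepted.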

Tomalak
Actually the RFC has a domainlabel as the possible first token, defined as alphanum. [ domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum ]. So it looks like your first bullet point is wrong.
Axeman
To me that reads as "(alphanum) or (alphanum plus (any number of alphanum or '-') plus alphanum)". What am I missing here?
Tomalak
Your regex starts with alpha, excluding a legal num in the first position. That's what Axeman is talking about.
brian d foy
Oh, I see. My bad, I was too fixated on the hyphen to notice. :-) Corrected, thanks for pointing it out.
Tomalak
+1  A: 

If you're trying to do this for some real code, find the URL parsing library for your language and use that. If you don't want to use it, look inside to see what it does.
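In Python, for instance, the standard library's urllib.parse can do the splitting. The following is a minimal sketch, assuming the input is meant to be a bare host with an optional port and path. The helper name split_host is hypothetical, and prefixing '//' is just one way to make urlsplit treat the text as a network location.

```python
from urllib.parse import urlsplit

def split_host(text):
    """Split scheme-less text such as 'www.example.com:80/path'
    into (hostname, port, path) using the standard library parser."""
    # A leading '//' tells urlsplit that a network location follows.
    parts = urlsplit('//' + text)
    return parts.hostname, parts.port, parts.path

print(split_host('www.example.com'))
print(split_host('www.example.com:80/index.html'))
```

This only splits the string; it does not by itself reject input that already carries a scheme.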

The thing that you are calling "resource" is known as a "scheme". It's documented in RFC 1738 which says:

[2.1] ... In general, URLs are written as follows:

   <scheme>:<scheme-specific-part>

A URL contains the name of the scheme being used (<scheme>) followed by a colon and then a string (the <scheme-specific-part>) whose interpretation depends on the scheme.

And, later in the BNF,

scheme = 1*[ lowalpha | digit | "+" | "-" | "." ]

So, if a scheme is there, you can match it with:

/^[a-z0-9+.-]+:/i

If that matches, you have what the URL syntax considers a scheme and your validation fails. If you have strings with port numbers, like www.example.com:80, then things get messy. In practice, I haven't dealt with schemes with - or ., so you might add a real world fudge to get around that until you decide to use a proper library.
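The port-number problem is easy to see with a short Python sketch. The names SCHEME_RE and has_scheme and the sample strings are assumptions for illustration.

```python
import re

# Case-insensitive version of the scheme pattern above.
SCHEME_RE = re.compile(r'^[a-z0-9+.-]+:', re.IGNORECASE)

def has_scheme(text):
    """Return True if text starts with something the URL grammar calls a scheme."""
    return SCHEME_RE.match(text) is not None

for s in ['http://www.example.com', 'www.example.com', 'www.example.com:80']:
    print(s, has_scheme(s))
```

The last string is reported as having a scheme, because "www.example.com" followed by a colon is syntactically a legal scheme. That is the messy case.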

Anything beyond that, like checking for existing and reachable domains and so on, is better left to a library that's already figured it all out.

brian d foy
A: 

Thanks guys, I think I have a Python and a PHP solution. Here they are:

Python Solution:

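import re

url = 'http://www.foo.com'
# Reject anything starting with http:// or https://, then require a host
# (first character a letter or digit), an optional port and an optional path.
p = re.compile(r'^(?!https?://)[A-Za-z0-9][A-Za-z0-9.-]+(:\d+)?(/.*)?$')
m = p.search(url)
print(m)    # m is a match object if url is valid, otherwise None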

PHP Solution:

$url = 'http://www.foo.com';
// preg_match returns 1 if $url is valid, 0 otherwise
$valid = preg_match('/^(?!https?:\/\/)[A-Za-z0-9][A-Za-z0-9.-]+(:\d+)?(\/.*)?$/', $url) === 1;
Thierry Lam
Now what happens when you have https://?
brian d foy
The url will still be invalid, but if you insist, I can still handle it.
Thierry Lam