ansaurus

Question

How do I write a regular expression for a URL without the scheme?

Answer 1

A:

My guess is

/^[\p{Alnum}-]+(\.[\p{Alnum}-]+)+$/

In more primitive RE syntax that would be

/^[0-9A-Za-z-]+(\.[0-9A-Za-z-]+)+$/

Or even more primitive still:

/^[0-9A-Za-z-][0-9A-Za-z-]*\.[0-9A-Za-z-][0-9A-Za-z-]*(\.[0-9A-Za-z-][0-9A-Za-z-]*)*$/
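As a quick sanity check, the second pattern can be tried out in Python. This is only a minimal sketch: the name HOST_RE, the helper looks_like_host and the sample strings are made up for illustration, and the ASCII form is used because Python's built-in re module does not support \p{Alnum}.

```python
import re

# ASCII form of the pattern above (hypothetical name HOST_RE).
HOST_RE = re.compile(r'^[0-9A-Za-z-]+(\.[0-9A-Za-z-]+)+$')

def looks_like_host(text):
    """Return True if text is two or more dot-separated labels
    made of ASCII letters, digits and hyphens."""
    return HOST_RE.match(text) is not None

# Illustrative inputs only; they are not taken from the question.
for sample in ['www.foo.com', 'foo', 'http://www.foo.com', 'www.foo.com/path']:
    print(sample, looks_like_host(sample))
```

Note that the pattern requires at least one dot, and that it also accepts labels that begin or end with a hyphen (for example -foo.com), which real host names do not allow.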

Axeman 2009-06-09 16:12:06

Answer 2

A:

URL syntax is quite complex, so you need to narrow it down a bit. If matching anything.ext is enough, you can use:

^[a-zA-Z0-9.]+\.[a-zA-Z]{2,4}$

soulmerge 2009-06-09 16:18:03

This doesn't work for the example input, which has more than one dot.

brian d foy 2009-06-09 16:44:41

Answer 3

+4 A:

^[A-Za-z0-9][A-Za-z0-9.-]+(:\d+)?(/.*)?$

string must start with an ASCII letter or number
ASCII letters, numbers, dots and dashes follow (no slashes or colons allowed)
optional: a port is allowed (":8080")
optional: anything after a slash may follow (since you said "URL")
then the end of the string

Thoughts:

no line breaks allowed
no validity or sanity checking
no support for "internationalized domain names" (IDNs)
leave off the "optional:" parts if you like, but be sure to include the final "$"

If your regex flavor supports it, you can shorten the above to:

^[A-Za-z\d][\w.-]+(:\d+)?(/.*)?$

Be aware that \w may include Unicode characters in some regex flavors. Also, \w includes the underscore, which is invalid in host names. An explicit approach like the first one would be safer.
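To make the \w caveat concrete, here is a small sketch in Python 3, where \w on str patterns matches Unicode word characters as well as the underscore. The names EXPLICIT and SHORT and the sample inputs are assumptions chosen for illustration.

```python
import re

# The explicit ASCII pattern and the shortened \w variant from above.
EXPLICIT = re.compile(r'^[A-Za-z0-9][A-Za-z0-9.-]+(:\d+)?(/.*)?$')
SHORT = re.compile(r'^[A-Za-z\d][\w.-]+(:\d+)?(/.*)?$')

samples = [
    'www.foo.com:8080/index.html',  # host, port and path
    'my_host.example.com',          # underscore, invalid in host names
    'b\u00fccher.example.com',      # non-ASCII letter (u with umlaut)
    'http://www.foo.com',           # has a scheme, should be rejected
]

for candidate in samples:
    # Print whether each pattern accepts the candidate.
    print(candidate, bool(EXPLICIT.match(candidate)), bool(SHORT.match(candidate)))
```

In this sketch both patterns accept the first sample and reject the last one, but only the \w variant lets the underscore and the non-ASCII letter through.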

Tomalak 2009-06-09 16:18:06

Actually the RFC has a domainlabel as the possible first token, defined as alphanum. [ domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum ]. So it looks like your first bullet point is wrong.

Axeman 2009-06-09 18:29:34

To me that reads as "(alphanum) or (alphanum plus (any number of alphanum or '-') plus alphanum)". What am I missing here?

Tomalak 2009-06-09 18:48:49

Your regex starts with alpha, excluding a legal num in the first position. That's what Axeman is talking about.

brian d foy 2009-06-09 20:12:57

Oh, I see. My bad, I was too fixated on the hyphen to notice. :-) Corrected, thanks for pointing it out.

Tomalak 2009-06-09 20:17:38

Answer 4

+1 A:

If you're trying to do this for some real code, find the URL parsing library for your language and use that. If you don't want to use it, look inside to see what it does.

The thing that you are calling "resource" is known as a "scheme". It's documented in RFC 1738 which says:

[2.1] ... In general, URLs are written as follows:

   <scheme>:<scheme-specific-part>

A URL contains the name of the scheme being used (<scheme>) followed by a colon and then a string (the <scheme-specific-part>) whose interpretation depends on the scheme.

And, later in the BNF,

scheme = 1*[ lowalpha | digit | "+" | "-" | "." ]

So, if a scheme is there, you can match it with:

/^[a-z0-9+.-]+:/i

If that matches, the string starts with what the URL syntax considers a scheme, so it fails your no-scheme validation. If you have strings with port numbers, like www.example.com:80, then things get messy, because the host name before the colon also looks like a scheme. In practice, I haven't dealt with schemes containing - or ., so you might add a real-world fudge to get around that until you decide to use a proper library.
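The port problem is easy to reproduce. The following Python sketch uses the scheme pattern from this answer plus one possible fudge that drops - and . from the allowed scheme characters; the names SCHEME_RE, FUDGED_RE and the sample strings are assumptions for illustration, not part of any RFC or library.

```python
import re

# Scheme pattern as given above, case-insensitive.
SCHEME_RE = re.compile(r'^[a-z0-9+.-]+:', re.IGNORECASE)

# One possible fudge: no '-' or '.' allowed in the scheme.
FUDGED_RE = re.compile(r'^[a-z0-9+]+:', re.IGNORECASE)

def has_scheme(text, pattern=SCHEME_RE):
    """Return True if text starts with something that looks like 'scheme:'."""
    return pattern.match(text) is not None

for sample in ['http://www.foo.com', 'www.foo.com', 'www.example.com:80', 'localhost:8080']:
    print(sample, has_scheme(sample), has_scheme(sample, FUDGED_RE))
```

With the strict pattern, www.example.com:80 is reported as having a scheme. The fudged pattern avoids that case, but a dotless host with a port such as localhost:8080 is still flagged, so this is a workaround rather than a fix.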

Anything beyond that, like checking for existing and reachable domains and so on, is better left to a library that's already figured it all out.
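As one example of leaning on a library, Python's standard urllib.parse can split a scheme-less string into host, port and path if you prefix it with // so that it is parsed as a network location. This is a sketch under the assumption that the input has already been checked to have no scheme; the helper name split_schemeless is hypothetical.

```python
from urllib.parse import urlsplit

def split_schemeless(text):
    """Parse text as the part of a URL that follows 'scheme:'.

    Assumes text has no scheme; the '//' prefix makes urlsplit
    treat the leading part as the network location.
    """
    parts = urlsplit('//' + text)
    # parts.port may raise ValueError if the port is not a valid number.
    return parts.hostname, parts.port, parts.path

print(split_schemeless('www.example.com:80/index.html'))
print(split_schemeless('www.foo.com'))
```

This only splits the string into components; it does not check that the host name is well-formed or that the domain exists.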

brian d foy 2009-06-09 16:39:39

Answer 5

A:

Thanks guys, I think I have a Python and a PHP solution. Here they are:

Python Solution:

import re

url = 'http://www.foo.com'
p = re.compile(r'^(?!http(s)?://$)[A-Za-z][A-Za-z0-9.-]+(:\d+)?(/.*)?$')
m = p.search(url)
print(m)     # m is a match object if url is valid, otherwise None

PHP Solution:

$url = 'http://www.foo.com';
// preg_match() returns 1 if $url is valid, 0 otherwise
$valid = preg_match('/^(?!http(s)?:\/\/$)[A-Za-z][A-Za-z0-9\.\-]+(:\d+)?(\/.*)?$/', $url);

Thierry Lam 2009-06-09 18:06:35

Now what happens when you have https://?

brian d foy 2009-06-09 18:26:59

The URL will still be invalid, but if you insist, I can still handle it.

Thierry Lam 2009-06-09 18:35:37
