views: 355
answers: 5

Hi, I'm looking for a decent regex to match a URL (a full URL with scheme, domain, path, etc.). I would normally use filter_var, but I can't in this case as I have to support PHP < 5.2!

I've searched the web but can't find anything that I'm confident will be fool-proof, and all I can find on SO is people saying to use filter_var.

Does anybody have a regex that they use for this?

My code (just so you can see what I'm trying to achieve):

function validate_url($url){
    if (function_exists('filter_var')){
        return filter_var($url, FILTER_VALIDATE_URL);
    }
    return preg_match(REGEX_HERE, $url);
}
+1  A: 

I've seen a regex that could actually validate any kind of valid URL but it was two pages long...

You're probably better off parsing the URL with parse_url and then checking that all of your required bits are in order.

Addition: This is a snip of my URL class:

public static function IsUrl($test)
{
    // Reject anything containing whitespace outright.
    if (strpos($test, ' ') !== false)
    {
        return false;
    }
    if (strpos($test, '.') > 1)
    {
        $check = @parse_url($test);
        return is_array($check)
            && isset($check['scheme'])
            && isset($check['host']) && count(explode('.', $check['host'])) > 1;
    }
    return false;
}

It tests the given string and requires some basics in the URL, namely that a scheme is set and the hostname contains a dot.
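
For instance, a quick sketch of how it behaves (the class name here is just a placeholder):

var_dump(Validator::IsUrl('http://example.com/path')); // bool(true)
var_dump(Validator::IsUrl('example.com'));             // bool(false) - no scheme
var_dump(Validator::IsUrl('http://localhost'));        // bool(false) - no dot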

Kris
+1 for parse_url.
Frank Farmer
As stated in the comments above, “This function is not meant to validate the given URL”; that's not the behavior it is intended for. Regex is meant for matching/replacing patterns in a string and is optimized for that, whereas what you're suggesting could potentially involve a lot of logic.
Rowan
-1 for parse_url. It even parses `http://..`, and that isn't a valid URL.
poke
@poke: that's why you have to check if it returns the bits you require. Reading isn't that hard, is it?
Kris
It may not be meant for validating a URL, but it's better than half the URL-validating regexes you'll find out there.
Frank Farmer
A: 
!(https?://)?([-_a-z0-9]+\.)*([-_a-z0-9]+)\.([a-z]{2,4})(/?)(.*)!i

I use this regular expression for validating URLs. So far it hasn't failed me a single time :)
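
A quick usage sketch (note the pattern is unanchored and the scheme is optional, so it will also match a URL embedded in longer text):

$pattern = '!(https?://)?([-_a-z0-9]+\.)*([-_a-z0-9]+)\.([a-z]{2,4})(/?)(.*)!i';
var_dump((bool) preg_match($pattern, 'http://example.com/page'));          // true
var_dump((bool) preg_match($pattern, 'no dots here'));                     // false
var_dump((bool) preg_match($pattern, 'some text mentioning example.com')); // true - unanchored, so beware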

bisko
+1  A: 

You could try this one. I haven't tried it myself but it's surely the biggest regexp I've ever seen, haha.

^(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/|~\/|\/)?(?#Username:Password)(?:\w+:\w+@)?(?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevel Domains)(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?#Port)(?::[\d]{1,5})?(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d]{2})+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d]{2})+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?$
aefxx
I'll give it a go. (I just tried to paste a huge email regex I use in here but it was bigger than the allocated 600 characters :S)
Rowan
Just a note, the regexp can be dangerous when dealing with custom TLDs.
poke
Could you elaborate @poke?
Rowan
What poke means is that if the TLD you're using isn't whitelisted in the regex, it will fail the URL. So if you forget to allow for a .tv domain name, all .tv domain names will be disallowed. That's only true if you actually use a TLD whitelist though (which this regex DOES seem to do, but it also allows for any 2-char TLD).
Kris
Ah OK, I will probably modify it slightly then. If I just match a-z\. rather than testing against a list, then I know that I'll catch everything. I'm not too fussed about an invalid TLD.
Rowan
+1  A: 

I have created a solution for validating the domain. While it does not specifically cover the entire URL, it is very detailed and specific. The question you need to ask yourself is, "Why am I validating a domain?" If it is to see if the domain actually could exist, then you need to confirm the domain (including valid TLDs). The problem is, too many developers take the shortcut of ([a-z]{2,4}) and call it good. If you think along these lines, then why call it URL validation? It's not. It's just passing the URL through a regex.

I have an open source class that will allow you to validate the domain not only using the single source for TLD management (iana.org), but that will also validate the domain via DNS records to make sure it actually exists. The DNS validation is optional, but the domain will still be specifically validated based on its TLD.

For example: example.ay is NOT a valid domain, as the .ay TLD is invalid. But using the regex posted here ([a-z]{2,4}), it would pass. I have an affinity for quality, and I try to express that in the code I write. Others may not really care. So if you want to simply "check" the URL, you can use the examples listed in these responses. If you actually want to validate the domain in the URL, you can have a look at the class I created to do just that. It can be downloaded at: http://code.google.com/p/blogchuck/source/browse/trunk/domains.php

It validates based on the RFCs that "govern" (using the term loosely) what determines a valid domain. In a nutshell, here are the basic rules of the domain validation the class enforces (a rough sketch of the label and length rules follows the list):

  • must be at least one character long
  • must start with a letter or number
  • contains letters, numbers, and hyphens
  • must end in a letter or number
  • may contain multiple nodes (i.e. node1.node2.node3)
  • each node can only be 63 characters long max
  • total domain name can only be 255 characters long max
  • must end in a valid TLD
  • can be an IPv4 address
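
A rough sketch of just the label and length rules above (this is not the domains.php class itself; the TLD whitelist and DNS checks are left out, and the helper name is made up):

function is_plausible_domain($domain)
{
    // Total length: 1 to 255 characters.
    if (strlen($domain) < 1 || strlen($domain) > 255) {
        return false;
    }
    // Accept a bare IPv4 address (octet range not checked here).
    if (preg_match('/^\d{1,3}(\.\d{1,3}){3}$/', $domain)) {
        return true;
    }
    foreach (explode('.', $domain) as $label) {
        // Each node: 1-63 chars, letters/digits/hyphens,
        // must start and end with a letter or number.
        if (!preg_match('/^[a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?$/i', $label)) {
            return false;
        }
    }
    return true;
}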

It will also download a copy of the master TLD file from iana.org, but only after checking your local copy. If your local copy is more than 30 days old, it will download a new one. The TLDs in the file are used in the regex to validate the TLD of the domain you are checking. This prevents .ay (and other invalid TLDs) from passing validation.
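
The caching idea looks roughly like this (a hedged sketch; the cache file name and the exact iana.org URL are assumptions, not the actual domains.php code):

function get_tld_pattern($cache = 'tlds-alpha-by-domain.txt')
{
    // Refresh the local copy if it is missing or more than 30 days old.
    if (!file_exists($cache) || filemtime($cache) < time() - 30 * 86400) {
        $data = @file_get_contents('http://data.iana.org/TLD/tlds-alpha-by-domain.txt');
        if ($data !== false) {
            file_put_contents($cache, $data);
        }
    }
    $tlds = array();
    foreach (file($cache, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        if ($line[0] !== '#') {   // skip the comment header
            $tlds[] = preg_quote(strtolower($line), '/');
        }
    }
    // Alternation that can replace the usual ([a-z]{2,4}) shortcut in a regex.
    return '(?:' . implode('|', $tlds) . ')';
}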

This is a lengthy bit of code, but it is very compact considering what it does, and it is the most accurate. That's why I asked the question earlier: do you want to do "validation" or simple "checking"?

cdburgess
Your solution is pretty comprehensive, but probably overkill for what I'd like to achieve! I completely agree that for full domain validation the TLD should be an existing one, but I'm quite happy for it not to be. Feel free to tell me off for it, but all I really need to know is that the user has entered something that looks like a URL rather than a relative address, an email address, or whatever other weird data a user might enter when asked for a URL!
Rowan
p.s. I've bookmarked your code, potentially very useful!
Rowan
No worries. Again, that's why I ask: are you really looking for "validation" or simply "checking"? I see you want the latter, so the other solutions provided should be sufficient. As for telling you off, that's not what these types of sites are for. Life is too short to be that aggressive. You know what you want more than I know what you want. ;) Anyway, good luck with your project and Happy Coding!
cdburgess
BTW, the question you have posted has inspired me to extend my domain validation code to encompass URL validation. I am going to look into that for a future release.
cdburgess
That's actually a great thing to include in my URL class, actual validation. Thanks for the walkthrough :)
Kris
A: 

The solution posted by Kris seems the best. I spent hours trying to find the answer to this question. At first I started by checking out the Zend Framework validators, and they do not have one. They do have an internal URI checker, but it does not seem fit for general-purpose URL checking and it was not built for this either.

Then I checked filter_var($url, FILTER_VALIDATE_URL). As per "Which regexp php is using for filter_var($url, FILTER_VALIDATE_URL)?" (which gives direct references to the C source code), it just uses parse_url and then runs some checks to see if the result is an array, contains a hostname, and so on.

So the solution by Kris seems best because you have the most control and are still using parse_url.
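
To tie it back to the function in the question, a rough sketch of that fallback (the manual checks below are my own approximation, not exactly what filter_var does internally):

function validate_url($url)
{
    if (function_exists('filter_var')) {
        return (bool) filter_var($url, FILTER_VALIDATE_URL);
    }
    // Pre-5.2 fallback: parse_url plus a few manual checks.
    $parts = @parse_url($url);
    return is_array($parts)
        && isset($parts['scheme'], $parts['host'])
        && in_array(strtolower($parts['scheme']), array('http', 'https', 'ftp'))
        && strpos($parts['host'], '.') !== false;
}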

Side note: using a regex seems fairly impractical.

Going further, I would be interested to see if anyone does a preliminary check with parse_url and then uses cURL to check the HTTP response code for 200 or similar, where you actually check whether the link is real. I don't know if this would be wasteful or not, though.
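
A rough sketch of what that could look like (assuming the cURL extension is available; a HEAD request keeps it cheap, but it is still one network round trip per URL, so treat it as a liveness probe rather than validation):

function url_responds($url)
{
    // Preliminary sanity check with parse_url.
    $check = @parse_url($url);
    if (!is_array($check) || !isset($check['scheme'], $check['host'])) {
        return false;
    }
    // HEAD request via cURL; follow redirects and give up after 5 seconds.
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    // Treat 2xx and 3xx responses as "the link is real".
    return $status >= 200 && $status < 400;
}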