views: 355
answers: 5

Hi, I'm looking for a decent regex to match a URL (a full URL with scheme, domain, path, etc.). I would normally use filter_var, but I can't in this case as I have to support PHP < 5.2!

I've searched the web but can't find anything that I'm confident will be fool-proof, and all I can find on SO is people saying to use filter_var.

Does anybody have a regex that they use for this?

My code (just so you can see what I'm trying to achieve):

function validate_url($url){
    if (function_exists('filter_var')){
        return filter_var($url, FILTER_VALIDATE_URL);
    }
    return preg_match(REGEX_HERE, $url);
}
+1  A: 

I've seen a regex that could actually validate any kind of valid URL but it was two pages long...

You're probably better off parsing the URL with parse_url and then checking that all of your required bits are in order.

Addition: This is a snip of my URL class:

public static function IsUrl($test)
{
    // Reject anything containing whitespace outright.
    if (strpos($test, ' ') !== false)
    {
        return false;
    }
    if (strpos($test, '.') > 1)
    {
        $check = @parse_url($test);
        return is_array($check)
            && isset($check['scheme'])
            && isset($check['host']) && count(explode('.', $check['host'])) > 1;
    }
    return false;
}

It tests the given string and requires some basics in the URL, namely that a scheme is set and the hostname contains a dot.
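
For instance, a quick sketch of how it behaves (the class name here is just a placeholder):

var_dump(Validator::IsUrl('http://example.com/path')); // bool(true)
var_dump(Validator::IsUrl('example.com'));             // bool(false) - no scheme
var_dump(Validator::IsUrl('http://localhost'));        // bool(false) - no dot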

Kris
+1 for parse_url.
Frank Farmer
As stated in the comments above, “This function is not meant to validate the given URL”; that's not the behavior it is intended for. Regex is meant for matching/replacing patterns in a string and is optimized for that, whereas what you're suggesting could potentially involve a lot of logic.
Rowan
-1 for parse_url. It even parses `http://..`, and that isn't a valid URL.
poke
@poke: that's why you have to check if it returns the bits you require. Reading isn't that hard, is it?
Kris
It may not be meant for validating a URL, but it's better than half the URL-validating regexes you'll find out there.
Frank Farmer
A: 
!(https?://)?([-_a-z0-9]+\.)*([-_a-z0-9]+)\.([a-z]{2,4})(/?)(.*)!i

I use this regular expression for validating URLs. So far it hasn't failed me a single time :)
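
A quick usage sketch (note the pattern is unanchored and the scheme is optional, so it will also match a URL embedded in longer text):

$pattern = '!(https?://)?([-_a-z0-9]+\.)*([-_a-z0-9]+)\.([a-z]{2,4})(/?)(.*)!i';
var_dump((bool) preg_match($pattern, 'http://example.com/page'));          // true
var_dump((bool) preg_match($pattern, 'no dots here'));                     // false
var_dump((bool) preg_match($pattern, 'some text mentioning example.com')); // true - unanchored, so beware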

bisko
+1  A: 

You could try this one. I haven't tried it myself but it's surely the biggest regexp I've ever seen, haha.

^(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/|~\/|\/)?(?#Username:Password)(?:\w+:\w+@)?(?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevel Domains)(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?#Port)(?::[\d]{1,5})?(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d]{2})+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d]{2})+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?$
aefxx
I'll give it a go. (I just tried to paste a huge email regex I use in here but it was bigger than the allocated 600 characters :S)
Rowan
Just a note, the regexp can be dangerous when dealing with custom TLDs.
poke
Could you elaborate @poke?
Rowan
What poke means is that if the TLD you're using isn't whitelisted in the regex, it will fail the URL. So if you forget to allow for a .tv domain name, all .tv domain names will be disallowed. That's only true if you actually use a TLD whitelist though (which this regex DOES seem to do, but it also allows for any 2-char TLD).
Kris
Ah OK, I will probably modify it slightly then. If I just match a-z\. rather than testing against a list, then I know that I'll catch everything. I'm not too fussed about an invalid TLD.
Rowan
+1  A: 

I have created a solution for validating the domain. While it does not specifically cover the entire URL, it is very detailed and specific. The question you need to ask yourself is, "Why am I validating a domain?" If it is to see if the domain actually could exist, then you need to confirm the domain (including valid TLDs). The problem is, too many developers take the shortcut of ([a-z]{2,4}) and call it good. If you think along these lines, then why call it URL validation? It's not. It's just passing the URL through a regex.

I have an open source class that will allow you to validate the domain not only using the single source for TLD management (iana.org), but that will also validate the domain via DNS records to make sure it actually exists. The DNS validation is optional, but the domain will still be specifically validated based on its TLD.

For example: example.ay is NOT a valid domain, as the .ay TLD is invalid. But using the regex posted here ([a-z]{2,4}), it would pass. I have an affinity for quality, and I try to express that in the code I write. Others may not really care. So if you want to simply "check" the URL, you can use the examples listed in these responses. If you actually want to validate the domain in the URL, you can have a look at the class I created to do just that. It can be downloaded at: http://code.google.com/p/blogchuck/source/browse/trunk/domains.php

It validates based on the RFCs that "govern" (using the term loosely) what determines a valid domain. In a nutshell, here are the basic rules of the domain validation the class enforces (a rough sketch of the label and length rules follows the list):

  • must be at least one character long
  • must start with a letter or number
  • contains letters, numbers, and hyphens
  • must end in a letter or number
  • may contain multiple nodes (i.e. node1.node2.node3)
  • each node can only be 63 characters long max
  • total domain name can only be 255 characters long max
  • must end in a valid TLD
  • can be an IPv4 address
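
A rough sketch of just the label and length rules above (this is not the domains.php class itself; the TLD whitelist and DNS checks are left out, and the helper name is made up):

function is_plausible_domain($domain)
{
    // Total length: 1 to 255 characters.
    if (strlen($domain) < 1 || strlen($domain) > 255) {
        return false;
    }
    // Accept a bare IPv4 address (octet range not checked here).
    if (preg_match('/^\d{1,3}(\.\d{1,3}){3}$/', $domain)) {
        return true;
    }
    foreach (explode('.', $domain) as $label) {
        // Each node: 1-63 chars, letters/digits/hyphens,
        // must start and end with a letter or number.
        if (!preg_match('/^[a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?$/i', $label)) {
            return false;
        }
    }
    return true;
}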

It will also download a copy of the master TLD file from iana.org, but only after checking your local copy. If your local copy is more than 30 days old, it will download a new one. The TLDs in the file are used in the regex to validate the TLD of the domain you are checking. This prevents .ay (and other invalid TLDs) from passing validation.
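
The caching idea looks roughly like this (a hedged sketch; the cache file name and the exact iana.org URL are assumptions, not the actual domains.php code):

function get_tld_pattern($cache = 'tlds-alpha-by-domain.txt')
{
    // Refresh the local copy if it is missing or more than 30 days old.
    if (!file_exists($cache) || filemtime($cache) < time() - 30 * 86400) {
        $data = @file_get_contents('http://data.iana.org/TLD/tlds-alpha-by-domain.txt');
        if ($data !== false) {
            file_put_contents($cache, $data);
        }
    }
    $tlds = array();
    foreach (file($cache, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        if ($line[0] !== '#') {   // skip the comment header
            $tlds[] = preg_quote(strtolower($line), '/');
        }
    }
    // Alternation that can replace the usual ([a-z]{2,4}) shortcut in a regex.
    return '(?:' . implode('|', $tlds) . ')';
}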

This is a lengthy bit of code, but it is very compact considering what it does, and it is the most accurate. That's why I asked the question earlier: do you want to do "validation" or simple "checking"?

cdburgess
Your solution is pretty comprehensive, but probably overkill for what I'd like to achieve! I completely agree that for full domain validation the TLD should be an existing one, but I'm quite happy for it not to be. Feel free to tell me off for it, but all I really need to know is that the user has entered something that looks like a URL rather than a relative address, an email address, or whatever other weird data a user might enter when asked for a URL!
Rowan
p.s. I've bookmarked your code, potentially very useful!
Rowan
No worries. Again, that's why I ask: are you really looking for "validation" or simply "checking"? I see you want the latter, so the other solutions provided should be sufficient. As for telling you off, that's not what these types of sites are for. Life is too short to be that aggressive. You know what you want more than I know what you want. ;) Anyway, good luck with your project and Happy Coding!
cdburgess
BTW, the question you have posted has inspired me to extend my domain validation code to encompass URL validation. I am going to look into that for a future release.
cdburgess
That's actually a great thing to include in my URL class, actual validation. Thanks for the walkthrough :)
Kris
A: 

The solution posted by Kris seems the best. I spent hours trying to find the answer to this question. At first I started by checking out the Zend Framework validators, and they do not have one. They do have an internal URI checker, but it does not seem fit for general-purpose URL checking and it was not built for this either.

Then I checked filter_var($url, FILTER_VALIDATE_URL). As per "Which regexp php is using for filter_var($url, FILTER_VALIDATE_URL)?" (which gives direct references to the C source code), it just uses parse_url and then runs some checks to see if the result is an array, contains a hostname, and so on.

So the solution by Kris seems best because you have the most control and are still using parse_url.
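
To tie it back to the function in the question, a rough sketch of that fallback (the manual checks below are my own approximation, not exactly what filter_var does internally):

function validate_url($url)
{
    if (function_exists('filter_var')) {
        return (bool) filter_var($url, FILTER_VALIDATE_URL);
    }
    // Pre-5.2 fallback: parse_url plus a few manual checks.
    $parts = @parse_url($url);
    return is_array($parts)
        && isset($parts['scheme'], $parts['host'])
        && in_array(strtolower($parts['scheme']), array('http', 'https', 'ftp'))
        && strpos($parts['host'], '.') !== false;
}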

Side note: using a regex seems fairly impractical.

Going further, I would be interested to see if anyone does a preliminary check with parse_url and then uses cURL to check the HTTP response code for 200 or similar, where you actually check whether the link is real. I don't know if this would be wasteful or not, though.
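
A rough sketch of what that could look like (assuming the cURL extension is available; a HEAD request keeps it cheap, but it is still one network round trip per URL, so treat it as a liveness probe rather than validation):

function url_responds($url)
{
    // Preliminary sanity check with parse_url.
    $check = @parse_url($url);
    if (!is_array($check) || !isset($check['scheme'], $check['host'])) {
        return false;
    }
    // HEAD request via cURL; follow redirects and give up after 5 seconds.
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    // Treat 2xx and 3xx responses as "the link is real".
    return $status >= 200 && $status < 400;
}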