I've been looking for a simple regex for URLs. Does anybody have one handy that works well? I didn't find one in the Zend Framework validation classes, and I have seen several implementations.

Thanks

+1  A: 

I used this on a few projects. I don't believe I've run into issues, but I'm sure it's not exhaustive:

$text = preg_replace("
  #((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)#ie",
  "'<a href=\"$1\" target=\"_blank\">$3</a>$4'",
  $text
);

Most of the random junk at the end is there to deal with situations like http://domain.com. in a sentence (to avoid matching the trailing period). I'm sure it could be cleaned up, but since it worked I've more or less just copied it over from project to project.

Owen
This has been downvoted... can anyone explain why?
alex
Some things that jump out at me: use of alternation where character classes are called for (every alternative matches exactly one character); and the replacement shouldn't have needed the outer double-quotes (they were only needed because of the pointless /e modifier on the regex).
Alan Moore
The solution does not work for the simple case of 'google.com', although it could be argued that 'google.com' is not a valid URL.
John Scipione
@John Scipione: `google.com` is only a valid relative URL path but not a valid absolute URL. And I think that’s what he’s looking for.
Gumbo
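Following Alan Moore's comment above, here is a minimal sketch of how Owen's pattern might look with the single-character alternatives folded into a character class and the /e modifier dropped (that modifier was removed in PHP 7). The function name linkify is hypothetical, and the capture groups are renumbered because the protocol group is made non-capturing here. The sketch does not HTML-escape the matched URL, so treat it as an illustration rather than production code.

```php
<?php
// Sketch only: Owen's pattern rewritten with a character class for the
// one-character terminators and without the /e modifier.
// linkify() is a hypothetical name, not part of the original answer.
function linkify(string $text): string
{
    // $1 = full URL, $2 = everything after the scheme, $3 = terminator
    $pattern = '#((?:https?|ftp)://(\S*?\.\S*?))([\s;)\]\[{},"\':<]|\.\s|$)#i';

    return preg_replace($pattern, '<a href="$1" target="_blank">$2</a>$3', $text);
}

echo linkify('See http://domain.com. It works.'), "\n";
// See <a href="http://domain.com" target="_blank">domain.com</a>. It works.
```

The trailing period stays outside the link because the `\.\s` alternative consumes it as the terminator.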
A: 

I've used this one with good success - I don't remember where I got it from

$pattern = "/\b(?:(?:https?|ftp):\/\/|www\.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|]/i";
Peter Bailey
^(http://|https://)?(([a-z0-9]?([-a-z0-9]*[a-z0-9]+)?){1,63}\.)+[a-z]{2,6} (may be too greedy, not sure yet, but it's more flexible on protocol and leading www)
andrewbadera
A: 

There is one here.

Milen A. Radev
+1  A: 

there's also

http://www.php.net/filter

Galen
A: 

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems. -- jwz

Who says you need to use a regex? If you're trying to validate if a string is a URL, then use the parse_url function in PHP.
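As a rough sketch of the parse_url approach: the function splits a string into components, so a check can require that a scheme and a host are both present. The helper name looksLikeHttpUrl and the restriction to http/https are assumptions made for illustration. parse_url is lenient about what it accepts, so this is a structural check rather than strict validation.

```php
<?php
// Sketch only: looksLikeHttpUrl() is a hypothetical helper.
// It checks that parse_url() finds both a scheme and a host.
function looksLikeHttpUrl(string $candidate): bool
{
    $parts = parse_url($candidate);

    // parse_url() returns false on seriously malformed input
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false;
    }

    return in_array(strtolower($parts['scheme']), ['http', 'https'], true);
}

var_dump(looksLikeHttpUrl('http://example.com/path')); // bool(true)
var_dump(looksLikeHttpUrl('example.com'));             // bool(false): parsed as a path only
```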

Andy Lester
+39  A: 

Galen is right, filter_var() function is the best way to validate whether a string is URL or not.

var_dump(filter_var('example.com', FILTER_VALIDATE_URL));

It's bad practice to use regular expressions where it's not necessary.
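For illustration, a small sketch of what filter_var returns: the original string when validation passes and false when it fails, so the result should be compared with === false. The sample inputs are made up for the example.

```php
<?php
// Sketch only: the candidate strings are illustrative.
$candidates = ['http://example.com/path?x=1', 'example.com'];

foreach ($candidates as $candidate) {
    // Returns the input string on success, false on failure
    $result = filter_var($candidate, FILTER_VALIDATE_URL);
    echo $candidate, ' => ', ($result === false ? 'rejected' : 'accepted'), "\n";
}
// http://example.com/path?x=1 => accepted
// example.com => rejected
```

'example.com' is rejected because FILTER_VALIDATE_URL requires a scheme.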

Stanislav
this is definitely a great alternative, unfortunately it's php 5.2+ (unless you install the PECL version)
Owen
filter_var only works in PHP >= 5.2.0
John Scipione
There's a bug in 5.2.13 (and I think 5.3.2) that prevents urls with dashes in them from validating using this method.
vamin
filter_var will reject http://test-site.com. I have domain names with dashes, whether they are valid or not. I don't think filter_var is the best way to validate a URL. It will allow a URL like `http://www`
Cesar
> It will allow a url like 'http://www'

That is OK, in the same way that a URL like 'http://localhost' is OK.
Stanislav
+6  A: 

As per the PHP manual - parse_url should not be used to validate a URL.

Unfortunately, it seems that filter_var('example.com', FILTER_VALIDATE_URL) does not perform any better.

Both parse_url() and filter_var() will pass malformed URLs such as http://...

Therefore in this case - regex is the better method.

catchdave
This argument doesn't follow. If FILTER_VALIDATE_URL is a little more permissive than you want, tack on some additional checks to deal with those edge cases. Reinventing the wheel with your own attempt at a regex against urls is only going to get you further from a complete check.
Tchalvak
See all the shot-down regexes on this page for examples of why -not- to write your own.
Tchalvak
You make a fair point, Tchalvak. Regexes for something like URLs can (as per other responses) be very hard to get right. Regex is not always the answer. Conversely, regex is not always the wrong answer either. The important point is to pick the right tool (regex or otherwise) for the job and not be specifically "anti" or "pro" regex. In hindsight, your answer of using filter_var in combination with constraints on its edge cases looks like the better answer (particularly when regex answers start to get longer than 100 chars or so, making maintenance of said regex a nightmare).
catchdave
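One way the "filter_var plus extra checks" idea from these comments could look is sketched below. The helper name isHttpUrl and the specific extra rules (http/https only, a host of at least two dot-separated labels ending in an alphabetic TLD) are assumptions chosen for the example, not rules stated in the thread. Tighten or loosen them to suit the edge cases you care about. Note that these rules deliberately reject single-label hosts such as localhost.

```php
<?php
// Sketch only: isHttpUrl() and its extra rules are hypothetical.
function isHttpUrl(string $candidate): bool
{
    // Step 1: let the built-in filter reject clearly malformed input
    if (filter_var($candidate, FILTER_VALIDATE_URL) === false) {
        return false;
    }

    // Step 2: extra constraints on top of the permissive filter
    $scheme = strtolower((string) parse_url($candidate, PHP_URL_SCHEME));
    $host   = (string) parse_url($candidate, PHP_URL_HOST);

    if (!in_array($scheme, ['http', 'https'], true)) {
        return false;
    }

    // Each label starts and ends with a letter or digit; at least one dot
    $hostPattern = '/^(?:[a-z0-9](?:[-a-z0-9]{0,61}[a-z0-9])?\.)+[a-z]{2,}$/i';

    return preg_match($hostPattern, $host) === 1;
}

var_dump(isHttpUrl('http://example.com')); // bool(true)
var_dump(isHttpUrl('http://www'));         // bool(false): host has no dot
var_dump(isHttpUrl('example.com'));        // bool(false): no scheme
```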
+1  A: 

Edit:
As incidence pointed out this code has been DEPRECATED with the release of PHP 5.3.0 (2009-06-30) and should be used accordingly.


Just my two cents but I've developed this function and have been using it for a while with success. It's well documented and separated so you can easily change it.

// Checks if string is a URL
// @param string $url
// @return bool
function isURL($url = NULL) {
 if($url==NULL) return false;

 $protocol = '(http://|https://)';
 $allowed = '([a-z0-9]([-a-z0-9]*[a-z0-9]+)?)';

 $regex = "^". $protocol . // must include the protocol
    '(' . $allowed . '{1,63}\.)+'. // 1 or several sub domains with a max of 63 chars
    '[a-z]' . '{2,6}'; // followed by a TLD
 if(eregi($regex, $url)==true) return true;
 else return false;
}
Frankie
Eregi will be removed in PHP 6.0.0. And domains with "öäåø" will not validate with your function. You probably should convert the URL to punycode first?
incidence
@incidence absolutely agree. I wrote this in March and PHP 5.3 only came out late June setting eregi as DEPRECATED. Thank you. Gonna edit and update.
Frankie
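Since eregi is gone from current PHP versions, a preg_match-based variant of the same idea might look like the sketch below. The name isUrlPreg is hypothetical. The label pattern is also changed so that the length limit applies to the characters of each sub-domain label, which is an assumption about the intent of the {1,63} quantifier in the original. Like the original, it does not anchor the end of the string, and it does not handle internationalized domain names.

```php
<?php
// Sketch only: isUrlPreg() is a hypothetical preg_match port of isURL().
function isUrlPreg(?string $url = null): bool
{
    if ($url === null || $url === '') {
        return false;
    }

    $protocol = '(?:https?://)';                         // must include the protocol
    $label    = '[a-z0-9](?:[-a-z0-9]{0,61}[a-z0-9])?';  // one label, 1 to 63 chars

    // "~" is used as the delimiter because the pattern contains "/"
    $regex = '~^' . $protocol . '(?:' . $label . '\.)+[a-z]{2,6}~i';

    return preg_match($regex, $url) === 1;
}

var_dump(isUrlPreg('https://sub.example.co.uk/path')); // bool(true)
var_dump(isUrlPreg('http://-hello.com'));              // bool(false)
var_dump(isUrlPreg('example.com'));                    // bool(false): no protocol
```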
A: 

Peter's Regex doesn't look right to me for many reasons. It allows all kinds of special characters in the domain name and doesn't test for much.

Frankie's function looks good to me and you can build a good regex from the components if you don't want a function, like so:

^(http://|https://)(([a-z0-9]([-a-z0-9]*[a-z0-9]+)?){1,63}\.)+[a-z]{2,6}

Untested but I think that should work.

Also, Owen's answer doesn't look 100% either. I took the domain part of the regex and tested it on a Regex tester tool http://erik.eae.net/playground/regexp/regexp.html

I put the following line:

(\S*?\.\S*?)

in the "regexp" section and the following line:

-hello.com

under the "sample text" section.

The result allowed the minus character through, because \S matches any non-whitespace character.

Note the regex from Frankie handles the minus because it has this part for the first character:

[a-z0-9]

Which won't allow the minus or any other special character.
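The same comparison can be reproduced in PHP instead of the online tester. This is a small sketch: the ~ delimiters and the /i flag are assumptions, since the answer quotes the bare patterns only.

```php
<?php
// Owen's host fragment as pasted into the tester: \S accepts "-"
$loose = '/(\S*?\.\S*?)/';

// The combined pattern quoted above, wrapped in "~" delimiters
$strict = '~^(http://|https://)(([a-z0-9]([-a-z0-9]*[a-z0-9]+)?){1,63}\.)+[a-z]{2,6}~i';

var_dump(preg_match($loose, '-hello.com'));         // int(1): leading minus is matched
var_dump(preg_match($strict, 'http://-hello.com')); // int(0): first character must be [a-z0-9]
var_dump(preg_match($strict, 'http://hello.com'));  // int(1)
```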

joedevon