I have a text input, and I need to check whether the string is a valid web address, like http://www.example.com. How can this be done with regular expressions in PHP?

A: 

Found this:

(http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?

From Here:

http://stackoverflow.com/questions/591859/a-regex-that-validates-a-web-address-and-matches-an-empty-string
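For checking a whole input string, as the question asks, the pattern has to be anchored and given delimiters before PHP will accept it. Here is a minimal sketch, assuming backtick delimiters (chosen because a backtick does not occur in the pattern) and a helper name, `is_web_address`, that is made up for illustration:

```php
<?php
// The pattern quoted above, wrapped in backtick delimiters and anchored with
// ^ and $ so the whole string has to match, not just a part of it.
$pattern = '`^(http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?$`';

// is_web_address() is a hypothetical helper name, not a PHP built-in.
function is_web_address(string $input, string $pattern): bool
{
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match($pattern, $input) === 1;
}

var_dump(is_web_address('http://www.example.com', $pattern)); // bool(true)
var_dump(is_web_address('www.example.com', $pattern));        // bool(false)
```

The second call fails because the pattern requires an explicit http:// or https:// prefix.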

cbattlegear
Thanks, I will check this.
Edgar
"www.mywebsite.com" won't be a valid website here
Colin Hebert
@Colin HEBERT: `www.mywebsite.com` is not a valid website anywhere except when you type it into your address bar (where `http://` is assumed). In most other instances, it's assumed to be a filename (and hence would be a relative path). So whether you want it to validate depends on your exact use. (Personally, I would prepend `http://` if it is missing, and then run it through a check such as this, or `filter_var`)...
ircmaxell
@ircmaxell see comments on @Gabriel's post
Colin Hebert
A: 

You need to first understand a web address before you can begin to parse it effectively. Yes, http://www.example.com is a valid address. So is www.example.com. Or example.com. Or http://example.com. Or prefix.example.com.

Have a look at the specifications for a URI, especially the Syntax components.

Stephen
Thanks for the reference.
Edgar
A: 

I found the below from http://www.roscripts.com/PHP_regular_expressions_examples-136.html

//URL: Different URL parts
//Protocol, domain name, page and CGI parameters are captured into backreferences 1 through 4
'\b((?#protocol)https?|ftp)://((?#domain)[-A-Z0-9.]+)((?#file)/[-A-Z0-9+&@#/%=~_|!:,.;]*)?((?#parameters)\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?'

//URL: Different URL parts
//Protocol, domain name, page and CGI parameters are captured into named capturing groups.
//Works as it is with .NET, and after conversion by RegexBuddy on the Use page with Python, PHP/preg and PCRE.
'\b(?<protocol>https?|ftp)://(?<domain>[-A-Z0-9.]+)(?<file>/[-A-Z0-9+&@#/%=~_|!:,.;]*)?(?<parameters>\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?'

//URL: Find in full text
//The final character class makes sure that if an URL is part of some text, punctuation such as a 
//comma or full stop after the URL is not interpreted as part of the URL.
'\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]'

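//URL: Replace URLs with HTML links
//preg_replace() needs pattern delimiters (backticks here) and the i flag, because the character classes only list A-Z
preg_replace('`\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]`i', '<a href="\0">\0</a>', $text);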
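As a rough illustration of how the named-group variant might be used from PHP, the sketch below runs it through `preg_match()`. The backtick delimiters, the `i` flag, and the sample URL are assumptions added here; they are not part of the quoted patterns.

```php
<?php
// The named-group pattern from above, wrapped in backtick delimiters, with the
// i flag added because the character classes only list A-Z.
$pattern = '`\b(?<protocol>https?|ftp)://(?<domain>[-A-Z0-9.]+)'
         . '(?<file>/[-A-Z0-9+&@#/%=~_|!:,.;]*)?'
         . '(?<parameters>\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?`i';

// The sample URL is made up for illustration.
$url = 'http://www.example.com/index.php?id=1';

if (preg_match($pattern, $url, $parts)) {
    echo $parts['protocol'], "\n";   // http
    echo $parts['domain'], "\n";     // www.example.com
    echo $parts['file'], "\n";       // /index.php
    echo $parts['parameters'], "\n"; // ?id=1
}
```

Because the pattern starts with `\b` and has no `^` or `$`, it finds a URL anywhere in the text; add anchors if the whole string must be a URL.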
Gabriel
"www.mywebsite.com" won't be a valid website here
Colin Hebert
@Colin HEBERT: `www.mywebsite.com` is not an absolute URL; it will only be interpreted as a URL path.
Gumbo
@Gumbo, still, it's a valid website, the "http://" part could/should be added at the output if needed.
Colin Hebert
@Colin HEBERT: It’s just the tolerance of the web browser to add that if entered into the location bar. But it’s still not a valid absolute URL.
Gumbo
@Gumbo, and a website that requires a URL should be just as tolerant. It's quite easy to check whether the given URL starts with [protocol]:// and add http:// if it doesn't. For instance in your profile on SO :)
Colin Hebert
@Colin HEBERT: That depends on how this regular expression is to be used. If it is to match a string it’s fine to use such a scheme. But if it is to search for a URL in a text, then this might not be a good solution.
Gumbo
@Gumbo, You're right, but I believe that @Edgar wants to check if the whole string is a valid website, so I suppose that's a simple field like the web-site of a user/member.
Colin Hebert
@Colin HEBERT: A web site is a collection of web pages and not a URL that describes only the location of it. So please use the proper terms.
Gumbo
But should `command.com` be a URL? No. It's an ambiguous string. Does it refer to a host? Possibly. Does it refer to a file? Possibly. It's context sensitive. And you need to be aware that without a context, it's impossible to do this correctly 100% of the time. You could say that anything with a leading `http://` OR a path afterwards (Such as `example.com/foo`) should be transformed into a URL, but note that only the first case is actually a URL...
ircmaxell
@Gumbo, you're right, my bad, but I can't edit my comments now. @ircmaxell Yes, it's context sensitive, and in this context he needs to make sure that the string passed is a valid *URL*, so even if it's only a possible URL, you have to suppose that it is one (unless you're in a context that makes this data really sensitive).
Colin Hebert
+1  A: 

Use the filter extension:

filter_var($url, FILTER_VALIDATE_URL);

This will be far more robust than any regex you can write.
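A short sketch of how the call could be wrapped follows. `filter_var()` returns the filtered string on success and `false` on failure, so the result is compared against `false` explicitly. The helper names `is_valid_url` and `normalize_url`, the restriction to http/https, and the "prepend http:// when no scheme is given" step (suggested in the comments on an earlier answer) are assumptions for illustration, not part of the filter extension.

```php
<?php
// is_valid_url() is a hypothetical helper name.
function is_valid_url(string $url): bool
{
    return filter_var($url, FILTER_VALIDATE_URL) !== false;
}

// normalize_url() is a hypothetical helper: it prepends http:// when the
// input has no scheme, validates the result, and only accepts http/https.
// It returns the normalized URL, or null when the input is rejected.
function normalize_url(string $input): ?string
{
    $input = trim($input);

    // No "scheme://" prefix: assume http, as a browser address bar would.
    if (!preg_match('~^[a-z][a-z0-9+.-]*://~i', $input)) {
        $input = 'http://' . $input;
    }

    if (filter_var($input, FILTER_VALIDATE_URL) === false) {
        return null;
    }

    // FILTER_VALIDATE_URL also accepts other schemes (ftp:// for example),
    // so check the scheme separately.
    $scheme = strtolower((string) parse_url($input, PHP_URL_SCHEME));

    return in_array($scheme, ['http', 'https'], true) ? $input : null;
}

var_dump(is_valid_url('http://www.example.com')); // bool(true)
var_dump(is_valid_url('www.example.com'));        // bool(false): no scheme
var_dump(normalize_url('www.example.com'));       // string(22) "http://www.example.com"
var_dump(normalize_url('ftp://example.com'));     // NULL
```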

nikic
That sounds interesting, thanks nikic.
Edgar
A: 

In most cases you don't have to check if a string is a valid address.

Either it is, and a website will be available, or it isn't, and the user will simply go back.

You only need to escape illegal characters to avoid XSS. If your user doesn't want to give a valid website, that should be their problem.

(In most cases).

PS: If you still want to check URLs, look at nikic's answer.
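The escaping step could look roughly like the sketch below, assuming the URL is written into an HTML link. `render_link` is a hypothetical helper name and the sample URL is made up.

```php
<?php
// render_link() is a hypothetical helper: it escapes the user-supplied URL
// before placing it in an attribute and in the link text.
function render_link(string $url): string
{
    // ENT_QUOTES escapes both double and single quotes.
    $safe = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');

    return '<a href="' . $safe . '">' . $safe . '</a>';
}

echo render_link('http://example.com/?a=1&b="x"'), "\n";
// <a href="http://example.com/?a=1&amp;b=&quot;x&quot;">http://example.com/?a=1&amp;b=&quot;x&quot;</a>
```

Note that `htmlspecialchars()` leaves a value such as `javascript:alert(1)` unchanged, so escaping alone does not make an arbitrary string safe to use as a link target; checking the scheme as well is a reasonable precaution.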

Colin Hebert
+1  A: 

To match more protocols, you can do:

((https?|s?ftp|gopher|telnet|file|notes|ms-help)://)?[\w:#@%/;$()~=\.&-]+
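A quick way to see what this pattern accepts is to anchor it and try a few inputs. The sketch below assumes backtick delimiters and `^`/`$` anchors, which are added here and are not part of the pattern as posted; the sample strings are made up.

```php
<?php
// The pattern above, anchored so the whole string has to match.
$pattern = '`^((https?|s?ftp|gopher|telnet|file|notes|ms-help)://)?[\w:#@%/;$()~=\.&-]+$`';

var_dump(preg_match($pattern, 'sftp://example.com/file.txt') === 1); // bool(true)
var_dump(preg_match($pattern, 'www.example.com') === 1);             // bool(true)
var_dump(preg_match($pattern, 'hello') === 1);                       // bool(true)
var_dump(preg_match($pattern, 'two words') === 1);                   // bool(false)
var_dump(preg_match($pattern, 'http://example.com/?a=1') === 1);     // bool(false)
```

Since the protocol part is optional, any run of the allowed characters matches, including a plain word such as `hello`. The character class does not contain `?`, so URLs with a query string are rejected once the pattern is anchored.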
M42