I'm looking for a regex that will allow me to validate whether or not a string is the reference to a website address, or a specific page in that website.

So it would match:

But not:

Any ideas? I can't quite figure out how to handle allowing the / at the end of the URL.

+4  A: 

Try this:

(http|ftp|https)://([a-zA-Z0-9\-\.]+)/?
yjerem
Remember that if you are using PHP, you need to escape the "/" or the regex will not compile :)
nlaq
Thanks Jeremy! That does the trick. @LaQuet - I'm using this in JavaScript actually, but thanks for the heads up.
amdfan
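Since the pattern ends up in JavaScript, here is a minimal sketch of the two ways it could be written there. In a regex literal the "/" characters have to be escaped, much like with PHP delimiters; with the RegExp constructor they do not, but every backslash has to be doubled because the pattern passes through a string first. The names urlLiteral and urlFromString are made up for illustration.

```javascript
// Regex literal: "/" delimits the pattern, so each "/" inside is escaped.
const urlLiteral = /(http|ftp|https):\/\/([a-zA-Z0-9\-\.]+)\/?/;

// RegExp constructor: no delimiter, so "/" is left alone, but each
// backslash is doubled because it goes through a string literal first.
const urlFromString = new RegExp("(http|ftp|https)://([a-zA-Z0-9\\-\\.]+)/?");

console.log(urlLiteral.test("http://example.com/"));   // true
console.log(urlFromString.test("ftp://example.com"));  // true
console.log(urlLiteral.test("example.com"));           // false, no protocol
```

Both forms are unanchored, so they report a match whenever a URL-like substring appears anywhere in the input.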
+2  A: 

Great answer by Jeremy. Depending on which regex dialect you're using to match, you might want to wrap the whole expression with anchors (to avoid matching URLs like http://example.com/bin/cgi?returnUrl=http://google.com), and maybe generalize the valid protocol and domain name characters:

^\w+://(\w+\.)+\w+/?$
Dov Wasserman
Good point, thanks for the info.
amdfan
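To make the effect of the anchors concrete, here is a small JavaScript sketch that runs the unanchored pattern and the anchored one against the same inputs. The variable names and sample strings (other than the returnUrl example quoted above) are illustrative assumptions.

```javascript
// Unanchored pattern from the first answer.
const unanchored = /(http|ftp|https):\/\/([a-zA-Z0-9\-\.]+)\/?/;
// Anchored, generalized pattern from the second answer.
const anchored = /^\w+:\/\/(\w+\.)+\w+\/?$/;

const samples = [
  "http://example.com",
  "http://example.com/",
  "http://example.com/bin/cgi?returnUrl=http://google.com",
  "see http://example.com for details",
];

// Print each sample with the result of both patterns.
for (const s of samples) {
  console.log(s, "| unanchored:", unanchored.test(s), "| anchored:", anchored.test(s));
}
```

The unanchored pattern reports a match for all four samples, while the anchored one accepts only the first two. One thing to keep in mind is that \w does not match a hyphen, so the anchored pattern would reject a host such as my-site.example.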
+3  A: 

This is a shortened version of my full URI validation pattern, based on the specification. I wrote this because the specification allows many characters never included in any validation pattern I've found on the web. You'll see that the user/pass (and in the second pattern, path and query string) are far more permissive than you'd have thought.

/^(https?|ftp):\/\/(?#                                      protocol
)(([a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+(?#         username
)(:([a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+)?(?#      password
)@)?(?#                                                     auth requires @
)((([a-z0-9][a-z0-9-]*[a-z0-9]\.)*(?#                       domain segments AND
)[a-z]{2}[a-z0-9-]*[a-z0-9](?#                              top level domain OR
)|((\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])\.){3}(?#
    )(\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])(?#             IP address
))(:\d+)?(?#                                                port
))\/?$/i

And since I've taken the time to break this out to be somewhat more readable, here is the complete pattern:

/^(https?|ftp):\/\/(?#                                      protocol
)(([a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+(?#         username
)(:([a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+)?(?#      password
)@)?(?#                                                     auth requires @
)((([a-z0-9][a-z0-9-]*[a-z0-9]\.)*(?#                       domain segments AND
)[a-z]{2}[a-z0-9-]*[a-z0-9](?#                              top level domain OR
)|((\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])\.){3}(?#
    )(\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])(?#             IP address
))(:\d+)?(?#                                                port
))(((\/+([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)*(?# path
)(\?([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)(?#      query string
)?)?)?(?#                                                   path and query string optional
)(#([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)?(?#      fragment
)$/i

Note that JavaScript regular expressions do not support (?#...) comments, so the comments have to be stripped before either pattern is used there.
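Because (?#...) comments are not available in JavaScript, one way to keep the annotations is to build the pattern from string pieces and put the notes in ordinary code comments. The following is only a sketch of the shortened pattern under a few assumptions: the names userChar, octet, parts and urlPattern are invented here, String.raw is used so backslashes do not need doubling, and the dot in the IP address part is placed outside the octet alternation so that it applies to every octet.

```javascript
// Characters allowed in the username and password parts.
const userChar = String.raw`([a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})`;
// One IPv4 octet, 0 to 255.
const octet = String.raw`(\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])`;

const parts = [
  String.raw`^(https?|ftp):\/\/`,                  // protocol
  `(${userChar}+`,                                 // username
  `(:${userChar}+)?`,                              // password
  `@)?`,                                           // auth requires @
  String.raw`((([a-z0-9][a-z0-9-]*[a-z0-9]\.)*`,   // domain segments AND
  String.raw`[a-z]{2}[a-z0-9-]*[a-z0-9]`,          // top level domain OR
  String.raw`|(${octet}\.){3}${octet}`,            // IP address
  String.raw`)(:\d+)?`,                            // port
  String.raw`)\/?$`,                               // optional trailing slash
];

// Join the pieces and compile once, case-insensitively.
const urlPattern = new RegExp(parts.join(""), "i");

console.log(urlPattern.test("http://example.com"));           // true
console.log(urlPattern.test("http://192.168.0.1:8080/"));     // true
console.log(urlPattern.test("http://user:pass@example.com")); // true
console.log(urlPattern.test("http://example.com/page"));      // false, paths need the complete pattern
```

The complete pattern could be assembled the same way by appending pieces for the path, query string and fragment.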

eyelidlessness
Wow, thanks for the outstanding answer. I think it's overkill for me - I'm using this regex more as a warning to the user rather than a requirement, so I prefer the simple version. But that is definitely an outstanding resource.
amdfan
I appreciate the kind words. I'm curious why you'd go with one that's less capable though? If nothing else, besides being written against the spec, it also allows for IP addresses and ports, neither of which are terribly uncommon for user-submitted URLs.
eyelidlessness