Can a single regex be used to validate URLs and match all the parts? I have been working on one, and what I have come up with so far is:

(?:(?P<scheme>[a-z]*?)://)?(?:(?P<username>.*?):?(?P<password>.*?)?@)?(?P<hostname>.*?)/(?:(?:(?P<path>.*?)\?)?(?P<file>.*?\.[a-z]{1,6})?(?:(?:(?P<query>.*?)#?)?(?P<fragment>.*?)?)?)?

However, this does not work. It should match all of the following examples:

http://username:[email protected]/path?arg=value#anchor
http://www.domain.com/
http://www.doamin.co.uk/
http://www.yahoo.com/
http://www.google.au/
https://username:[email protected]/
ftp://user:[email protected]/path/
https://www.blah1.subdoamin.doamin.tld/
domain.tld/#anchor
doamin.tld/?query=123
domain.co.uk/
domain.tld
http://www.domain.tld/index.php?var1=blah
http://www.domain.tld/path/to/index.ext
mailto://[email protected]

and provide a named capture for all the components:

scheme, e.g. http, https, ftp, ftps, callto, mailto, and any other scheme not listed
username
password
hostname, including subdomains, domain and TLD
path, e.g. /images/profile/
filename, e.g. file.ext
query string, e.g. ?foo=bar&bar=foo
fragment, e.g. #anchor

With the hostname as the only mandatory field.

We can assume that this is coming from a form specifically asking for a URL, and is not going to be used to find links in text.

+3  A: 

Modified version of mingfai's regular expression:

/^((?P<scheme>https?|ftp):\/)?\/?((?P<username>.*?)(:(?P<password>.*?)|)@)?(?P<hostname>[^:\/\s]+)(?P<port>:([^\/]*))?(?P<path>(\/\w+)*\/)(?P<filename>[-\w.]+[^#?\s]*)?(?P<query>\?([^#]*))?(?P<fragment>#(.*))?$/
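As a rough way to try the pattern out, here is a minimal Python sketch. It assumes the expression is ported unchanged to Python's `re` module with the surrounding `/.../` delimiters dropped; the names `URL_RE` and `split_url` are made up for this illustration. It also shows one known gap: because the path group requires a `/`, a bare `domain.tld` from the question's list is not matched.

```python
import re

# The pattern above, split across lines for readability (delimiters removed).
URL_RE = re.compile(
    r"^((?P<scheme>https?|ftp):\/)?\/?"
    r"((?P<username>.*?)(:(?P<password>.*?)|)@)?"
    r"(?P<hostname>[^:\/\s]+)"
    r"(?P<port>:([^\/]*))?"
    r"(?P<path>(\/\w+)*\/)"
    r"(?P<filename>[-\w.]+[^#?\s]*)?"
    r"(?P<query>\?([^#]*))?"
    r"(?P<fragment>#(.*))?$"
)


def split_url(url):
    """Return a dict of the named groups, or None if the pattern does not match."""
    match = URL_RE.match(url)
    return match.groupdict() if match else None


if __name__ == "__main__":
    samples = [
        "http://www.domain.tld/index.php?var1=blah",
        "http://www.domain.tld/path/to/index.ext",
        "domain.tld/#anchor",
        "domain.tld",  # no trailing slash, so the mandatory path group fails
    ]
    for sample in samples:
        print(sample, "->", split_url(sample))
```

Note that the `query` and `fragment` groups keep their leading `?` and `#`, so strip those if only the bare values are wanted.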
strager
This matches the following: [email protected]/path?arg=value#anchor, [email protected]/path/, http://www.domain.tld/index.php?var1=blah, http://www.domain.tld/path/to/index.php so it clearly does not work
Unkwntech
@Unkwntech, Whoops, forgot the obvious username/password deal! Will edit promptly.
strager
It's getting better but it still does not match http://www.domain.com/ or https://username:[email protected]/
Unkwntech
Okay ... I've actually taken the time to test it, and made it work (hopefully).
strager
Wow, it seems to work... Time to test the hell out of it.
Unkwntech
Seems to 99% work... +1 +A
Unkwntech
+3  A: 

Can a single regex be used to validate URLs and match all the parts?

No.

strager's regex is impressive, but at the end of the day it's less readable, maintainable and reliable than just using a proper URI parser. It necessarily rejects valid URIs and accepts strings that are not URIs, because the rules of formatting URIs cannot be fully expressed in a regex.

mailto://[email protected]

There shouldn't be a '//' in a mailto URI. You can't tell what format the remainder (post-:) of the URI is going to be until you've read the scheme; many URI schemes do not conform to the credentials@host/path format. Best to accept only specific schemes where you know how to parse their URIs.
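To illustrate the parser-based approach, here is a minimal sketch using `urlsplit` from Python's standard `urllib.parse`. The whitelist `ALLOWED_SCHEMES`, the function name `parse_url`, the `http` default for scheme-less input, and the rule for splitting the path into directory and filename are all assumptions made for this example, not something prescribed by the URI specification.

```python
from urllib.parse import urlsplit

# Assumption: only schemes that follow the credentials@host/path layout are accepted.
ALLOWED_SCHEMES = {"http", "https", "ftp"}


def parse_url(text, default_scheme="http"):
    """Split a URL into the components asked for in the question.

    Raises ValueError for unsupported schemes or a missing hostname.
    """
    parts = urlsplit(text)
    # Bare input such as "domain.tld/#anchor" has no scheme and no "//",
    # so urlsplit treats it all as a path; retry with an assumed scheme.
    if not parts.scheme and not parts.netloc:
        parts = urlsplit(f"{default_scheme}://{text}")
    if parts.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("hostname is required")
    # Assumption: everything after the last "/" is the filename.
    head, sep, filename = parts.path.rpartition("/")
    return {
        "scheme": parts.scheme,
        "username": parts.username,
        "password": parts.password,
        "hostname": parts.hostname,
        "path": head + sep or None,
        "filename": filename or None,
        "query": parts.query or None,
        "fragment": parts.fragment or None,
    }


if __name__ == "__main__":
    print(parse_url("http://www.domain.tld/path/to/index.ext"))
    print(parse_url("domain.tld/#anchor"))
```

With this sketch a `mailto:` address is rejected outright rather than being forced into the host/path shape. Input of the form `host:port/path` with no scheme is not handled, because `urlsplit` may read the part before the colon as a scheme.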

bobince