views: 368 · answers: 12

What is the best way to determine whether a string represents a web address? I need to let the user enter a web address into a form, but how do I validate the input? The user should be allowed to enter strings like "http://www.google.com" or "www.vg.no", but he shouldn't be required to enter the "http://". Also, there are web pages like "tv2.no" which are harder to validate. If I check whether the string contains "www" or "http://" I have a strong clue, but I'm still not 100% sure. Can I ever BE 100% sure? I don't think so, but maybe some of the fine minds here at SO can enlighten me?

+2  A: 

How about using a Regular Expression?

The exact means of implementation will depend on the language you're using.

Richard Ev
I think we all know regex can be used for pattern matching; I think he's asking for a heuristic to allow human-readable 'URLs' to be accepted, i.e. slashdot.org instead of http://slashdot.org
roe
Aren't things like slashdot.org simply a subset of the strings that the regex should accept?
Richard Ev
+1  A: 

The easiest way to be reasonably sure is to use a regular expression that makes sure you have at least two components of the domain name. That way you can handle most bad cases. It should look something like this:

/^(http:\/\/)?(\w+)(\.\w+)+$/
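
A minimal sketch of that check, using Python purely for illustration (the helper name is my own):

```python
import re

# The two-component heuristic: optional "http://", then at least two
# dot-separated word components, so "tv2.no" passes but "wonky" does not.
URL_RE = re.compile(r"^(http://)?(\w+)(\.\w+)+$")

def looks_like_url(text):
    return URL_RE.match(text) is not None

print(looks_like_url("www.vg.no"))     # True
print(looks_like_url("tv2.no"))        # True
print(looks_like_url("wonky donkey"))  # False: the space matches neither \w nor \.
```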
Ola Bini
"wonky donkey" passes that regex, and isn't a valid address (containing a space and all)
Rowland Shaw
Nope, it doesn't. The backslash before the dot is very important, and "wonky donkey" doesn't have any dots in it. That said, you're correct that it shouldn't be .*?; it's probably better to use something like [[:alpha:]]*
Ola Bini
What if there is a port number? What if there is a query string in the URL? It is much better to use a tested regexp than to try to reinvent the wheel yourself.
zoul
Absolutely. It all depends on what level of certainty you want. That's why I didn't say my expression was the solution, just an example of how a simple version could look.
Ola Bini
A: 

If you don't want to require that they enter http:// (or https://) then the only thing you can really go on is whether the string contains a "." (I assume you don't need to handle "internal" servers?). You could also validate against known domains and check for invalid characters, but beyond that pretty much anything goes.

As for the actual implementation, regex would be the way to go if you can stomach it... there are no doubt countless examples of validating URLs if you Google for them.

Steven Robbins
A: 

If you're not going to enforce that it be a valid URI (i.e. you make the scheme optional), then the only real option is to try to connect to it via HTTP.

Rowland Shaw
A: 

I think the quickest way to do this would be through a regular expression test. This, however, will not prove whether it's a valid URL.

FailBoy
+3  A: 

First, try to validate if the input text is a well-formed URL by using regular expressions. If the check is OK, try a DNS lookup to validate if the host is known. Don't forget the special case of localhost or 127.0.0.1. Also take care of hosts specified by their IP address. If these checks are OK, you may want to try an actual connection.

If these checks fail, you can modify the input text and check again. Possible modifications include:

  • prepend http://
  • prepend www.
  • append .com, .org, .net, whatever
  • append :8080, :8888, whatever
  • mix any of the above solutions
  • try also prepending file:/// for local access
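
A rough illustration of the modify-and-recheck loop in Python (the candidate list, pattern, and function name are only examples, not a tested recipe):

```python
import re

# Loose well-formedness check: optional scheme, at least two name
# components, optional port and path (illustrative only).
WELL_FORMED = re.compile(r"^(https?://)?([\w-]+)(\.[\w-]+)+(:\d{1,5})?(/.*)?$")

def candidates(text):
    """Return the input plus modified variants that look well-formed."""
    variants = [text,
                "http://" + text,
                "http://www." + text,
                "http://" + text + ".com",
                "http://" + text + ".org"]
    return [v for v in variants if WELL_FORMED.match(v)]

print(candidates("tv2"))
# ['http://www.tv2', 'http://tv2.com', 'http://tv2.org']
```

Each surviving candidate could then go through the DNS lookup or connection checks described above.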
mouviciel
+1 for the intent of maximizing usability. Also, some browsers etc will give a Google search if you just enter "Barcelona" in the address bar, and that is not always a bad thing (though of course it may be a bad thing in the OP's context - which he should have explained better).
Daniel Daranas
I would feel uneasy about connecting to the given URL because of the security implications, especially if the poster is not 100% sure what he is doing.
zoul
I think it could be safe if 'connecting' means only checking success or failure without recursively downloading every inline image, javascript, CSS, ... It may be performed for instance with text-based lynx.
mouviciel
Yes, but there's always the possibility that somebody passes something like www.site.com/delete.php?all and hides his IP from the victim, or people could pass file:///usr/lib/foo and check if the file exists on your system, etc.
zoul
To put it another way: I would not cross the line of "100% safe" for something as minor as a URL check. There are also additional problems: what if the target site is down now and will be back in ten minutes?
zoul
These are very interesting points, together with the ones mentioned by Ola Bini (e.g., the denial-of-service argument). The regex solution does not have these drawbacks, but it ensures only that a text is a well-formed URL, not a valid web address. No solution is ideal; the answer may lie somewhere in the middle.
mouviciel
You could do the DNS check without the connection check. This should stop issues with security.
seanyboy
A DNS check ensures that the host is known, not that it runs an HTTP server. And in the end, verifying that a text is a valid web address suggests that the user will want to use that text to connect to the address.
mouviciel
A: 

See Regexp::Common on CPAN, especially R::C::URI and R::C::URI::http. Even if you can’t use the modules themselves, there are the regular expressions in the source. This is a good start.

zoul
+3  A: 

Notice that the following two are also valid web addresses. Do you want to allow them?

  • localhost
  • 208.77.188.166
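
If you do want to accept them, these special cases are easy to check explicitly; a small Python sketch (the helper is my own, not from the thread):

```python
import ipaddress

def is_ip_or_localhost(text):
    """Accept "localhost" and literal IPv4/IPv6 addresses."""
    if text == "localhost":
        return True
    try:
        ipaddress.ip_address(text)
        return True
    except ValueError:
        return False

print(is_ip_or_localhost("208.77.188.166"))  # True
print(is_ip_or_localhost("www.vg.no"))       # False
```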
Konrad Rudolph
+1  A: 

Can you do a DNS lookup from your application? This will get around any "I'm not sure if it's a real address" doubts.
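
For example, in Python (a sketch; `host_exists` is my own illustrative name):

```python
import socket

def host_exists(host):
    """True if the name resolves via DNS (literal IPs also succeed)."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False

print(host_exists("localhost"))  # True
```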

Greg B
+1  A: 

You could use the validation feature of Zend_Uri.

Jack Sleight
+5  A: 

Apologies for the ensuing expression, but it seems to capture most (if not all) cases:

^(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/|~/|/)?(?#Username:Password)(?:\w+:\w+@)?
(?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevelDomains)
(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?#Port)
(?::[\d]{1,5})?(?#Directories)(?:(?:(?:/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|/)+|\?|#)?
(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d]{2})+=(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)
(?:&(?:[-\w~!$+|.,*:]|%[a-f\d]{2})+=(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)
(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})+)?(?#What not to end in)[^.!,:;?]$
Learning
In your section #TopLevelDomains, you can add whatever seems appropriate to your needs. I think of local LANs with .corp or .local domains.
mouviciel
+1, but where does this come from? Is there a testing suite? There are already too many sites with botched URL validation where somebody blindly copied a regexp from a web forum…
zoul
Don't forget the oft used .museum!
Simucal
+1 for giving the expression and not only the typical "how about using regex" or similar generic answers. This is what's making StackOverflow *really* helpful
Kai
+1 Well deserved. :)
Aaron Digulla
+3  A: 

My recommendation would be not to validate strictly at all. Instead, use a regular-expression-based approach, and if that doesn't match you can give a soft warning: "What you wrote doesn't look like a valid address. Are you sure this is what you want to write?"

Definitely do not follow the idea of trying to connect to the address. That would open you up to all kinds of nasty security problems, including having your web site used for denial-of-service attacks against other web sites. That could land you in legal trouble.

Doing a DNS lookup is costly, but viable if you decide it's worth the cost.
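
A sketch of that soft-warning flow, in Python for illustration (the pattern and names are only examples):

```python
import re

# Loose plausibility pattern: optional scheme, two-plus name components,
# optional path (illustrative only, not a full URL grammar).
LOOKS_LIKE_URL = re.compile(r"^(https?://)?[\w-]+(\.[\w-]+)+(/\S*)?$")

def warning_for(text):
    """Return a soft warning string, or None if the input looks plausible."""
    if LOOKS_LIKE_URL.match(text):
        return None
    return ("What you wrote doesn't look like a valid address. "
            "Are you sure this is what you want to write?")

print(warning_for("www.vg.no"))  # None
print(warning_for("not a url"))  # prints the warning text
```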

Ola Bini