What is the best way to determine if a string represents a web address? I need to let the user enter a web address into a form, but how do I validate the input? The user should be allowed to enter strings like "http://www.google.com" or "www.vg.no", but shouldn't be required to enter the "http://". Also, there are web pages like "tv2.no", which are harder to validate. If I check whether the string contains "www" or "http://" I have a strong clue, but I'm still not 100% sure. Can I ever be 100% sure? I don't think so, but maybe some of the fine minds here at SO can enlighten me?
How about using a Regular Expression?
The exact means of implementation will depend on the language you're using.
The easiest way to be reasonably sure is to use a regular expression that checks for at least two components in the domain name. That way you can catch most bad input. It should look something like this:
/^(http:\/\/)?(\w+)(\.\w+)+$/
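As a rough illustration, the pattern above could be applied in Python as follows. The function name `looks_like_address` is made up for this sketch, and the pattern is used unchanged, so it inherits its limits: `https://` and host names containing a hyphen are rejected because `\w` does not match "-".

```python
import re

# The pattern from the answer above, as a Python raw string.
# No need to escape "/" in Python, so "\/\/" becomes "//".
SIMPLE_ADDRESS = re.compile(r"^(http://)?(\w+)(\.\w+)+$")


def looks_like_address(text: str) -> bool:
    """Return True if text is an optional "http://" followed by
    at least two dot-separated components."""
    return SIMPLE_ADDRESS.match(text.strip()) is not None


if __name__ == "__main__":
    for sample in ("http://www.google.com", "www.vg.no", "tv2.no", "localhost"):
        print(sample, looks_like_address(sample))
```

Widening the pattern (for example to `https?://` and `[\w-]+`) is a judgment call that depends on which inputs you want to accept.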
If you don't want to require that they enter http:// (or https://) then the only thing you can really go on is whether the string contains a "." (I assume you don't need to handle "internal" servers?). You could also validate against known domains and check for invalid characters, but beyond that pretty much anything goes.
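A minimal sketch of that idea in Python, assuming that a missing scheme should default to `http` and that internal host names without a dot should be rejected. The function name `normalize_address` and the `default_scheme` parameter are invented for the example.

```python
from typing import Optional
from urllib.parse import urlsplit


def normalize_address(text: str, default_scheme: str = "http") -> Optional[str]:
    """Return the address with a scheme added if it was missing,
    or None if the host part does not contain a "."."""
    text = text.strip()
    if "://" not in text:
        # Assumption: a bare "www.vg.no" means "http://www.vg.no".
        text = f"{default_scheme}://{text}"
    try:
        parts = urlsplit(text)
        host = parts.hostname
    except ValueError:
        # urlsplit rejects some malformed inputs, e.g. broken IPv6 brackets.
        return None
    if parts.scheme not in ("http", "https"):
        return None
    if not host or "." not in host:
        return None
    return text
```

This does not check for invalid characters or known domains; it only applies the "must contain a dot" rule described above.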
As for actual implementation, a regex would be the way to go if you can stomach it. There are no doubt countless examples of URL validation to be found on Google.
If you're not going to require a valid URI (i.e. you make the scheme optional), then the only real option is to try to connect to it via HTTP.
I think the quickest way to do this would be a regular expression test. This, however, will not prove whether it's a valid URL.
First, try to validate if the input text is a well-formed URL by using regular expressions. If the check is OK, try a DNS lookup to validate if the host is known. Don't forget the special case of localhost or 127.0.0.1. Also take care of hosts specified by their IP address. If these checks are OK, you may want to try an actual connection.
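The host check could be sketched like this in Python, using only the standard library. The names `classify_host` and `host_is_known` are hypothetical, and treating IP literals and localhost as "known" without a lookup is an assumption of this sketch.

```python
import ipaddress
import socket


def classify_host(host: str) -> str:
    """Return "ip" for an IPv4/IPv6 literal, "localhost", or "name"."""
    try:
        ipaddress.ip_address(host)
        return "ip"
    except ValueError:
        pass
    if host.lower() == "localhost":
        return "localhost"
    return "name"


def host_is_known(host: str) -> bool:
    """Return True if the host is an IP literal, localhost,
    or a name that the resolver can look up."""
    if classify_host(host) in ("ip", "localhost"):
        return True  # special cases: nothing to look up
    try:
        socket.getaddrinfo(host, None)  # performs the DNS lookup
        return True
    except (socket.gaierror, UnicodeError):
        # gaierror: name did not resolve; UnicodeError: name cannot be encoded
        return False
```

The lookup blocks until the resolver answers or times out, so in a web form handler it may be worth running it in the background.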
If these checks fail, you can modify the input text and check again. Possible modifications include:
- prepend "http://"
- prepend "www."
- append ".com", ".org", ".net", whatever
- append ":8080", ":8888", whatever
- mix any of the above solutions
- try also prepending "file:///" for a local access
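One way to generate those variations, sketched in Python. The function name `candidate_addresses` and its default suffix and port lists are illustrative only, the input is assumed to be a bare host name without a path, and the "file:///" case is left out.

```python
from itertools import product
from typing import Iterator, Sequence


def candidate_addresses(
    text: str,
    suffixes: Sequence[str] = ("", ".com", ".org", ".net"),
    ports: Sequence[str] = ("", ":8080", ":8888"),
) -> Iterator[str]:
    """Yield variations of text, starting with the least modified one."""
    host = text.strip()
    if "://" in host:
        scheme, host = host.split("://", 1)
        scheme += "://"
    else:
        scheme = "http://"  # prepend the scheme when it is missing
    # Only try the "www." prefix if it is not there already.
    prefixes = ("",) if host.startswith("www.") else ("", "www.")
    seen = set()
    for prefix, suffix, port in product(prefixes, suffixes, ports):
        candidate = f"{scheme}{prefix}{host}{suffix}{port}"
        if candidate not in seen:
            seen.add(candidate)
            yield candidate
```

Each candidate would then go through the same checks as the original input, stopping at the first one that passes.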
See Regexp::Common on CPAN, especially R::C::URI and R::C::URI::http. Even if you can’t use the modules themselves, there are the regular expressions in the source. This is a good start.
Notice that the following two are also valid web addresses. Do you want to allow them?
localhost
208.77.188.166
Can you do a DNS lookup from your application? That would get around any "I'm not sure if it's a real address" doubts.
Apologies for the ensuing expression, but it seems to capture most (if not all) cases:
^(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/|~/|/)?(?#Username:Password)(?:\w+:\w+@)?
(?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevel Domains)
(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?#Port)
(?::[\d]{1,5})?(?#Directories)(?:(?:(?:/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|/)+|\?|#)?
(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d]{2})+=(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)
(?:&(?:[-\w~!$+|.,*:]|%[a-f\d]{2})+=(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)
(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})+)?(?#What not to end in)[^.!,:;?]$
My recommendation would be not to validate strictly at all. Instead, use a regular expression based approach, and if the input doesn't match, give a soft warning: "What you wrote doesn't look like a valid address. Are you sure this is what you want to write?"
Definitely do not follow the idea of trying to connect to the address. That would open you up to all kinds of nasty security problems, including having your web site used for denial-of-service attacks against other web sites. That could land you in legal trouble.
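A small sketch of the soft-warning approach in Python. The pattern `LOOSE_ADDRESS` and the function `soft_warning` are made up for illustration and are deliberately permissive; the input is never rejected, the caller only gets a message to show.

```python
import re
from typing import Optional

# Assumption: optional http(s) scheme, at least two dot-separated
# host components, optional port, optional path without whitespace.
LOOSE_ADDRESS = re.compile(
    r"^(https?://)?[\w-]+(\.[\w-]+)+(:\d+)?(/\S*)?$",
    re.IGNORECASE,
)

WARNING = (
    "What you wrote doesn't look like a valid address. "
    "Are you sure this is what you want to write?"
)


def soft_warning(text: str) -> Optional[str]:
    """Return a warning message to display, or None if the text looks fine.
    The caller still accepts the input if the user confirms it."""
    if LOOSE_ADDRESS.match(text.strip()):
        return None
    return WARNING
```

With this pattern an input such as "localhost" produces the warning but can still be saved once the user confirms.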
Doing a DNS lookup is costly, but viable if you deem it's worth the cost.