ansaurus

Question

How to tell if a string is a web address?

Answer 1

+2 A:

The exact means of implementation will depend on the language you're using.

Richard Ev 2009-02-09 08:57:19

I think we all know regex can be used for pattern matching; I think he's asking for a heuristic that allows human-readable 'URLs' to be accepted, e.g. slashdot.org instead of http://slashdot.org

roe 2009-02-09 08:59:50

Aren't things like slashdot.org simply a subset of the strings that the regex should accept?

Richard Ev 2009-02-09 09:22:38

Answer 2

+1 A:

The easiest way to be reasonably sure is to use a regular expression that checks for at least two components in the domain name. That way you can handle most bad cases. It should look something like this:

/^(http:\/\/)?(\w+)(\.\w+)+$/
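A minimal sketch of how this expression behaves when tried in Python. The function name `looks_like_web_address` and the sample strings are assumptions made for illustration. It also shows the limits raised in the comments: ports, hyphens and single-label hosts are all rejected.

```python
import re

# The expression from the answer, written as a Python raw string.
SIMPLE_URL = re.compile(r"^(http://)?(\w+)(\.\w+)+$")


def looks_like_web_address(text: str) -> bool:
    """True if text is an optional http:// prefix plus at least two dot-separated parts."""
    return SIMPLE_URL.match(text) is not None


if __name__ == "__main__":
    samples = [
        "slashdot.org",         # accepted
        "http://slashdot.org",  # accepted
        "wonky donkey",         # rejected: space, no dot
        "localhost",            # rejected: only one component
        "slashdot.org:8080",    # rejected: the pattern knows nothing about ports
        "my-site.org",          # rejected: \w does not match a hyphen
    ]
    for sample in samples:
        print(f"{sample!r}: {looks_like_web_address(sample)}")
```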

Ola Bini 2009-02-09 08:58:26

"wonky donkey" passes that regex, and isn't a valid address (containing a space and all)

Rowland Shaw 2009-02-09 09:01:15

Nope, it doesn't. The backslash before the dot is very important, and you don't have any dots in it. That said, you're correct: it shouldn't be .*?. It's probably better to use something like [[:alpha:]]*

Ola Bini 2009-02-09 09:10:06

What if there is a port number? What if there is a query in the URL? It is much better to use a tested regexp than to reinvent the wheel yourself.

zoul 2009-02-09 09:18:00

Absolutely. It all depends on what level of certainty you want. That's why I didn't say my expression was the solution, just an example of how a simple version could look.

Ola Bini 2009-02-09 09:33:12

Answer 3

A:

If you don't want to require that they enter http:// (or https://) then the only thing you can really go on is whether the string contains a "." (I assume you don't need to handle "internal" servers?). You could also validate against known domains and check for invalid characters, but beyond that pretty much anything goes.

As for the actual implementation, regex would be the way to go if you can stomach it. There are no doubt countless examples of URL validation to be found if you Google.

Steven Robbins 2009-02-09 08:59:42

Answer 4

A:

If you're not going to require it to be a valid URI (i.e. you make the scheme optional), then the only real option is to try to connect to it via HTTP.

Rowland Shaw 2009-02-09 08:59:44

Answer 5

A:

I think the quickest way to do this would be a regular expression test. This, however, will not prove whether it's a valid URL.

FailBoy 2009-02-09 09:06:06

Answer 6

+3 A:

First, try to validate if the input text is a well-formed URL by using regular expressions. If the check is OK, try a DNS lookup to validate if the host is known. Don't forget the special case of localhost or 127.0.0.1. Also take care of hosts specified by their IP address. If these checks are OK, you may want to try an actual connection.

If these checks fail, you can modify the input text and check again. Possible modifications include:

prepend http://
prepend www.
append .com, .org, .net, whatever
append :8080, :8888, whatever
mix any of the above solutions
try also prepending file:/// for a local access
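The retry idea above could be sketched roughly as follows, assuming Python. The names `candidate_addresses` and `first_accepted`, the short suffix list, and the caller-supplied `is_acceptable` check (a regex test, a DNS lookup, or whatever you settle on) are all hypothetical. Port variants and file:/// are deliberately left out of this sketch.

```python
from typing import Callable, Iterator, Optional


def candidate_addresses(text: str) -> Iterator[str]:
    """Yield the input, then a few modified variants, most conservative first."""
    text = text.strip()
    hosts = [text]
    if "://" not in text:
        if not text.startswith("www."):
            hosts.append("www." + text)
        if "." not in text:
            # Hypothetical, deliberately short suffix list.
            for suffix in (".com", ".org", ".net"):
                hosts.append(text + suffix)
                hosts.append("www." + text + suffix)
    seen = set()
    for host in hosts:
        # Prepend http:// only when no scheme was given.
        url = host if "://" in host else "http://" + host
        if url not in seen:
            seen.add(url)
            yield url


def first_accepted(text: str, is_acceptable: Callable[[str], bool]) -> Optional[str]:
    """Return the first candidate that passes the caller's check, or None."""
    for url in candidate_addresses(text):
        if is_acceptable(url):
            return url
    return None
```

Keeping the check as a parameter means the same loop works whether you stop at a syntax test or go on to a DNS lookup.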

mouviciel 2009-02-09 09:07:37

+1 for the intent of maximizing usability. Also, some browsers etc will give a Google search if you just enter "Barcelona" in the address bar, and that is not always a bad thing (though of course it may be a bad thing in the OP's context - which he should have explained better).

Daniel Daranas 2009-02-09 09:14:36

I would feel uneasy about connecting to the given URL because of the security implications, especially if the poster is not 100% sure what he is doing.

zoul 2009-02-09 09:21:10

I think it could be safe if 'connecting' means only checking success or failure without recursively downloading every inline image, javascript, CSS, ... It may be performed for instance with text-based lynx.

mouviciel 2009-02-09 09:32:35

Yes, but there’s always the possibility that somebody passes something like www.site.com/delete.php?all and hides his IP from the victim, or people could pass file:///usr/lib/foo to check whether a file exists on your system, etc.

zoul 2009-02-09 09:37:55

To put it another way: I would not cross the line of “100% safe” for something as minor as a URL check. There are also additional problems: what if the target site is down now and will be back in ten minutes?

zoul 2009-02-09 09:41:30

These are very interesting points, together with the ones mentioned by Ola Bini (e.g., the denial-of-service argument). The regex solution does not suffer from these drawbacks, but it ensures only that a text is a well-formed URL, not a valid web address. No solution is ideal. The answer may lie somewhere in the middle.

mouviciel 2009-02-09 09:52:51

You could do the DNS check without the connection check. This should stop issues with security.

seanyboy 2009-02-09 17:22:56

A DNS check ensures that the host is known, not that it runs an HTTP server. And in the end, verifying whether a text is a valid web address suggests that the user will want to use that text to connect to the address.

mouviciel 2009-02-09 18:45:38

Answer 7

A:

See Regexp::Common on CPAN, especially R::C::URI and R::C::URI::http. Even if you can’t use the modules themselves, there are the regular expressions in the source. This is a good start.

zoul 2009-02-09 09:09:02

Answer 8

+3 A:

Notice that the following two are also valid web addresses. Do you want to allow them?

localhost
208.77.188.166
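A minimal sketch, assuming Python, of how these two special cases could be recognised before any dot-based check is applied. The function name `classify_host` and its return labels are hypothetical.

```python
import ipaddress


def classify_host(host: str) -> str:
    """Return 'ip', 'localhost' or 'name' for a bare host string."""
    try:
        # Accepts both IPv4 and IPv6 literals; raises ValueError otherwise.
        ipaddress.ip_address(host)
        return "ip"
    except ValueError:
        pass
    if host.lower() == "localhost":
        return "localhost"
    return "name"
```

Whether 'ip' and 'localhost' should then be accepted is a policy decision for the application.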

Konrad Rudolph 2009-02-09 09:09:11

Answer 9

+1 A:

Can you do a DNS lookup from your application? That will get around any "I'm not sure if it's a real address" doubts.
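A DNS lookup might look roughly like this in Python, using the standard library's `socket.getaddrinfo`. The function name `host_resolves` and the injectable `resolver` parameter (there so the function can be exercised without network access) are assumptions of this sketch.

```python
import socket
from typing import Callable


def host_resolves(host: str, resolver: Callable = socket.getaddrinfo) -> bool:
    """True if the resolver returns without error for host."""
    try:
        resolver(host, None)  # port is irrelevant for a pure name lookup
        return True
    except (socket.gaierror, UnicodeError):
        # gaierror: lookup failed; UnicodeError: host cannot be encoded as a DNS name.
        return False
```

Note that a successful lookup only shows that the name is known, not that anything is serving HTTP there.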

Greg B 2009-02-09 09:09:50

Answer 10

+1 A:

You could use the validation feature of Zend_Uri
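For readers not on PHP, a comparable library-based check can be sketched with Python's standard `urllib.parse`. This is not Zend_Uri, just an analogous approach. The function name `parse_web_address` and the choice to default to http:// when no scheme is given are assumptions.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit


def parse_web_address(text: str) -> Optional[Tuple[str, str, Optional[int]]]:
    """Return (scheme, host, port) for an http(s) address, or None."""
    # Assumption: a missing scheme means http.
    candidate = text if "://" in text else "http://" + text
    try:
        parts = urlsplit(candidate)
        port = parts.port  # raises ValueError for an invalid port
    except ValueError:
        return None
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return None
    return parts.scheme, parts.hostname, port
```

`urlsplit` is a parser rather than a validator, so it is lenient about the host itself. Combine it with a stricter host check if you need one.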

Jack Sleight 2009-02-09 09:13:04

Answer 11

+5 A:

Apologies for the ensuing expression, but it seems to capture most (if not all) cases:

^(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/|~/|/)?(?#Username:Password)(?:\w+:\w+@)?
(?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevel Domains)
(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?#Port)
(?::[\d]{1,5})?(?#Directories)(?:(?:(?:/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|/)+|\?|#)?
(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d]{2})+=(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)
(?:&(?:[-\w~!$+|.,*:]|%[a-f\d]{2})+=(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)
(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})+)?(?#What not to end in)[^.!,:;?]$

Learning 2009-02-09 09:14:05

In your section #TopLevelDomains, you can add whatever seems appropriate to your needs. I think of local LANs with .corp or .local domains.

mouviciel 2009-02-09 09:19:54

+1, but where does this come from? Is there a test suite? There are already too many sites with botched URL validation where somebody blindly copied a regexp from a web forum…

zoul 2009-02-09 09:24:35

Don't forget the oft used .museum!

Simucal 2009-02-09 10:12:39

+1 for giving the expression and not only the typical "how about using regex" or similar generic answers. This is what's making StackOverflow *really* helpful

Kai 2009-02-09 10:24:03

+1 Well deserved. :)

Aaron Digulla 2009-02-09 14:10:48

Answer 12

+3 A:

My recommendation would be not to validate strictly at all. Instead, use a regular-expression-based approach, and if the input doesn't match, give a soft warning: "What you wrote doesn't look like a valid address. Are you sure this is what you want to write?"

Definitely do not follow the idea of trying to connect to the address. That would open you up to all kinds of nasty security problems, including having your web site used for denial-of-service attacks against other web sites. That could land you in legal trouble.

Doing a DNS lookup is costly, but viable if you deem it's worth the cost.

Ola Bini 2009-02-09 09:35:58
