views: 112 · answers: 6

I am trying to determine whether there is a way to check the availability of a potentially large list of URLs (> 1,000,000) without having to send a GET request to every single one.

Is it safe to assume that if http://www.example.com is inaccessible (as in unable to connect to the server, or the DNS request for the domain fails), or I get a 4xx or 5xx response, then anything from that domain will also be inaccessible (e.g. http://www.example.com/some/path/to/a/resource/named/whatever.jpg)? Would a 302 response (say, for whatever.jpg) be enough to invalidate the first assumption? I imagine subdomains should be considered distinct, since http://subdomain.example.com and http://www.example.com may not resolve to the same IP?

I seem to be able to come up with a counterexample for every shortcut I think of. Should I just bite the bullet and send GET requests to every URL?

+7  A: 

Unfortunately, no: you cannot infer anything from 4xx, 5xx, or any other codes.

Those codes are for individual pages, not for the server. It's quite possible that one page is down and another is up, or one has a 500 server-side error and another doesn't.

What you can do is use HEAD instead of GET. That retrieves the HTTP headers for the page but not the page content. This saves time server-side (because the server doesn't have to render the page) and on your end (because you don't have to buffer and then discard the content).

Also I suggest you use keep-alive to accelerate responses from the same server. Many HTTP client libraries will do this for you.
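For example, here is a minimal sketch in Python using the requests library, whose Session object pools connections (keep-alive) for you; the URL list and the timeout value are placeholders:

    import requests

    urls = [
        "http://www.example.com/some/path/whatever.jpg",
        "http://www.example.com/another/page.html",
    ]

    # A Session reuses TCP connections to the same host (keep-alive),
    # so you don't pay for a new handshake on every URL.
    with requests.Session() as session:
        for url in urls:
            try:
                # HEAD fetches only the response headers, not the body.
                response = session.head(url, timeout=5)
                print(url, response.status_code)
            except requests.RequestException as exc:
                print(url, "failed:", exc)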

Jason Cohen
+1  A: 

The only assumption you should make about the availability of a URL is that "getting a URL can and will fail".

It's not safe to assume that a subdomain request will fail just because a request to the parent domain does, not least because your network connection can go down, come back up, or generally misbehave between your two requests. It's also possible for the domains themselves to change between requests.

Even ignoring all internet connection issues, you are still dealing with a live web site that can and will change constantly. What is true now might not be true in 5 minutes, when they decide to alter their page structure or change the way they display a particular page. Your best bet is to assume any GET will fail.

This may seem like an extreme viewpoint, but these events will happen, and how you handle them will determine the robustness of your program.
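A sketch of that mindset, assuming Python and the requests library (the function name and timeout are illustrative): wrap every fetch so that a single bad URL can never abort the whole run.

    import requests

    def check(session, url, timeout=5):
        """Return a (url, outcome) pair instead of raising."""
        try:
            response = session.head(url, timeout=timeout)
            return url, response.status_code
        except requests.Timeout:
            return url, "timeout"
        except requests.ConnectionError:
            return url, "connection error"
        except requests.RequestException as exc:
            return url, "error: %s" % exc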

JaredPar
A: 

If the connection to the server actually fails, then there's no reason to check URLs on that server. Otherwise, you can't assume anything.

Boden
+3  A: 

A failed DNS lookup for a host (e.g. www.example.com) should be enough to invalidate all URLs for that host. Subdomains or other hosts would have to be checked separately though.

A 4xx code might tell you that a particular page isn't available, but you couldn't make any assumptions about other pages from that.

A 5xx code really won't tell you anything. For example, it could be that the page is there, but the server is just too busy at the moment. If you try it again later it might work fine.
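Put together, that suggests one DNS lookup per host plus a per-page reading of each status code. A rough sketch in Python (the cache and the outcome labels are assumptions about how you might organize this):

    import socket
    from urllib.parse import urlsplit

    _dns_cache = {}

    def host_resolves(url):
        """Resolve each hostname once; a DNS failure rules out every URL on it."""
        hostname = urlsplit(url).hostname
        if hostname not in _dns_cache:
            try:
                socket.getaddrinfo(hostname, 80)
                _dns_cache[hostname] = True
            except socket.gaierror:
                _dns_cache[hostname] = False
        return _dns_cache[hostname]

    def classify(status_code):
        """Interpret a response code for one page only."""
        if 400 <= status_code < 500:
            return "page unavailable"  # says nothing about other pages
        if 500 <= status_code < 600:
            return "retry later"       # may be transient, e.g. server overload
        return "reachable"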

Eric Petroelje
+1  A: 

First, don't assume anything based on a single page failing. I have seen many cases where IIS will continue to serve static content but be unable to serve any dynamic content.

You have to treat each host name as unique; you cannot assume subdomain.example.com and example.com point to the same IP, and even if they do, there is no guarantee they are the same site. IIS, again, has host headers, which allow you to run multiple sites using a single IP address.
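In other words, group URLs strictly by host name, never by resolved IP. A short sketch in Python (the sample URLs are placeholders):

    from collections import defaultdict
    from urllib.parse import urlsplit

    urls = [
        "http://www.example.com/a.jpg",
        "http://subdomain.example.com/b.jpg",
    ]

    by_host = defaultdict(list)
    for url in urls:
        by_host[urlsplit(url).hostname].append(url)

    # www.example.com and subdomain.example.com stay in separate buckets,
    # even if they happen to resolve to the same IP address.
    for host, host_urls in by_host.items():
        print(host, len(host_urls))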

JoshBerke
A: 

In addition to what everyone else is saying, use HEAD requests instead of GET requests. They function the same, but the response doesn't contain the message body, so you save everyone some bandwidth.

Pesto