views: 286
answers: 8

My project requires me to validate a large number of web URLs. These URLs have been captured by a very unreliable process which I do not control. All of the URLs have already been regex-validated and are known to be well-formed. I also know that they all have valid TLDs.

I want to be able to filter these URLs quickly in order to determine which of them are incorrect. At this point I do not care what content is on the pages; I'd just like to know as quickly as possible which of the pages are inaccessible (e.g. produce a 404 error).

Given that there are a lot of these, I do not want to download the entire page, just the HTTP headers, and then make a good guess from their content as to whether the page is likely to exist.

Can it be done?

+3  A: 

Just send HTTP HEAD requests as shown in the accepted answer to this question.

Bill the Lizard
+6  A: 

I'm assuming you want to do it in Python based on your tags. In that case, I'd use httplib. Optionally, group the URLs by host so you can make multiple requests over one connection for URLs that share a host (a sketch of that grouping follows the snippet below). Use the HEAD request.

import httplib

# A HEAD request returns only the status line and headers, not the body.
conn = httplib.HTTPConnection("example.com")
conn.request("HEAD", "/index.html")
resp = conn.getresponse()
print resp.status
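
A minimal sketch of the per-host grouping idea, assuming Python 2's httplib and urlparse; the urls list and variable names are placeholders, and reusing one connection per host assumes the server keeps the connection alive:

import httplib
import urlparse
from collections import defaultdict

urls = ["http://example.com/a", "http://example.com/b"]  # placeholder list

# Group paths by host so each host gets a single, reused connection.
by_host = defaultdict(list)
for url in urls:
    parts = urlparse.urlsplit(url)
    by_host[parts.netloc].append((url, parts.path or "/"))

for host, entries in by_host.items():
    conn = httplib.HTTPConnection(host)
    for url, path in entries:
        conn.request("HEAD", path)
        resp = conn.getresponse()
        resp.read()                 # drain the (empty) body so the connection can be reused
        print url, resp.status
    conn.close()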
Jeff
+6  A: 

To really make this fast you might also use eventlet, which uses non-blocking I/O to speed things up.

You can issue a HEAD request like this:

from eventlet import httpc

try:
    res = httpc.head(url)
except httpc.NotFound:
    pass  # handle 404 - the URL is bad

You can then put this into a simple script like the example script linked here. With a coroutine pool you should get plenty of concurrency (see the sketch below).
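
For illustration, a minimal sketch of driving the httpc.head() call above from a coroutine pool; it assumes eventlet's GreenPool API (and the older httpc module shown above), and the urls list and pool size of 200 are placeholders:

import eventlet
from eventlet import httpc

urls = ["http://example.com/", "http://example.com/missing"]  # placeholder list

def check(url):
    try:
        httpc.head(url)
        return url, True
    except httpc.NotFound:
        return url, False

pool = eventlet.GreenPool(200)           # up to 200 requests in flight
for url, ok in pool.imap(check, urls):
    print url, ("OK" if ok else "404")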

MrTopf
Thanks MrTopf - hey, remember me? We met at the Plone conf, or was it PyCon, all those years ago? Thanks.
Salim Fadhley
Yes, I remember you and it was probably EuroPython back in Gothenburg. I think we also met in London sometime. Hope it works for you :-)
MrTopf
So is this actually used by 2ndlife? Are you an employee of that company now? Yes - it was Gothenburg! :-)
Salim Fadhley
Yes, it is used in Second Life and was developed further there. Donovan left Linden Lab in the meantime, though, and is working on it on his own. And no, I am not an LL employee, but I work with them on standardizing virtual world protocols.
MrTopf
+1  A: 

Instead of sending an HTTP GET request for each URL, you can try sending an HTTP HEAD request. HEAD requests are described in this document.
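
To make the difference concrete, here is a minimal sketch of what a HEAD exchange looks like at the protocol level, sent over a raw socket; the host and path are placeholders, and real code would want more robust error handling:

import socket

def head_status(host, path="/", port=80):
    # Send a bare HEAD request and read back only the status line,
    # e.g. "HTTP/1.1 200 OK" or "HTTP/1.1 404 Not Found".
    s = socket.create_connection((host, port), timeout=10)
    s.sendall("HEAD %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n" % (path, host))
    status_line = s.makefile().readline()
    s.close()
    return int(status_line.split()[1])

print head_status("example.com", "/index.html")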

David Locke
+4  A: 

Using httplib and urlparse:

def checkURL(url):
    import httplib
    import urlparse

    protocol, host, path, query, fragment = urlparse.urlsplit(url)

    if protocol == "http":
        conntype = httplib.HTTPConnection
    elif protocol == "https":
        conntype = httplib.HTTPSConnection
    else:
        raise ValueError("unsupported protocol: " + protocol)

    conn = conntype(host)
    conn.request("HEAD", path or "/")   # an empty path would make a malformed request
    resp = conn.getresponse()
    conn.close()

    # Treat anything below 400 (success and redirects) as "exists".
    return resp.status < 400
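
Hypothetical usage, filtering a placeholder list of URLs down to those that appear to exist:

urls = ["http://example.com/", "http://example.com/missing"]  # placeholder list
good = [u for u in urls if checkURL(u)]
print good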
Ben Blank
A: 

This is a trivial case for Twisted. There are a couple of concurrency tools you can use to slow it down (see the sketch below); otherwise, it'll pretty much do it all at once.

Twisted is definitely my favorite thing about Python. :)
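
A minimal sketch of that approach, assuming twisted.web.client.Agent for the HEAD requests and a DeferredSemaphore to cap concurrency; the URL list and the limit of 50 are placeholders:

from twisted.internet import reactor, defer
from twisted.web.client import Agent

urls = ["http://example.com/", "http://example.com/missing"]  # placeholder list

agent = Agent(reactor)
sem = defer.DeferredSemaphore(50)        # at most 50 requests in flight

def head(url):
    d = agent.request("HEAD", url)
    d.addCallback(lambda resp: (url, resp.code))
    return d

def report(results):
    # DeferredList yields (success, value) pairs; failures carry a Failure object.
    for ok, value in results:
        if ok:
            url, code = value
            print url, code
        else:
            print "request failed:", value.getErrorMessage()
    reactor.stop()

dl = defer.DeferredList([sem.run(head, u) for u in urls], consumeErrors=True)
dl.addCallback(report)
reactor.run()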

Dustin
A: 

This might help you get started. The file sitelist.txt contains a list of URIs. You might have to install httplib2; it's highly recommended. I put a sleep between each request so that, if you have many URIs on the same site, your client will not be blacklisted for abusing resources.

import httplib2
import time

h = httplib2.Http(".cache")

f = open("sitelist.txt", "r")
urllist = f.readlines()
f.close()

for url in urllist:
    # wait 10 seconds before the next request - be nice with the site
    time.sleep(10)
    urlrequest = url.strip()
    try:
        resp, content = h.request(urlrequest, "HEAD")
        if resp['status'] == "200":
            print urlrequest, "200 - Good"
        else:
            print urlrequest, resp['status'], "- you might want to double check"
    except Exception, e:
        # report unreachable hosts instead of silently skipping them
        print urlrequest, "failed:", e
karlcow
A: 

A Python program which does similar work (for a list of URLs stored at del.icio.us) is disastrous.

And, yes, it uses HEAD and not GET, but do note that some (non-standard) servers send different results for HEAD and for GET: the Python environment Zope is a typical culprit. (Also, in some cases, network problems, for instance tunnels plus broken firewalls that block ICMP, prevent big packets from getting through, so HEAD works and GET does not.)
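
As a quick way to spot such servers, one could compare the status codes the two methods return; a minimal sketch with httplib, where the host and path are placeholders:

import httplib

def compare_head_get(host, path="/"):
    codes = {}
    for method in ("HEAD", "GET"):
        conn = httplib.HTTPConnection(host)
        conn.request(method, path)
        codes[method] = conn.getresponse().status
        conn.close()
    return codes

# A mismatch between the two codes points at a server that treats HEAD specially.
print compare_head_get("example.com", "/index.html")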

bortzmeyer