views: 286
answers: 8

My project requires me to validate a large number of web URLs. These URLs have been captured by a very unreliable process which I do not control. All of the URLs have already been regex-validated and are known to be well-formed. I also know that they all have valid TLDs.

I want to be able to filter these URLs quickly in order to determine which of them are incorrect. At this point I do not care what content is on the pages; I'd just like to know as quickly as possible which of the pages are inaccessible (e.g. produce a 404 error).

Given that there are a lot of these, I do not want to download the entire page, just the HTTP headers, and then make a good guess from their content as to whether the page is likely to exist.

Can it be done?

+3  A: 

Just send HTTP HEAD requests as shown in the accepted answer to this question.

Bill the Lizard
+6  A: 

I'm assuming you want to do it in Python based on your tags. In that case, I'd use httplib. Optionally, group the URLs by host so you can make multiple requests over one connection for URLs that share a host (a sketch of that grouping follows the snippet below). Use the HEAD request.

import httplib

# A HEAD request returns only the status line and headers, not the body.
conn = httplib.HTTPConnection("example.com")
conn.request("HEAD", "/index.html")
resp = conn.getresponse()
print resp.status
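
A minimal sketch of the per-host grouping idea, assuming Python 2's httplib and urlparse; the urls list and variable names are placeholders, and reusing one connection per host assumes the server keeps the connection alive:

import httplib
import urlparse
from collections import defaultdict

urls = ["http://example.com/a", "http://example.com/b"]  # placeholder list

# Group paths by host so each host gets a single, reused connection.
by_host = defaultdict(list)
for url in urls:
    parts = urlparse.urlsplit(url)
    by_host[parts.netloc].append((url, parts.path or "/"))

for host, entries in by_host.items():
    conn = httplib.HTTPConnection(host)
    for url, path in entries:
        conn.request("HEAD", path)
        resp = conn.getresponse()
        resp.read()                 # drain the (empty) body so the connection can be reused
        print url, resp.status
    conn.close()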
Jeff
+6  A: 

To really make this fast you might also use eventlet, which uses non-blocking I/O to speed things up.

You can issue a HEAD request like this:

from eventlet import httpc

try:
    res = httpc.head(url)
except httpc.NotFound:
    pass  # handle 404 - the URL is bad

You can then put this into a simple script like the example script linked here. With a coroutine pool you should get plenty of concurrency (see the sketch below).
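
For illustration, a minimal sketch of driving the httpc.head() call above from a coroutine pool; it assumes eventlet's GreenPool API (and the older httpc module shown above), and the urls list and pool size of 200 are placeholders:

import eventlet
from eventlet import httpc

urls = ["http://example.com/", "http://example.com/missing"]  # placeholder list

def check(url):
    try:
        httpc.head(url)
        return url, True
    except httpc.NotFound:
        return url, False

pool = eventlet.GreenPool(200)           # up to 200 requests in flight
for url, ok in pool.imap(check, urls):
    print url, ("OK" if ok else "404")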

MrTopf
Thanks MrTopf - hey, remember me? We met at the Plone conf, or was it PyCon, all those years ago? Thanks.
Salim Fadhley
Yes, I remember you and it was probably EuroPython back in Gothenburg. I think we also met in London sometime. Hope it works for you :-)
MrTopf
So is this actually used by 2ndlife? Are you an employee of that company now? Yes - it was Gothenburg! :-)
Salim Fadhley
Yes, it is used in Second Life and was developed further there. Donovan left Linden Lab in the meantime, though, and is working on it on his own. And no, I am not an LL employee, but I work with them on standardizing virtual world protocols.
MrTopf
+1  A: 

Instead of sending an HTTP GET request for each URL, you can try sending an HTTP HEAD request. HEAD requests are described in this document.
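
To make the difference concrete, here is a minimal sketch of what a HEAD exchange looks like at the protocol level, sent over a raw socket; the host and path are placeholders, and real code would want more robust error handling:

import socket

def head_status(host, path="/", port=80):
    # Send a bare HEAD request and read back only the status line,
    # e.g. "HTTP/1.1 200 OK" or "HTTP/1.1 404 Not Found".
    s = socket.create_connection((host, port), timeout=10)
    s.sendall("HEAD %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n" % (path, host))
    status_line = s.makefile().readline()
    s.close()
    return int(status_line.split()[1])

print head_status("example.com", "/index.html")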

David Locke
+4  A: 

Using httplib and urlparse:

def checkURL(url):
    import httplib
    import urlparse

    protocol, host, path, query, fragment = urlparse.urlsplit(url)

    if protocol == "http":
        conntype = httplib.HTTPConnection
    elif protocol == "https":
        conntype = httplib.HTTPSConnection
    else:
        raise ValueError("unsupported protocol: " + protocol)

    conn = conntype(host)
    conn.request("HEAD", path or "/")   # an empty path would make a malformed request
    resp = conn.getresponse()
    conn.close()

    # Treat anything below 400 (success and redirects) as "exists".
    return resp.status < 400
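
Hypothetical usage, filtering a placeholder list of URLs down to those that appear to exist:

urls = ["http://example.com/", "http://example.com/missing"]  # placeholder list
good = [u for u in urls if checkURL(u)]
print good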
Ben Blank
A: 

This is a trivial case for Twisted. There are a couple of concurrency tools you can use to slow it down (see the sketch below); otherwise, it'll pretty much do it all at once.

Twisted is definitely my favorite thing about Python. :)
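
A minimal sketch of that approach, assuming twisted.web.client.Agent for the HEAD requests and a DeferredSemaphore to cap concurrency; the URL list and the limit of 50 are placeholders:

from twisted.internet import reactor, defer
from twisted.web.client import Agent

urls = ["http://example.com/", "http://example.com/missing"]  # placeholder list

agent = Agent(reactor)
sem = defer.DeferredSemaphore(50)        # at most 50 requests in flight

def head(url):
    d = agent.request("HEAD", url)
    d.addCallback(lambda resp: (url, resp.code))
    return d

def report(results):
    # DeferredList yields (success, value) pairs; failures carry a Failure object.
    for ok, value in results:
        if ok:
            url, code = value
            print url, code
        else:
            print "request failed:", value.getErrorMessage()
    reactor.stop()

dl = defer.DeferredList([sem.run(head, u) for u in urls], consumeErrors=True)
dl.addCallback(report)
reactor.run()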

Dustin
A: 

This might help you get started. The file sitelist.txt contains a list of URIs. You might have to install httplib2; it's highly recommended. I put a sleep between each request so that, if you have many URIs on the same site, your client will not be blacklisted for abusing resources.

import httplib2
import time

h = httplib2.Http(".cache")

f = open("sitelist.txt", "r")
urllist = f.readlines()
f.close()

for url in urllist:
    # wait 10 seconds before the next request - be nice with the site
    time.sleep(10)
    urlrequest = url.strip()
    try:
        resp, content = h.request(urlrequest, "HEAD")
        if resp['status'] == "200":
            print urlrequest, "200 - Good"
        else:
            print urlrequest, resp['status'], "- you might want to double check"
    except Exception, e:
        # report unreachable hosts instead of silently skipping them
        print urlrequest, "failed:", e
karlcow
A: 

A Python program which does similar work (for a list of URLs stored at del.icio.us) is disastrous.

And, yes, it uses HEAD and not GET, but do note that some (non-standard) servers send different results for HEAD and for GET: the Python environment Zope is a typical culprit. (Also, in some cases, network problems, for instance tunnels plus broken firewalls that block ICMP, prevent big packets from getting through, so HEAD works and GET does not.)
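
As a quick way to spot such servers, one could compare the status codes the two methods return; a minimal sketch with httplib, where the host and path are placeholders:

import httplib

def compare_head_get(host, path="/"):
    codes = {}
    for method in ("HEAD", "GET"):
        conn = httplib.HTTPConnection(host)
        conn.request(method, path)
        codes[method] = conn.getresponse().status
        conn.close()
    return codes

# A mismatch between the two codes points at a server that treats HEAD specially.
print compare_head_get("example.com", "/index.html")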

bortzmeyer