I have a list of URLs (1000+) which have been stored for over a year now. I want to run through and verify them all to see if they still exist. What is the best/quickest way to check them all and return a list of the ones which do not return a site?
A:
This is kind of slow, but you can use something like this to check whether a URL is alive:

import urllib2

def is_alive(url):
    try:
        urllib2.urlopen(url)
        return True   # URL exists
    except ValueError:
        return False  # URL is not well formatted
    except urllib2.URLError:
        return False  # URL doesn't seem to be alive
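For the full list in the question, here is a minimal sketch of how a check like the one above could be driven over all the URLs to collect the dead ones (the urls list is only a placeholder for your stored URLs, and the timeout argument assumes Python 2.6+, where urllib2.urlopen accepts one):

import urllib2

def is_dead(url, timeout=5):
    # Any failure (bad URL, DNS error, HTTP error, timeout) counts as dead.
    try:
        urllib2.urlopen(url, timeout=timeout)
        return False
    except Exception:
        return True

urls = ['http://stackoverflow.com/', 'http://example.com/no-such-page']  # placeholder list
dead_urls = [url for url in urls if is_dead(url)]
print dead_urls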
Quicker than urllib2, you can use httplib:

import httplib
import socket

try:
    conn = httplib.HTTPConnection('google.com')
    conn.connect()
except (httplib.HTTPException, socket.error):
    print "not connected"
You can also do a DNS lookup (though it is not a reliable way to tell whether a website no longer exists):

import socket

try:
    socket.gethostbyname('www.google.com')
except socket.gaierror:
    print "does not exist"
singularity
2010-10-28 09:32:37
Is using socket faster than urllib2? I tried urllib2, but it took forever, so I ended up stopping it.
John
2010-10-28 15:31:42
I just edited my answer and added a quicker solution using httplib. As for using ping (the other answer) or a DNS lookup (the third solution in my answer): neither is very reliable, because many websites are still registered in DNS even though they no longer exist, and ping is just a DNS lookup plus an ICMP ping, which still doesn't tell you whether the website (the HTTP server) is running and accepting connections.
singularity
2010-10-28 17:07:13
A:
Check this:
And then:
import ping, socket

try:
    # ping.do_one expects a hostname, not a full URL
    result = ping.do_one('stackoverflow.com', timeout=2)
except socket.error as e:
    # host cannot be reached
    print "Error:", e
Klark
2010-10-28 09:34:31
I have over 1000 URLs to check. Will this be faster than using the urllib2 answer?
John
2010-10-28 15:30:47
I think it will. Test it. It also depends on the network. In any case, it will take some time for the server to respond (you can set a timeout in my solution, as you can see in the code).
Klark
2010-10-28 15:42:40