I'm writing a spider that needs a load_url function that does the following for me (roughly sketched after this list):
- Retry the URL if there is a temporary error, without leaking exceptions.
- Not leak memory or file handles.
- Use HTTP keep-alive for speed (optional).
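Something along these lines is what I mean. This is just a minimal sketch using plain urllib2; LoadUrlError and the retry/delay numbers are placeholders I made up, and keep-alive isn't handled at all:

```python
import socket
import time
import urllib2
from contextlib import closing


class LoadUrlError(Exception):
    """Placeholder error raised once all retries are exhausted."""


def load_url(url, retries=3, delay=2, timeout=30):
    """Fetch url, retrying transient errors, without leaking exceptions or handles."""
    last_error = None
    for attempt in range(retries):
        try:
            # closing() guarantees the response (and its socket) is released
            with closing(urllib2.urlopen(url, timeout=timeout)) as response:
                return response.read()
        except (urllib2.URLError, socket.error) as exc:
            last_error = exc
            if attempt + 1 < retries:
                time.sleep(delay)
    raise LoadUrlError("gave up on %s after %d tries: %s" % (url, retries, last_error))
```

Calling it would look like `html = load_url('http://example.com/')`, so the caller only ever has to deal with LoadUrlError.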
URLGrabber looks great on the surface, but it has problems. First I hit a "too many open files" error, which I was able to work around by turning keep-alive off. Then the function started raising socket.error: [Errno 104] Connection reset by peer. That error should have been caught, and possibly a URLGrabError raised in its place.
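For the moment I'm papering over it with a wrapper that catches socket.error alongside urlgrabber's own exception and retries. This assumes the module-level urlread and urlgrabber.grabber.URLGrabError, and that keepalive=0 is accepted as a keyword option; safe_urlread is just a name I made up, and on the final failure it re-raises rather than converting the error:

```python
import socket

from urlgrabber import urlread
from urlgrabber.grabber import URLGrabError


def safe_urlread(url, retries=3):
    """Retry urlread, catching socket.error instead of letting it escape."""
    last_error = None
    for attempt in range(retries):
        try:
            # keep-alive off, per the too-many-open-files workaround above
            return urlread(url, keepalive=0)
        except (URLGrabError, socket.error) as exc:
            last_error = exc
    raise last_error
```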
I'm running Python 2.6.4.
Does anyone know of a way to fix these issues with URLGrabber, or of another library that will accomplish what I need?