I'm writing a spider that needs a load_url function that performs the following for me:

  1. Retry the URL if there is a temporary error, without leaking exceptions.
  2. Not leak memory or file handles
  3. Use HTTP-KeepAlive for speed (optional)

URLGrabber looks great on the surface, but it has problems. First I hit a "too many open files" error, but I was able to work around it by turning keep-alive off. Then the function started raising socket.error: [Errno 104] Connection reset by peer. That error should have been caught, and perhaps a URLGrabberError raised in its place.
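
Roughly, the kind of wrapper being described would look something like the sketch below. This is untested; urlgrabber's module-level urlread function and its keepalive option are assumed from its documentation, and the retry count and delay are arbitrary:

import socket
import time
import urlgrabber
from urlgrabber.grabber import URLGrabError

def load_url(url, retries=3, delay=30):
    # Sketch only: retry transient failures and catch the socket.error
    # that urlgrabber otherwise leaks. keepalive=0 is the workaround
    # mentioned above (option name assumed from urlgrabber's docs).
    for attempt in range(retries):
        try:
            return urlgrabber.urlread(url, keepalive=0)
        except (URLGrabError, socket.error):
            if attempt == retries - 1:
                raise
            time.sleep(delay)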

I'm running python 2.6.4.

Does anyone know of a way to fix these issues with URLGrabber or know of another way to accomplish what I need with a different library?

A: 

Screen-scraping? Check out BeautifulSoup

Kimvais
A: 

Also, for screen scraping in Python there is Scrapy, though I don't know whether it will fulfil your requirements.

bobwah
+3  A: 

If you are writing a web crawler / screen scraper, you may want to look at a dedicated framework such as scrapy.

You can write quite sophisticated web crawlers with very little code: it takes care of all the gory details of scheduling requests and calls you back with the results for you to process however you need (it's built on Twisted, but it hides the implementation details away nicely).
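
To give a flavour, a minimal spider might look like the sketch below. This is not from the original answer; it assumes a recent Scrapy release, and the spider name, start URL, and extraction rule are placeholders:

import scrapy

class LinkSpider(scrapy.Spider):
    # Hypothetical spider: collects the href of every link on the start page.
    name = "links"
    start_urls = ["http://example.com/"]

    def parse(self, response):
        # Scrapy handles scheduling, retries and throttling; you only write
        # the callback that turns a response into items or further requests.
        for href in response.xpath("//a/@href").extract():
            yield {"url": response.urljoin(href)}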

jkp
Does scrapy play nice with django? I need to get everything I scrape into a django/mysql DB.
Gattster
I'm assuming you want to create records in your django back-end based on the results of your scraping exercise? If so, all you need to do is transform the results your crawler returns into django model records (I haven't used django in a long time, but iirc it uses SQLAlchemy) and call the relevant methods to save them to your database: there is nothing special about scrapy-based code compared to any other python code (i.e., if you can get the data in using normal Python scripts, you can do it with a scrapy-based scraper :))
jkp
No, it does not use SQLAlchemy. Still, putting the records in the database wouldn't be too hard. You'd have to set the `DJANGO_SETTINGS_MODULE` environment variable to the import path of a settings file with DB connections, then just create and save models like you would in a Django view.
LeafStorm
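
A minimal sketch of what that comment describes, using a hypothetical project path and model; on Django 1.7 and later you would also call django.setup() before importing models:

import os
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")  # hypothetical path

# On Django 1.7+ you also need this before importing models:
# import django; django.setup()

from myapp.models import Page  # hypothetical model with url/body fields

def save_result(url, html):
    # Create and save a record exactly as you would in a Django view.
    Page(url=url, body=html).save()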
A: 

Scrapy sounds great, and I will consider using it in the future. For this project, however, I'm really looking for a simple function as described above. I have written one that seems to get the job done.

import socket
import time
import urllib2

class HttpLoadError(RuntimeError):
    pass

class Http404(HttpLoadError):
    pass

class HttpFailedRepeatedly(HttpLoadError):
    pass

def safeurlopen(url):
    # Retry transient failures, translating them into our own exceptions
    # so callers never see raw urllib2/socket errors.
    i = 0
    while True:
        i += 1
        try:
            return urllib2.urlopen(url)
        except (urllib2.HTTPError, socket.error), e:
            # A 404 is permanent; give up immediately.
            if getattr(e, 'code', '') == 404:
                raise Http404
            # After ten attempts, stop retrying and report the last error.
            if i >= 10:
                raise HttpFailedRepeatedly(e)
            time.sleep(30)

def safeurlopenandread(url):
    # Read the whole response and close the handle so file descriptors
    # are not leaked.
    rh = safeurlopen(url)
    res = rh.read()
    rh.close()
    return res
Gattster
A: 

The methods employed by the Harvestman crawler might be worth studying.

Noufal Ibrahim