I'm writing a spider that needs a load_url function that performs the following for me:

  1. Retry the URL if there is a temporary error, without leaking exceptions.
  2. Not leak memory or file handles
  3. Use HTTP-KeepAlive for speed (optional)

URLGrabber looks great on the surface, but it has problems. First I hit a "too many open files" error, but I was able to work around it by turning keep-alive off. Then the function started raising socket.error: [Errno 104] Connection reset by peer. That error should have been caught, and perhaps a URLGrabberError raised in its place.
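
Roughly, the kind of wrapper being described would look something like the sketch below. This is untested; urlgrabber's module-level urlread function and its keepalive option are assumed from its documentation, and the retry count and delay are arbitrary:

import socket
import time
import urlgrabber
from urlgrabber.grabber import URLGrabError

def load_url(url, retries=3, delay=30):
    # Sketch only: retry transient failures and catch the socket.error
    # that urlgrabber otherwise leaks. keepalive=0 is the workaround
    # mentioned above (option name assumed from urlgrabber's docs).
    for attempt in range(retries):
        try:
            return urlgrabber.urlread(url, keepalive=0)
        except (URLGrabError, socket.error):
            if attempt == retries - 1:
                raise
            time.sleep(delay)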

I'm running python 2.6.4.

Does anyone know of a way to fix these issues with URLGrabber or know of another way to accomplish what I need with a different library?

A: 

Screen-scraping? Check out BeautifulSoup

Kimvais
A: 

Also, for screen scraping in Python there is Scrapy, though I don't know whether it will fulfil your requirements.

bobwah
+3  A: 

If you are writing a web crawler / screen scraper, you may want to look at a dedicated framework such as scrapy.

You can write quite sophisticated web crawlers with very little code: it takes care of all the gory details of scheduling requests and calls you back with the results for you to process however you need (it's built on Twisted, but it hides the implementation details away nicely).
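
To give a flavour, a minimal spider might look like the sketch below. This is not from the original answer; it assumes a recent Scrapy release, and the spider name, start URL, and extraction rule are placeholders:

import scrapy

class LinkSpider(scrapy.Spider):
    # Hypothetical spider: collects the href of every link on the start page.
    name = "links"
    start_urls = ["http://example.com/"]

    def parse(self, response):
        # Scrapy handles scheduling, retries and throttling; you only write
        # the callback that turns a response into items or further requests.
        for href in response.xpath("//a/@href").extract():
            yield {"url": response.urljoin(href)}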

jkp
Does scrapy play nice with django? I need to get everything I scrape into a django/mysql DB.
Gattster
I'm assuming you want to create records in your django back-end based on the results of your scraping exercise? If so, all you need to do is transform the results your crawler returns into django model records (I haven't used django in a long time, but iirc it uses SQLAlchemy) and call the relevant methods to save them to your database: there is nothing special about scrapy-based code compared to any other python code (i.e., if you can get the data in using normal Python scripts, you can do it with a scrapy-based scraper :))
jkp
No, it does not use SQLAlchemy. Still, putting the records in the database wouldn't be too hard. You'd have to set the `DJANGO_SETTINGS_MODULE` environment variable to the import path of a settings file with DB connections, then just create and save models like you would in a Django view.
LeafStorm
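
A minimal sketch of what that comment describes, using a hypothetical project path and model; on Django 1.7 and later you would also call django.setup() before importing models:

import os
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")  # hypothetical path

# On Django 1.7+ you also need this before importing models:
# import django; django.setup()

from myapp.models import Page  # hypothetical model with url/body fields

def save_result(url, html):
    # Create and save a record exactly as you would in a Django view.
    Page(url=url, body=html).save()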
A: 

Scrapy sounds great, and I will consider using it in the future. For this project, however, I'm really looking for a simple function as described above. I have written one that seems to get the job done.

import socket
import time
import urllib2

class HttpLoadError(RuntimeError):
    pass

class Http404(HttpLoadError):
    pass

class HttpFailedRepeatedly(HttpLoadError):
    pass

def safeurlopen(url):
    # Retry transient failures, translating them into our own exceptions
    # so callers never see raw urllib2/socket errors.
    i = 0
    while True:
        i += 1
        try:
            return urllib2.urlopen(url)
        except (urllib2.HTTPError, socket.error), e:
            # A 404 is permanent; give up immediately.
            if getattr(e, 'code', '') == 404:
                raise Http404
            # After ten attempts, stop retrying and report the last error.
            if i >= 10:
                raise HttpFailedRepeatedly(e)
            time.sleep(30)

def safeurlopenandread(url):
    # Read the whole response and close the handle so file descriptors
    # are not leaked.
    rh = safeurlopen(url)
    res = rh.read()
    rh.close()
    return res
Gattster
A: 

The methods employed by the Harvestman crawler might be worth studying.

Noufal Ibrahim