I'm half-tempted to write my own, but I don't really have enough time right now. I've seen the Wikipedia list of open source crawlers (http://en.wikipedia.org/wiki/Web_crawler#Open-source_crawlers), but I'd prefer something written in Python. I realize that I could probably just use one of the tools on the Wikipedia page and wrap it in Python. I might end up doing that - if anyone has any advice about any of those tools, I'm open to hearing about them. I've used Heritrix via its web interface and I found it to be quite cumbersome. I definitely won't be using a browser API for my upcoming project.

Thanks in advance. Also, this is my first SO question!

+1  A: 

I've used Ruya and found it pretty good.

kshahar
+2  A: 

Check out HarvestMan, a multi-threaded web crawler written in Python; also take a look at the spider.py module.

And here you can find code samples for building a simple web crawler.

CMS
+24  A: 
  • Mechanize is my favorite; it has great high-level browsing capabilities (super-simple form filling and submission); see the sketch below.
  • Twill is a simple scripting language built on top of Mechanize.
  • BeautifulSoup + urllib2 also works quite nicely.
  • Scrapy looks like an extremely promising project; it's new.
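As a rough, hypothetical sketch of what Mechanize's high-level browsing looks like (the URL and the 'q' form field are placeholders, and it assumes the page has a search form):

# A minimal Mechanize sketch: open a page, list its links, then fill in and
# submit its first form. The URL and the 'q' field name are placeholders
# for illustration only.
import mechanize

br = mechanize.Browser()
br.open('http://www.example.com/')

# enumerate the links on the page
for link in br.links():
    print link.url

# select the first form on the page, fill one field, and submit it
br.select_form(nr=0)
br['q'] = 'python web crawler'
response = br.submit()
print response.geturl()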
RexE
Add urllib2 to Beautiful Soup and you have a good combination of tools.
S.Lott
+1 for mentioning scrapy
nosklo
Excellent, thanks!
Matt
those libraries can be used for crawling, but they are not crawlers themselves
Plumo
Using Scrapy, for example, it's really trivial to create your own set of scraping rules. I haven't tried any of the others, but Scrapy is a really nice piece of code.
Victor Farazdagi
+19  A: 

Use Scrapy.

It is a Twisted-based web crawler framework. It is still under heavy development, but it already works. It has many goodies:

  • Built-in support for parsing HTML, XML, CSV, and JavaScript
  • A media pipeline for scraping items with images (or any other media) and downloading the image files as well
  • Support for extending Scrapy by plugging in your own functionality using middlewares, extensions, and pipelines
  • A wide range of built-in middlewares and extensions for compression, caching, cookies, authentication, user-agent spoofing, robots.txt handling, statistics, crawl depth restriction, etc.
  • Interactive scraping shell console, very useful for developing and debugging
  • Web management console for monitoring and controlling your bot
  • Telnet console for low-level access to the Scrapy process

Example code to extract information about all torrent files added today on the mininova torrent site, using an XPath selector on the HTML returned:

# Imports for the early Scrapy API this example uses (ScrapedItem,
# HtmlXPathSelector, RegexLinkExtractor); the exact module paths varied
# between those early releases, so treat these as approximate.
from scrapy.item import ScrapedItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.regex import RegexLinkExtractor
from scrapy.selector import HtmlXPathSelector

class Torrent(ScrapedItem):
    pass

class MininovaSpider(CrawlSpider):
    domain_name = 'mininova.org'
    start_urls = ['http://www.mininova.org/today']
    # follow links matching /tor/<id> and pass each response to parse_torrent
    rules = [Rule(RegexLinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        x = HtmlXPathSelector(response)
        torrent = Torrent()

        # pull each field out of the page with an XPath expression
        torrent.url = response.url
        torrent.name = x.x("//h1/text()").extract()
        torrent.description = x.x("//div[@id='description']").extract()
        torrent.size = x.x("//div[@id='info-left']/p[2]/text()[2]").extract()
        return [torrent]
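In the Scrapy releases of that era, a spider like this was run from the project directory with the scrapy-ctl.py control script (roughly: python scrapy-ctl.py crawl mininova.org); later releases replaced this with the scrapy crawl command.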
nosklo
A: 

Another simple spider uses BeautifulSoup and urllib2. Nothing too sophisticated; it just reads all the a href's, builds a list, and goes through it.
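A minimal sketch of that read-the-links-and-follow-them idea, assuming urllib2 and BeautifulSoup 3 (the seed URL and page limit are placeholders, not taken from the linked spider):

# Hypothetical sketch of the described approach: fetch a page, collect every
# a href into a list, then walk through that list, visiting each page once.
import urllib2
from BeautifulSoup import BeautifulSoup

def crawl(seed_url, max_pages=10):
    to_visit = [seed_url]
    seen = set()
    while to_visit and len(seen) < max_pages:
        url = to_visit.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urllib2.urlopen(url).read()
        except Exception:
            continue  # skip pages that fail to download
        soup = BeautifulSoup(html)
        # collect every <a> tag that carries an href attribute
        for tag in soup.findAll('a', href=True):
            link = tag['href']
            if link.startswith('http'):
                to_visit.append(link)
    return seen

print crawl('http://www.example.com/')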

pulegium
+1  A: 

Here is a web crawler I have written that can easily be modified to fit the needs of most projects. http://www.esux.net/content/simple-python-web-crawler-thats-easy-build

Zizzi
A: 

pyspider.py

Lars Bauer Jørgensen