Use Scrapy.
It is a web crawling framework based on Twisted. It is still under heavy development, but it already works. It has many goodies:
- Built-in support for parsing HTML, XML, CSV, and JavaScript
- A media pipeline for scraping items with images (or any other media) and downloading the media files as well
- Support for extending Scrapy by plugging in your own functionality using middlewares, extensions, and pipelines (a minimal pipeline sketch follows this list)
- Wide range of built-in middlewares and extensions for handling compression, caching, cookies, authentication, user-agent spoofing, robots.txt, statistics, crawl depth restriction, etc.
- Interactive scraping shell console, very useful for developing and debugging (used in the sketch after the example code below)
- Web management console for monitoring and controlling your bot
- Telnet console for low-level access to the Scrapy process
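
To give a taste of the pipeline extension point mentioned above: an item pipeline is just a class with a process_item method, hooked up through the ITEM_PIPELINES setting. Here is a rough sketch, not a definitive implementation; the class name and the size check are made up for illustration:

from scrapy.exceptions import DropItem

class RequireSizePipeline:
    # Hypothetical pipeline: drop any scraped item that has no size field.
    def process_item(self, item, spider):
        if not item.get('size'):
            raise DropItem('missing size: %s' % item.get('url'))
        return item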
Here is some example code to extract information about all torrent files added today on the Mininova torrent site, using XPath selectors on the returned HTML:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class Torrent(scrapy.Item):
    url = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()
    size = scrapy.Field()

class MininovaSpider(CrawlSpider):
    name = 'mininova'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    # Follow links to individual torrent pages and parse each one.
    rules = [Rule(LinkExtractor(allow=r'/tor/\d+'), callback='parse_torrent')]

    def parse_torrent(self, response):
        torrent = Torrent()
        torrent['url'] = response.url
        torrent['name'] = response.xpath('//h1/text()').get()
        torrent['description'] = response.xpath("//div[@id='description']").get()
        torrent['size'] = response.xpath("//div[@id='info-left']/p[2]/text()[2]").get()
        return torrent
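
Once the spider lives in a Scrapy project, the shell console mentioned above lets you try the XPath expressions interactively before committing to them, and the built-in feed exports can dump the scraped items straight to a file. Roughly like this (the JSON filename is just an example):

$ scrapy shell 'http://www.mininova.org/today'
>>> response.xpath('//h1/text()').get()   # experiment with selectors

$ scrapy crawl mininova -o torrents.json  # run the spider, export items as JSON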