I need to do a fairly extensive project involving web scraping and am considering using Hpricot or Beautiful Soup (i.e. Ruby or Python). Has anyone come across a tutorial that they thought was particularly good on this subject that would help me start the project off on the right foot?
views:
751answers:
6Two of my favorite tools for Python web scraping are Scrapy and Mechanize. Each of these projects has its own tutorial and best practices.
Not a tool, really, but a good discussion is Michael Shrenk's book, Webbots, Spiders, and Screen Scrapers.
The book succeeds very well in its stated mission: explaining how to build simple web bots and operate them in accordance with community standards. It’s not everything you need to know, but it’s the best introduction I’ve seen. The focus is on simple, single-threaded, bots. There’s some small mention of using multiple bots that store data in a central repository, but there’s no discussion of the issues involved in writing multi-threaded or distributed bots that can process hundreds of pages per second.
I recommend that you read this book if you’re at all interested in writing Web bots, even if you’re not familiar with or intending to use PHP. But be sure not to expect more than the book offers.
For Ruby, the Scrubyt web-scraping toolkit is excellent. Here's an extensive introduction to it, which is worth reading even if you'll be using some other tool.
Look into using lxml instead of BeautifulSoup. Despite its name, it is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup (their claim to fame - lxml just isn't as vocal about it). It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.
There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python isn't allowed.
Take a look at the following screencasts:
- http://railscasts.com/episodes/190-screen-scraping-with-nokogiri
- http://railscasts.com/episodes/191-mechanize
Or if you like it plain, the corresponding asciicasts: