I am looking to scrape a website like yelp.com to get a listing of all the bars they have there. Are there any tools or scripts out there that would help me do this?
From a Python perspective:
- httplib2 to automate the page downloads.
- Beautiful Soup for parsing the HTML source to get the info you want.
Read "An Introduction to Compassionate Screen Scraping" for a good tutorial that uses both tools and will get you started.
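The same download-then-parse pattern can be sketched with just the standard library (`urllib.request` standing in for httplib2, `html.parser` standing in for Beautiful Soup); the HTML fragment and the `biz-name` class are made up for illustration:

```python
from html.parser import HTMLParser
# For the download step you would do something like:
#   html = urllib.request.urlopen(url).read().decode()
# (httplib2 adds caching and smarter redirect handling on top of that.)

SAMPLE_HTML = """
<ul class="results">
  <li><a class="biz-name" href="/biz/foo-bar">Foo Bar</a></li>
  <li><a class="biz-name" href="/biz/dive-inn">Dive Inn</a></li>
</ul>
"""

class BizNameParser(HTMLParser):
    """Collects the text of every <a class="biz-name"> element."""
    def __init__(self):
        super().__init__()
        self.in_name = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "a" and ("class", "biz-name") in attrs:
            self.in_name = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_name = False

    def handle_data(self, data):
        if self.in_name:
            self.names.append(data.strip())

parser = BizNameParser()
parser.feed(SAMPLE_HTML)
print(parser.names)  # ['Foo Bar', 'Dive Inn']
```

Beautiful Soup does the same job with far less boilerplate (e.g. `soup.find_all("a", class_="biz-name")`), which is why it is the usual recommendation.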
I wrote a scraper back in the dot-com era to suck info from a few e-commerce websites. I used Perl, and for each site had two tiers of code: the "discover" tier would parse listing pages and find the items, and the "process" tier would read each product page, separate out the fields of data, and feed them into a database.
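That two-tier split can be sketched like this (in Python rather than Perl; the function names, markup, and sample data are all hypothetical):

```python
import re

# "Discover" tier: find links to individual item pages in a listing page.
def discover(listing_html):
    """Return the product-page paths found in a category listing."""
    return re.findall(r'href="(/item/[^"]+)"', listing_html)

# "Process" tier: pull the fields out of one product page.
def process(item_html):
    """Return a dict of fields scraped from a single product page."""
    name = re.search(r'<h1>(.*?)</h1>', item_html).group(1)
    price = re.search(r'Price:\s*\$([\d.]+)', item_html).group(1)
    # A real scraper would INSERT this record into the database here.
    return {"name": name, "price": float(price)}

listing = '<a href="/item/1">one</a> <a href="/item/2">two</a>'
page = '<h1>Widget</h1> <p>Price: $9.99</p>'
print(discover(listing))  # ['/item/1', '/item/2']
print(process(page))      # {'name': 'Widget', 'price': 9.99}
```

The point of the split is that the discover tier only has to understand listing pages and the process tier only has to understand product pages, so each can be fixed independently when the site's template changes.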
From the looks of what you want to do, I think rolling your own solution is probably the best approach, as it's not really complicated. Use Perl or a similar interpreted language with good string processing and regex support.
Extracting the data from the pages is really easy. Forget about parse trees (I went that way and gave up on it); it's much easier and more straightforward to manually identify the clumps of template text bordering each piece of info you want, and write a regex that anchors on them to extract the data in between.
Put the results in a list, hash, whatever, and then do what you want with them.
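As a toy illustration of the clump-bordering idea (shown in Python here; the table markup is invented, but the technique is the same in Perl):

```python
import re

# The fixed template text around each field becomes the regex; capture
# whatever varies between those fixed clumps.
page = ('<td class="name">Blue Moon Tavern</td>'
        '<td class="phone">555-0123</td>')

pattern = re.compile(
    r'<td class="name">(.*?)</td>'    # fixed clump, then the name
    r'<td class="phone">(.*?)</td>'   # fixed clump, then the phone
)
records = [{"name": n, "phone": p} for n, p in pattern.findall(page)]
print(records)  # [{'name': 'Blue Moon Tavern', 'phone': '555-0123'}]
```

This breaks the moment the site redesigns its template, but for a one-off scrape of a known site it is hard to beat for simplicity.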
I've done work like this on Superpages and CitySearch using screen-scraper. From there you can write your results to a CSV file, a database, or whatever.
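The CSV step, for instance, is trivial with Python's `csv` module once you have the records (the rows below are hypothetical; swap the `StringIO` for a real file handle in practice):

```python
import csv
import io

# Hypothetical scraped records, whatever your scraper produced.
rows = [
    {"name": "Blue Moon Tavern", "city": "Seattle"},
    {"name": "Dive Inn", "city": "Denver"},
]

# An in-memory buffer for the demo; for a real file use
# open("bars.csv", "w", newline="") instead.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "city"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

`DictWriter` handles the quoting and escaping for you, which matters as soon as a bar name contains a comma.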
The last time I was looking for such a tool, a friend suggested Automation Anywhere. I think it's a nice tool, mainly because of its point-and-click extraction. You can look for more info on this web scraping tool and use the free trial to get a better idea. I learned about it on this screen scraping page. Have a look.
Hmm, I'm interested in this as well. How can I use screen-scraper to get a listing of local businesses from Yelp?