I'm looking for suggestions regarding scraping toolkits. The solution need not be very tolerant of malformed HTML or able to adapt to many different situations. It doesn't need to be very scalable, since it will run at most once daily. It needs to do one thing and do it well: scrape HTML from a specific site.

I would rather use a CSS-selector-based scraper than an XPath-based one, as the former would be simpler to use given that I only want to scrape HTML.

I'm looking into scrAPI, but it is no longer being developed, and I am afraid it won't be ported to Ruby 1.9.x. I also ran into bugs in the (required) tidylib gem that had to be fixed manually (http://bit.ly/beZHMR). Bottom line: I don't want to build a solution that will gradually put itself out of business.

I looked into several other options (scRUBYt, Scrapy, Beautiful Soup), but none of them fit both requirements:

A) use Ruby/Rails or PHP

B) use CSS selectors, not XPath (unless I am overstating the complexity the latter would add to the job)
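
To make the trade-off in B) concrete, here is a minimal sketch of the same extraction written as a CSS selector (in a comment) and as XPath. It uses REXML only because it ships with Ruby and needs no extra gem; REXML is an XML parser, so this assumes well-formed markup. The sample HTML, the class names, and the `scrape_titles` helper are hypothetical and exist only for illustration.

```ruby
require 'rexml/document'

# Hypothetical sample markup. REXML parses XML, so it must be well-formed.
SAMPLE_HTML = <<~HTML
  <html>
    <body>
      <div class="listing">
        <a class="title" href="/items/1">First item</a>
        <a class="title" href="/items/2">Second item</a>
        <a class="other" href="/about">About</a>
      </div>
    </body>
  </html>
HTML

# CSS selector one would like to write:  div.listing a.title
# A roughly equivalent XPath. The exact-match test on @class only works
# when each element carries a single class, which is assumed here.
TITLE_XPATH = "//div[@class='listing']//a[@class='title']"

# Returns an array of { text:, href: } hashes for every matching link.
def scrape_titles(html)
  doc = REXML::Document.new(html)
  REXML::XPath.match(doc, TITLE_XPATH).map do |link|
    { text: link.text, href: link.attributes['href'] }
  end
end

TITLES = scrape_titles(SAMPLE_HTML)
TITLES.each { |t| puts "#{t[:text]} -> #{t[:href]}" }
```

For a simple case like this the XPath is only slightly longer than the CSS selector. The gap widens when elements carry several classes, because XPath has no shorthand for matching one class among many.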

I even looked at http://mozenda.com, but their tool choked on the first job and their support still hasn't gotten back to me.

Could anyone suggest a scraping toolkit that fits both requirements?

Thank you.

A: 

I opened a similar topic at http://stackoverflow.com/questions/3357303/whats-a-good-complete-php-mysql-screen-scraper-project

You may find PHP Simple HTML DOM Parser useful, though honestly I haven't tried it yet.

Anthony Ryan-Lorraine