Requirements
- Written in PHP
- Control over the code (open source would be awesome, purchasing code is an option too)
Optional features
- Respects robots.txt
- Automatic rate limiting
- Scrape based on rules into a data object
- Admin interface, or configurable back end, to set up new rules
- Something like CSS selectors to pick out data in the rules
- Periodic re-scraping, with update frequency based on importance
- Logs errors / alerts an appropriate party when rules need to be updated
- Written with the PHP Symfony framework would be astounding, but I'm not expecting this
- MySQL backend
- Other things I'm not thinking of that are important to screen scraping in general
I know I won't get everything I want in the optional features. I'm mainly looking for something decently developed rather than reinventing the wheel.
I've seen pieces like PHP Simple HTML DOM Parser mentioned in "HTML Scraping in Php". I will build a custom solution if needed, so anything that might help, even if it's not a complete solution, is appreciated.
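To make "scrape based on rules into a data object" concrete, here is a minimal sketch of the kind of thing I have in mind. It assumes PHP's built-in DOMDocument and DOMXPath, and it uses XPath expressions in place of CSS selectors since those need no extra library. The rule names, the sample HTML and the scrapeWithRules function are all made up for illustration.

```php
<?php
// Hypothetical rule set: field name => XPath expression (illustrative only).
$rules = [
    'title' => '//h1[@class="product-title"]',
    'price' => '//span[@class="price"]',
    'sku'   => '//span[@class="sku"]', // not present in the sample page below
];

/**
 * Apply a rule set to an HTML string and return the extracted fields.
 * A field whose rule matches nothing comes back as null, so the caller
 * can log it or alert someone that the rule probably needs updating.
 */
function scrapeWithRules(string $html, array $rules): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect libxml warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $data = [];
    foreach ($rules as $field => $expression) {
        $nodes = $xpath->query($expression);
        // Take the text of the first matching node, or null when nothing matched.
        $data[$field] = ($nodes !== false && $nodes->length > 0)
            ? trim($nodes->item(0)->textContent)
            : null;
    }
    return $data;
}

// Sample page standing in for a fetched document.
$html = '<html><body><h1 class="product-title"> Widget </h1>'
      . '<span class="price">9.99</span></body></html>';

$item = scrapeWithRules($html, $rules);
```

In a fuller version the rules would live in the MySQL back end rather than in a hard-coded array, and a null field would feed the error logging / alerting item from the list above.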