views:

271

answers:

3

I'm doing a personal, just for fun, project that is using screen scraping to give me a System Tray notification in case another line on an HTML table is added, modified or deleted.

Having done this before I thought: well let's go with the regular expression thing and that's it, but being a curious person, made me think that there could be something else out there that could have another paradigm but be as simple to use.

I know about DOM and X-Path and all the xml'ish approaches. I'm looking for something outside the box, something that can even be defined in a set of rules so you can make a plugin system to aggregate various sites.

+2  A: 

See Options for HTML Scraping

jrudolph
A: 

If you can convert the source into valid XHTML/XML using something like SgmlReader or HtmlTidy then you could use XSLT. Simply create a XSL template for each site you wish to scrape.

Macka
Now there are two problems--parsing the HTML and managing XSLT, and the "solution" is harder than the original problem.
Rob Williams
A: 

Here's an idea: assuming your main use case is getting a notification whenever an HTML file changes, why not use a standard diff tool and then loop through the changed lines, applying your rules?

Also, if this is a situation where you have access to the server and the files you're watching, you might be able to put everything under source control with CVS (or similar) and just watch for commits. If you want to use this approach for random sites on the web, just write a script that periodically downloads the html for the appropriate URLs and then commits it to source control and watch the diffs.

Not very practical, but outside the box.

Jason Morrison