Is there another way to do screen scraping apart from regular expressions?

views: 271

I'm working on a personal, just-for-fun project that uses screen scraping to show a system tray notification whenever a row in an HTML table is added, modified, or deleted.

Having done this before, my first thought was to go with regular expressions and be done with it. But being curious, I wondered whether there is something else out there that uses a different paradigm and is just as simple to use.

I know about DOM, XPath, and all the XML-ish approaches. I'm looking for something outside the box, something that could even be defined as a set of rules, so you could build a plugin system to aggregate various sites.

+2 A:

See Options for HTML Scraping

jrudolph 2008-09-17 07:42:54

If you can convert the source into valid XHTML/XML using something like SgmlReader or HtmlTidy, then you could use XSLT. Simply create an XSL template for each site you wish to scrape.
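To illustrate the "one template per site" idea, here is a minimal sketch in Python. Note that it is not XSLT: Python's standard library has no XSLT processor, so this swaps in the limited XPath support of `xml.etree.ElementTree` and keeps the per-site rules in a plain dictionary. The site name, table id, and field names are all made up for the example, and the input is assumed to be well-formed XHTML without a default namespace (real tidied output usually declares one, which would require namespace-qualified paths).

```python
import xml.etree.ElementTree as ET

# Hypothetical per-site rules, playing the role of one XSL template per site.
# "row" selects the table rows; "fields" maps a field name to a path
# evaluated relative to each row.
SITE_RULES = {
    "example-site": {
        "row": ".//table[@id='prices']/tbody/tr",
        "fields": {"name": "td[1]", "price": "td[2]"},
    },
}


def scrape(xhtml, site):
    """Apply the rules registered for `site` to a well-formed XHTML string."""
    rules = SITE_RULES[site]
    root = ET.fromstring(xhtml)
    records = []
    for row in root.findall(rules["row"]):
        record = {}
        for field, path in rules["fields"].items():
            cell = row.find(path)
            # Join all nested text so markup inside a cell is ignored.
            record[field] = "".join(cell.itertext()).strip() if cell is not None else ""
        records.append(record)
    return records


# Made-up sample page, already valid XHTML.
SAMPLE = """<html><body><table id="prices"><tbody>
<tr><td>Widget</td><td>9.99</td></tr>
<tr><td>Gadget</td><td>19.50</td></tr>
</tbody></table></body></html>"""

if __name__ == "__main__":
    for record in scrape(SAMPLE, "example-site"):
        print(record)
```

Adding a site then means adding one entry to `SITE_RULES`, which is roughly what a new XSL template would give you.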

Macka 2008-09-17 07:43:04

Now there are two problems: parsing the HTML and managing XSLT. The "solution" is harder than the original problem.

Rob Williams 2008-11-21 18:18:58

Here's an idea: assuming your main use case is getting a notification whenever an HTML file changes, why not use a standard diff tool and then loop through the changed lines, applying your rules?

Also, if you have access to the server and the files you're watching, you might be able to put everything under source control with CVS (or similar) and just watch for commits. If you want to use this approach for arbitrary sites on the web, write a script that periodically downloads the HTML for the relevant URLs, commits it to source control, and watches the diffs.

Not very practical, but outside the box.
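A rough sketch of the diff idea in Python, using the standard `difflib` module in place of an external diff tool. The function name and the sample snapshots are invented for illustration, and it assumes each table row sits on its own line in the downloaded HTML, which real pages may not guarantee.

```python
import difflib


def changed_lines(old_html, new_html):
    """Return (added, removed) lines between two HTML snapshots."""
    added, removed = [], []
    for line in difflib.ndiff(old_html.splitlines(), new_html.splitlines()):
        if line.startswith("+ "):
            added.append(line[2:])
        elif line.startswith("- "):
            removed.append(line[2:])
        # Lines starting with "  " (unchanged) or "? " (hints) are skipped.
    return added, removed


# Made-up snapshots: one row modified, one row appended.
OLD = "<tr><td>a</td></tr>\n<tr><td>b</td></tr>"
NEW = "<tr><td>a</td></tr>\n<tr><td>c</td></tr>\n<tr><td>d</td></tr>"

if __name__ == "__main__":
    added, removed = changed_lines(OLD, NEW)
    print("added:", added)
    print("removed:", removed)
```

A modified row shows up as one removed line plus one added line, so the rules you apply afterwards would need to pair those up if you want to tell "modified" apart from "added" and "deleted".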

Jason Morrison 2008-09-17 07:44:51
