ansaurus

Question

Answer 1

+1 A:

I've had a measure of success using htmlcleaner (http://htmlcleaner.sourceforge.net/): it's pretty quick and has options that let you determine how "strict" it should be. I try to avoid HTML scraping wherever possible, though, for all the obvious reasons (data exposed via REST or another form of API tends to be more reliable, legal, easier to parse, etc.).

davek 2009-09-29 20:15:13

Answer 2

+1 A:

Try Jaxer. It's the Firefox engine with the UI replaced with Apache, more or less.

Your code runs in Jaxer, where it can retrieve pages from the other server, use JS to extract the bits you want, and then do what you want with the HTML using Jaxer's other APIs. You can write the HTML to a file, send it on to another server, send it to a web client in response to an HTTP request, whatever.
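To make the "retrieve, extract, do something with it" flow concrete, here is a rough sketch. It is not Jaxer code: it swaps in Python's standard-library `html.parser` as a stand-in for the extraction step, and the `LinkExtractor` class, `extract_links` helper, and `SAMPLE` markup are all made up for illustration.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element.

    Hypothetical helper for illustration; it tolerates sloppy markup
    such as unclosed <p> tags because html.parser does not validate.
    """

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Parse an HTML string and return the links found in it."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up page content; in practice this string would come from an HTTP
# fetch (for example urllib.request.urlopen), which is left out here.
SAMPLE = (
    '<html><body><p>Docs: <a href="/a.html">First</a><br>'
    '<a href="/b.html">Second <b>page</b></a></body>'
)

links = extract_links(SAMPLE)
```

From there the extracted list can be written to a file, posted to another server, or returned to a client, which is the same last step described above.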

Warren Young 2009-09-29 20:15:53

Answer 3

+1 A:

The Mozilla parser seems like overkill here. I've used Jericho with some success for just the type of thing you are doing.

Byron Whitlock 2009-09-29 20:20:50

Yeah, this looks like a good option. I was getting the feeling that Mozilla was a little too much.

Kevin 2009-09-29 20:28:48

Thanks, I've been messing around with this and it'll get the job done.

Kevin 2009-09-29 21:12:50

Answer 4

A:

I have coded an HTML wrapper in JavaScript on the Mozilla platform. I packaged the code into two extensions for the Firefox browser. One, called MetaStudio, is a data schema definition tool that annotates Web pages semantically. The other, called DataScraper, is a tool that extracts data snippets from Web pages and formats them into XML files.

All of the source code is readable. Please go to http://www.gooseeker.com to download it.
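For a feel of the "format extracted snippets into XML" step, here is a minimal sketch. It is not DataScraper's code or its output format; it uses Python's standard `xml.etree.ElementTree`, and the `records_to_xml` helper, the tag names, and the sample record are assumptions made up for illustration. Field names are assumed to be valid XML element names.

```python
import xml.etree.ElementTree as ET


def records_to_xml(records, root_tag="records", item_tag="record"):
    """Serialize a list of {field: value} dicts as an XML string.

    Hypothetical helper: each dict becomes one <record> element and
    each key becomes a child element holding the value as text.
    """
    root = ET.Element(root_tag)
    for record in records:
        item = ET.SubElement(root, item_tag)
        for field, value in record.items():
            child = ET.SubElement(item, field)
            child.text = str(value)  # special characters are escaped on output
    return ET.tostring(root, encoding="unicode")


# Made-up snippet data standing in for values scraped from a page.
xml_text = records_to_xml([{"title": "Widget", "price": "9.99"}])
```

Building the tree with a library rather than concatenating strings means characters such as `&` and `<` in the scraped text are escaped automatically.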

2009-10-01 07:53:05

Mozilla Parser for screen scraping
