I'm writing an app that takes the HTML source of a page, extracts certain elements (such as tables), and returns the HTML for those elements. I'm trying to do this in Java, using the Mozilla parser to simplify navigating the page, but I'm having trouble extracting the HTML I need.

Maybe my whole approach (the Mozilla parser) is wrong, so if there are better solutions, I'm open to suggestions.

String html = /* whatever the page source is */;

MozillaParser p = /* instantiate parser */;

// pass in the HTML to parse, which creates a DOM object
Document d = p.parse(html);

// get a list of all the form elements in the page
NodeList l = d.getElementsByTagName("form");

// iterate through all forms
for (int i = 0; i < l.getLength(); i++) {

    // get a form
    Node n = l.item(i);

    // Print out the HTML for just this form.
    // This is the part I haven't figured out.
    // I made up the innerHTML() method (it does not exist on Node),
    // but that's the end result I want: a way to see
    // the HTML for a particular node.
    System.out.println( n.innerHTML() );
}
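
For reference, the made-up innerHTML() call above can be approximated with the JDK alone, assuming the parser hands back standard org.w3c.dom nodes (which the Document/NodeList/Node types in the snippet suggest, but that is an assumption). The sketch below serializes a single node with javax.xml.transform. The class name NodeToHtml and the helpers outerHtml and parse are hypothetical names for illustration; parse uses the JDK's XML parser on well-formed markup only as a stand-in for the Mozilla parser, so it would not cope with real-world tag soup.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NodeToHtml {

    // Hypothetical helper: serialize one DOM node (element plus children) to markup.
    static String outerHtml(Node node) throws Exception {
        // An identity transformer copies the DOM subtree to the output unchanged.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        // "html" output avoids XML-style self-closing of non-void elements.
        t.setOutputProperty(OutputKeys.METHOD, "html");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(node), new StreamResult(out));
        return out.toString();
    }

    // Stand-in for the real parser: only handles well-formed (XHTML-like) input.
    static Document parse(String markup) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body>"
                + "<form action=\"/a\"><input name=\"q\"/></form>"
                + "<p>not part of the form</p>"
                + "</body></html>";

        Document d = parse(html);
        NodeList forms = d.getElementsByTagName("form");

        // Print the markup of each form on its own.
        for (int i = 0; i < forms.getLength(); i++) {
            System.out.println(outerHtml(forms.item(i)));
        }
    }
}
```

The serializer may normalize whitespace and attribute quoting, so the output is equivalent markup rather than a byte-for-byte copy of the original source.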
+1  A: 

I've had a measure of success using htmlcleaner (http://htmlcleaner.sourceforge.net/): it's pretty quick and has options that let you determine how "strict" it should be. I try to avoid HTML scraping wherever possible, though, for all the obvious reasons (data exposed via REST or another form of API tends to be more reliable, legal, easier to parse, etc.).

davek
+1  A: 

Try Jaxer. It's the Firefox engine with the UI replaced with Apache, more or less.

Your code runs in Jaxer, can retrieve pages from the other server, use JS to extract the bits you want, and then do what you want with the HTML using Jaxer's other APIs. You can write the HTML to a file, send it on to another server, send it to a web client in response to an HTTP request, whatever.

Warren Young
+1  A: 

The Mozilla parser seems like overkill here. I've used Jericho with some success for just the type of thing you are doing.

Byron Whitlock
Yeah, this looks like a good option. I was getting the feeling that Mozilla was a little too much.
Kevin
Thanks, I've been messing around with this and it'll get the job done.
Kevin
A: 

I have coded an HTML wrapper in JavaScript on the Mozilla platform. I packaged the code into two Firefox extensions. One, called MetaStudio, is a data schema definition tool that annotates Web pages semantically. The other, called DataScraper, is a tool that extracts data snippets from Web pages and formats them into XML files.

All of the source code is readable. Please go to http://www.gooseeker.com to download it.