I want to be able to grab content from web pages, especially the tags and the content within them. I have tried XQuery and XPath, but they don't seem to work for malformed XHTML, and regex is just a pain.

Is there a better solution? Ideally I would like to be able to ask for all the links and get back an array of URLs, or ask for the text of the links and get back an array of Strings with that text, or ask for all the bold text, etc.

+4  A: 

Run the XHTML through something like JTidy, which should give you back valid XML.
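
A minimal sketch of that pipeline, assuming JTidy is on the classpath; the URL is a placeholder, and the query uses the JDK's built-in javax.xml.xpath. Once Tidy has repaired the markup, the same XPath that failed on the raw page should work:

    import java.io.InputStream;
    import java.net.URL;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;
    import org.w3c.tidy.Tidy;

    public class TidyExample {
        public static void main(String[] args) throws Exception {
            // Placeholder URL; substitute the page you actually want to scrape.
            InputStream in = new URL("http://example.com/").openStream();

            // Let JTidy repair the markup into a well-formed DOM.
            Tidy tidy = new Tidy();
            tidy.setQuiet(true);
            tidy.setShowWarnings(false);
            Document doc = tidy.parseDOM(in, null);

            // With a valid DOM, plain XPath works again.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList links = (NodeList) xpath.evaluate(
                "//a/@href", doc, XPathConstants.NODESET);
            for (int i = 0; i < links.getLength(); i++) {
                System.out.println(links.item(i).getNodeValue());
            }
        }
    }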

Jay Kominek
+2  A: 

You may want to look at Watij. I have only used its Ruby cousin, Watir, but with it I was able to load a webpage and request all URLs of the page in exactly the manner you describe.

It was very easy to work with: it literally fires up a web browser and hands you back the page's information in a convenient form. IE support seemed best, but Firefox was also supported, at least with Watir.

Joshua McKinnon
I've used Watij and it works very nicely.
Brian Agnew
+2  A: 

I had some problems with JTidy back in the day. I think tags that weren't closed made JTidy fail; I don't know if that's fixed now. I ended up using something that was a wrapper around TagSoup, although I don't remember the exact project's name. There's also HTMLCleaner.
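
For anyone digging this up later, here is a rough sketch of the TagSoup route: it exposes a SAX parser, so you can push it through the JAXP identity transform to get a DOM. The URL is a placeholder, and this assumes the TagSoup jar is on the classpath:

    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMResult;
    import javax.xml.transform.sax.SAXSource;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;
    import org.xml.sax.InputSource;
    import org.xml.sax.XMLReader;

    public class TagSoupExample {
        public static void main(String[] args) throws Exception {
            // TagSoup presents itself as an ordinary SAX XMLReader.
            XMLReader reader = new org.ccil.cowan.tagsoup.Parser();

            // Identity transform: SAX events in, DOM out.
            DOMResult result = new DOMResult();
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.transform(
                new SAXSource(reader, new InputSource("http://example.com/")), // placeholder URL
                result);
            Document doc = (Document) result.getNode();

            // TagSoup puts elements in the XHTML namespace, so matching on
            // local-name() keeps the XPath simple.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList bold = (NodeList) xpath.evaluate(
                "//*[local-name()='b']", doc, XPathConstants.NODESET);
            for (int i = 0; i < bold.getLength(); i++) {
                System.out.println(bold.item(i).getTextContent());
            }
        }
    }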

John Ellinwood
+2  A: 

I've used http://htmlparser.sourceforge.net/. It can parse poorly formed HTML and makes data extraction quite easy.
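
Something along these lines, if memory serves; the URL is a placeholder, and the filter/tag classes come with the library:

    import org.htmlparser.Parser;
    import org.htmlparser.filters.NodeClassFilter;
    import org.htmlparser.tags.LinkTag;
    import org.htmlparser.util.NodeList;

    public class LinkExtractor {
        public static void main(String[] args) throws Exception {
            // Placeholder URL; the parser fetches and tokenizes the page itself.
            Parser parser = new Parser("http://example.com/");

            // Collect every <a> tag, however badly formed the surrounding markup is.
            NodeList links = parser.extractAllNodesThatMatch(
                new NodeClassFilter(LinkTag.class));

            for (int i = 0; i < links.size(); i++) {
                LinkTag link = (LinkTag) links.elementAt(i);
                System.out.println(link.getLink() + " -> " + link.getLinkText());
            }
        }
    }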

Marcelo Morales
This seems very similar to the .NET "HTML Agility Pack", which I use to do exactly what is asked here (get data from the HTML using XPath, even when it is not well formed).
Dror