I want to be able to grab content from web pages, especially the tags and the content within them. I have tried XQuery and XPath, but they don't seem to work for malformed XHTML, and regex is just a pain.

Is there a better solution? Ideally I would like to be able to ask for all the links and get back an array of URLs, or ask for the text of the links and get back an array of Strings with that text, or ask for all the bold text, etc.

+4  A: 

Run the XHTML through something like JTidy, which should give you back valid XML.
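
A minimal sketch of that pipeline, assuming JTidy is on the classpath; the URL is a placeholder, and the query uses the JDK's built-in javax.xml.xpath. Once Tidy has repaired the markup, the same XPath that failed on the raw page should work:

    import java.io.InputStream;
    import java.net.URL;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;
    import org.w3c.tidy.Tidy;

    public class TidyExample {
        public static void main(String[] args) throws Exception {
            // Placeholder URL; substitute the page you actually want to scrape.
            InputStream in = new URL("http://example.com/").openStream();

            // Let JTidy repair the markup into a well-formed DOM.
            Tidy tidy = new Tidy();
            tidy.setQuiet(true);
            tidy.setShowWarnings(false);
            Document doc = tidy.parseDOM(in, null);

            // With a valid DOM, plain XPath works again.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList links = (NodeList) xpath.evaluate(
                "//a/@href", doc, XPathConstants.NODESET);
            for (int i = 0; i < links.getLength(); i++) {
                System.out.println(links.item(i).getNodeValue());
            }
        }
    }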

Jay Kominek
+2  A: 

You may want to look at Watij. I have only used its Ruby cousin, Watir, but with it I was able to load a webpage and request all URLs of the page in exactly the manner you describe.

It was very easy to work with: it literally fires up a web browser and hands you back the page's information in a convenient form. IE support seemed best, but Firefox was also supported, at least with Watir.

Joshua McKinnon
I've used Watij and it works very nicely.
Brian Agnew
+2  A: 

I had some problems with JTidy back in the day. I think tags that weren't closed made JTidy fail; I don't know if that's fixed now. I ended up using something that was a wrapper around TagSoup, although I don't remember the exact project's name. There's also HTMLCleaner.
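
For anyone digging this up later, here is a rough sketch of the TagSoup route: it exposes a SAX parser, so you can push it through the JAXP identity transform to get a DOM. The URL is a placeholder, and this assumes the TagSoup jar is on the classpath:

    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMResult;
    import javax.xml.transform.sax.SAXSource;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;
    import org.xml.sax.InputSource;
    import org.xml.sax.XMLReader;

    public class TagSoupExample {
        public static void main(String[] args) throws Exception {
            // TagSoup presents itself as an ordinary SAX XMLReader.
            XMLReader reader = new org.ccil.cowan.tagsoup.Parser();

            // Identity transform: SAX events in, DOM out.
            DOMResult result = new DOMResult();
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.transform(
                new SAXSource(reader, new InputSource("http://example.com/")), // placeholder URL
                result);
            Document doc = (Document) result.getNode();

            // TagSoup puts elements in the XHTML namespace, so matching on
            // local-name() keeps the XPath simple.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList bold = (NodeList) xpath.evaluate(
                "//*[local-name()='b']", doc, XPathConstants.NODESET);
            for (int i = 0; i < bold.getLength(); i++) {
                System.out.println(bold.item(i).getTextContent());
            }
        }
    }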

John Ellinwood
+2  A: 

I've used http://htmlparser.sourceforge.net/. It can parse poorly formed HTML and makes data extraction quite easy.
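
Something along these lines, if memory serves; the URL is a placeholder, and the filter/tag classes come with the library:

    import org.htmlparser.Parser;
    import org.htmlparser.filters.NodeClassFilter;
    import org.htmlparser.tags.LinkTag;
    import org.htmlparser.util.NodeList;

    public class LinkExtractor {
        public static void main(String[] args) throws Exception {
            // Placeholder URL; the parser fetches and tokenizes the page itself.
            Parser parser = new Parser("http://example.com/");

            // Collect every <a> tag, however badly formed the surrounding markup is.
            NodeList links = parser.extractAllNodesThatMatch(
                new NodeClassFilter(LinkTag.class));

            for (int i = 0; i < links.size(); i++) {
                LinkTag link = (LinkTag) links.elementAt(i);
                System.out.println(link.getLink() + " -> " + link.getLinkText());
            }
        }
    }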

Marcelo Morales
This seems very similar to the .NET "HTML Agility Pack", which I use to do exactly what is asked here (get data from the HTML using XPath, even when it is not well formed).
Dror