Hi!

I have been searching for a good solution for a long time, but I can't find anything that fits my needs.

I want to parse an HTML file and display its content in a table, much like writing yet another RSS feed reader. Doing that with valid XML files is simple and straightforward using NSXMLParser, TouchXML, libxml directly, or one of the other XML parsers out there. But these frameworks either only work with XML or do not cope with untidy HTML. The site consists of divs containing links that contain images, paragraphs containing links and images, and so on: just a normal website. Using libxml directly seems far too complicated in that case.

Does anybody have more experience with parsing dirty HTML pages? Which (free) library or framework did you use? I have the feeling that I am missing something obvious here. It can't be that difficult to parse HTML files, can it?

I hope you can point me in the right direction!

+1  A: 

I have zero experience but... Can't you use WebKit's parser? I guess it should expose some kind of DOM without necessarily having to render the page.

Nicolás
Nope. You can't include WebKit directly on the iPhone. Only UIWebView, which doesn't expose anything under the hood.
Kenny Winker
+1  A: 

WebKit should handle dirty HTML and allows you to access the DOM tree using the "Page" and "Frame" classes. Those contain functions to find elements by ID and so on.

BastiBense
I just had a look at that... unfortunately WebKit is a private framework on iPhone OS, so that would prevent me from getting into the app store :(
Hutaffe
UIWebView is the embedded version of WebKit that is app store certified.
slebetman
See: http://drnicwilliams.com/2008/11/10/to-webkit-or-not-to-webkit-within-your-iphone-app/
slebetman
+1  A: 

Check out the libxml2 library, which is also available on the iPhone and comes with a built-in HTML parser. It claims to handle real-world HTML:

this module implements an HTML 4.0 non-verifying parser with API compatible with the XML parser ones. It should be able to parse "real world" HTML, even if severely broken from a specification point of view.
Anurag
+2  A: 

If you need to parse most of the page, trying libxml2 as Anurag suggests is a good idea.

If you just want small segments of data from the file, you are better off using regular expressions to pull the data out. There is also a built-in regex library, which you can access through the RegexKitLite wrapper.
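
As a rough illustration of the regular-expression route, here is a sketch that pulls the first `href` value out of a chunk of HTML. It swaps in the POSIX `regex.h` API rather than RegexKitLite, purely to keep the example self-contained; the helper name `first_href` and the pattern are assumptions, and the pattern only handles double-quoted attribute values.

```c
#include <regex.h>
#include <string.h>

/* Hypothetical helper (not from the thread): copies the first double-quoted
 * href value found in `html` into `out`, truncating it to fit.
 * Returns 1 if a match was found, 0 if not, -1 on error. */
int first_href(const char *html, char *out, size_t out_size)
{
    regex_t re;
    regmatch_t m[2];   /* m[0] = whole match, m[1] = captured value */
    int found = 0;

    if (out_size == 0)
        return -1;

    /* Case-insensitive, tolerates whitespace around the '=' sign. */
    if (regcomp(&re, "href[[:space:]]*=[[:space:]]*\"([^\"]*)\"",
                REG_EXTENDED | REG_ICASE) != 0)
        return -1;

    if (regexec(&re, html, 2, m, 0) == 0) {
        size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);
        if (len >= out_size)
            len = out_size - 1;
        memcpy(out, html + m[1].rm_so, len);
        out[len] = '\0';
        found = 1;
    }

    regfree(&re);
    return found;
}
```

This kind of pattern is fine for grabbing one or two well-known fragments, but it knows nothing about nesting or malformed tags, which is why a real parser is the better fit when most of the page is needed.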

Kendall Helmstetter Gelner
Well... Seems like I have to go the hard way, using RegEx together with libXML. Thanks for the link to RegExKit!
Hutaffe
+1  A: 

I had to do this some time ago. Eventually I ended up using HTML Tidy to clean up the HTML before parsing it using TouchXML.

When I did this, the HTML Tidy docs weren't very clear (IMHO), so I had to dig around a bit to find out how it actually worked. I don't have much time now, but I can look up the code I came up with if you want.

The source (and more) of HTML Tidy can be found here: http://tidy.sourceforge.net/

Rengers