Parsing dirty HTML on iPhone

views:

654

+2 Q:

Hi!

I have been searching for a good solution for a long time, but I can't find anything that fits my needs.

I want to parse an HTML file and display its content in a table, much like writing yet another RSS feed reader. Doing that with valid XML files is simple and straightforward using NSXMLParser, TouchXML, libxml directly, or one of the other XML parsers out there. But these frameworks either work only with XML or don't cope with non-tidy HTML. The site consists of divs containing links that contain images, paragraphs containing links and images, and so on: just a normal website. Using libxml directly seems far too complicated in that case.

Does anybody have more experience with parsing dirty HTML pages? Which (free) library or framework did you use? I have the feeling that I'm just missing something obvious here. It can't be that difficult to parse HTML files, can it?

I hope you can point me in the right direction!

+1 A:

I have zero experience but... Can't you use WebKit's parser? I guess it should expose some kind of DOM without necessarily having to render the page.

Nicolás 2010-01-09 01:06:17

Nope. You can't include WebKit directly on the iPhone. Only UIWebView is available, and it doesn't expose anything under the hood.

Kenny Winker 2010-01-09 17:36:18

+1 A:

WebKit should handle dirty HTML and allows you to access the DOM tree using the "Page" and "Frame" classes. Those contain functions to find elements by ID and so on.

BastiBense 2010-01-09 01:11:28

I just had a look at that. Unfortunately WebKit is a private framework on iPhone OS, so using it would keep me out of the App Store :(

Hutaffe 2010-01-09 01:32:15

UIWebView is the embedded version of WebKit that is App Store certified.

slebetman 2010-01-09 01:53:09

See: http://drnicwilliams.com/2008/11/10/to-webkit-or-not-to-webkit-within-your-iphone-app/

slebetman 2010-01-09 01:56:01

+1 A:

Check out the libxml2 library, which is also available on iPhone and comes with a built-in HTML parser. It claims to handle real-world HTML:

this module implements an HTML 4.0 non-verifying parser with API compatible with the XML parser ones. It should be able to parse "real world" HTML, even if severely broken from a specification point of view.

Anurag 2010-01-09 01:23:10

+2 A:

If you need to parse most of the page, trying libxml2 as Anurag suggests is a good idea.

If you just want small segments of data from the file, you are better off using regular expressions to pull out the data. There's also a built-in regex library, which you can access through the RegexKitLite wrapper.
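
To illustrate the regex approach, here is a minimal sketch in C. It uses the POSIX `regex.h` API rather than RegexKitLite (a swap made only to keep the example self-contained), and the helper name `extract_hrefs`, the 256-byte buffer size, and the pattern itself are assumptions for illustration. The pattern only matches double-quoted `href` values, so it is a targeted extraction, not a general HTML parser.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Hypothetical helper: copies up to max_out double-quoted href values found
   in html into out. Returns the number stored, or -1 if the pattern fails
   to compile. Values longer than 255 bytes are truncated. */
int extract_hrefs(const char *html, char out[][256], int max_out)
{
    assert(html != NULL);
    regex_t re;
    /* POSIX extended syntax: href, optional spaces around '=', then "..." */
    const char *pattern = "href[[:space:]]*=[[:space:]]*\"([^\"]*)\"";
    if (regcomp(&re, pattern, REG_EXTENDED | REG_ICASE) != 0)
        return -1;

    int count = 0;
    const char *cursor = html;
    regmatch_t m[2];                    /* m[0]: whole match, m[1]: the value */
    while (count < max_out && regexec(&re, cursor, 2, m, 0) == 0) {
        size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);
        if (len > 255)
            len = 255;
        memcpy(out[count], cursor + m[1].rm_so, len);
        out[count][len] = '\0';
        count++;
        cursor += m[0].rm_eo;           /* resume searching after this match */
    }
    regfree(&re);
    return count;
}
```

A caller would declare something like `char links[16][256];` and pass it with `16` as `max_out`. Because the function never builds a tree, unclosed or badly nested tags around the links do not affect the result.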

Kendall Helmstetter Gelner 2010-01-09 01:46:03

Well... it seems I have to go the hard way, using regular expressions together with libxml. Thanks for the link to RegExKit!

Hutaffe 2010-01-09 15:55:17

+1 A:

I had to do this some time ago. Eventually I ended up using HTML Tidy to clean up the HTML before parsing it using TouchXML.

When I did this, the HTML Tidy docs weren't very clear (IMHO), so I had to dig around a bit to find out how it actually worked. I don't have much time now, but I can look up the code I came up with if you want.

The source (and more) of HTML Tidy can be found here: http://tidy.sourceforge.net/

Rengers 2010-01-09 17:30:46
