Can anyone recommend a C or Objective-C library for HTML parsing? It needs to handle messy HTML code that won't quite validate.
Does such a library exist, or am I better off just trying to use regular expressions?
This probably depends on how messy the HTML is and what you want to extract. But usually Tidy does quite a good job. It is written in C and I guess you should be able to build and statically link it for the iPhone. You can easily install the command line version and test the results first.
Looks like libxml2.2 comes in the SDK, and libxml/HTMLparser.h claims the following:
This module implements an HTML 4.0 non-verifying parser with API compatible with the XML parser ones. It should be able to parse "real world" HTML, even if severely broken from a specification point of view.
That sounds like what I need, so I'm probably going to use that.
Google's GData Objective-C API reimplements NSXMLElement and other related classes that Apple removed from the iPhone SDK. You can find it here: http://code.google.com/p/gdata-objectivec-client/. I've used it for handling messaging via Jabber. Of course, if your HTML is malformed (missing closing tags), this might not help much.
You may want to check out ElementParser. It provides "just enough" parsing of HTML and XML. Nice interfaces make walking around XML / HTML documents very straightforward. http://touchtank.wordpress.com/
I found hpple quite useful for parsing messy HTML. The hpple project is an Objective-C wrapper around the XPathQuery library for parsing HTML. With it you can send an XPath query and receive the result.
Requirements:
-Add libxml2 includes to your project
-Add libxml2 library to your project
-From hpple get the following source code files and add them to your project:
-Work through the W3Schools XPath tutorial to get comfortable with the XPath language.
Code Example
#import "TFHpple.h"
NSData *data = [[NSData alloc] initWithContentsOfFile:@"example.html"];
// Create parser
TFHpple *xpathParser = [[TFHpple alloc] initWithHTMLData:data];
//Get all the cells of the 2nd row of the 3rd table
NSArray *elements = [xpathParser search:@"//table[3]/tr[2]/td"];
// Access the first cell (nil if the query matched nothing)
TFHppleElement *element = [elements count] > 0 ? [elements objectAtIndex:0] : nil;
// Get the text within the cell tag
NSString *content = [element content];
[xpathParser release];
[data release];
Known issues
Since hpple is a wrapper over XPathQuery, which is itself another wrapper, this option is probably not the most efficient. If performance is an issue in your project, I recommend writing your own lightweight solution based on the hpple and XPathQuery library code.
I wrote a lightweight wrapper around libxml which may be useful:
When I use libxml2 in my iPad project (I use the dylib and have added the header flags correctly), it doesn't build and gives this error:
/Xcode4/Platforms/iPhoneSimulator.platform/Developer/SDKs/iPhoneSimulator3.2.sdk/usr/include/libxml2/libxml/xmlversion.h:24
Expected '=', ',', ';', 'asm' or '__attribute__' before 'void'.
Lines 23-25 of xmlversion.h are:
#ifndef LIBXML2_COMPILING_MSCCDEF
XMLPUBFUN void XMLCALL xmlCheckVersion(int version);
#endif /* LIBXML2_COMPILING_MSCCDEF */
What am I doing wrong?
Thanks in advance! Kristof