views:

15439

answers:

8

Can anyone recommend a C or Objective-C library for HTML parsing? It needs to handle messy HTML code that won't quite validate.

Does such a library exist, or am I better off just trying to use regular expressions?

+3  A: 

This probably depends on how messy the HTML is and what you want to extract. But usually Tidy does quite a good job. It is written in C and I guess you should be able to build and statically link it for the iPhone. You can easily install the command line version and test the results first.

tcurdt
+25  A: 

Looks like libxml2.2 comes in the SDK, and libxml/HTMLparser.h claims the following:

This module implements an HTML 4.0 non-verifying parser with API compatible with the XML parser ones. It should be able to parse "real world" HTML, even if severely broken from a specification point of view.

That sounds like what I need, so I'm probably going to use that.

Ben Alpert
+1  A: 

Google's GData Objective-C API reimplements NSXMLElement and other related classes that Apple removed from the iPhone SDK. You can find it here http://code.google.com/p/gdata-objectivec-client/. I've used it for dealing messaging via Jabber. Of course if your HTML is malformed (missing closing tags) this might not help much.

dnolen
+3  A: 

You may want to check out ElementParser. It provides "just enough" parsing of HTML and XML. Nice interfaces make walking around XML / HTML documents very straightforward. http://touchtank.wordpress.com/

+14  A: 

I found using hpple quite useful to parse messy HTML. Hpple project is a Objective-C wrapper on the XPathQuery library for parsing HTML. Using it you can send an XPath query and receive the result .

Requirements:

-Add libxml2 includes to your project

  1. Menu Project->Edit Project Settings
  2. Search for setting "Header Search Paths"
  3. Add a new search path "${SDKROOT}/usr/include/libxml2"
  4. Enable recursive option

-Add libxml2 library to to your project

  1. Menu Project->Edit Project Settings
  2. Search for setting "Other Linker Flags"
  3. Add a new search flag "-lxml2"

-From hpple get the following source code files an add them to your project:

  1. HTFpple.h
  2. HTFpple.m
  3. HTFppleElement.h
  4. HTFppleElement.m
  5. XPathQuery.h
  6. XPathQuery.m

-Take a walk on w3school XPath Tutorial to feel comfortable with the XPath language.

Code Example

#import "TFHpple.h"

NSData *data = [[NSData alloc] initWithContentsOfFile:@"example.html"];

// Create parser
xpathParser = [[TFHpple alloc] initWithHTMLData:data];

//Get all the cells of the 2nd row of the 3rd table 
NSArray *elements  = [xpathParser search:@"//table[3]/tr[2]/td"];

// Access the first cell
TFHppleElement *element = [elements objectAtIndex:0];

// Get the text within the cell tag
NSString *content = [element content];  

[xpathParser release];
[data release];

Known issues

As hpple is a wrapper over XPathQuery which is another wrapper, this option probably is not the most efficient. If performance is an issue in your project, I recommend to code your own lightweight solution based on hpple and xpathquery library code.

Albaregar
I used this just now, and it worked very well so far.
Karsten Silz
Seconded: great find :)
jkp
+1  A: 

I wrote a lightweight wrapper around libxml which maybe useful:

http://benreeves.co.uk/objective-c-hmtl-parser/

Ben Reeves
Looks great Ben. I may be using it in my upcoming iPad application.
Brock Woolf
+1  A: 

Have you tried WonderXML: http://wonderxml.googlecode.com/

Luke Du
A: 

When I use libxml2 in my ipad project (i use the dylib and add the header flags correctly) it doesn't build and gives the error:

/Xcode4/Platforms/iPhoneSimulator.platform/Developer/SDKs/iPhoneSimulator3.2.sdk/usr/include/libxml2/libxml/xmlversion.h:24

Expected '=','','.','asm' or 'atrrbitue' before 'void'.

line 23-25 of xmlversion.h is

#ifndef LIBXML2_COMPILING_MSCCDEF
XMLPUBFUN void XMLCALL xmlCheckVersion(int version);
#endif /* LIBXML2_COMPILING_MSCCDEF */

What am i doing wrong ?

Thanks in advance! Kristof

Kristof
You might want to post this as a new question, rather than an answer on an existing question.
Josh Hinman