I want to grab text from a list of web pages. I've done a bit of experimenting and found that the best way for my needs is via WebKit.

Once the source of the page has been grabbed, I want to strip out all the HTML tags using the technique in this comment.

Here's my code:

- (void)webView:(WebView *)sender didFinishLoadForFrame:(WebFrame *)frame {
    // Only process the main frame, not subframes.
    if(frame == [sender mainFrame]) {
        NSError *theError = nil;
        NSString *content = [[[[sender mainFrame] dataSource] representation] documentSource];
        // This is the call that comes back nil, with theError set, on the problem pages.
        NSXMLDocument *theDocument = [[NSXMLDocument alloc] initWithXMLString:content options:NSXMLDocumentTidyHTML error:&theError];
        // Stylesheet: plain-text output, with the contents of head and script elements dropped.
        NSString *theXSLTString = @"<?xml version='1.0' encoding='utf-8'?>\n<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform' xmlns:xhtml='http://www.w3.org/1999/xhtml'>\n<xsl:output method='text'/>\n<xsl:template match='xhtml:head'></xsl:template>\n<xsl:template match='xhtml:script'></xsl:template>\n</xsl:stylesheet>";
        NSData *theData = [theDocument objectByApplyingXSLTString:theXSLTString arguments:nil error:&theError];
        NSString *theString = [[NSString alloc] initWithData:theData encoding:NSUTF8StringEncoding];
    }
}

This works fine on most pages. However, if a page doesn't validate correctly as XHTML, I sometimes get an error back from initWithXMLString:options:error:.

That's fair enough - I'm asking it to tidy up the XHTML, so I'd expect it to report what problems it's encountered. But if there's a problem with the validation, it returns nil and an error rather than actually tidying up the XHTML.

One specific page that's causing the problem is the Ruby class documentation.

I've found that the excellent third-party HTML Tidy application can clean up this XHTML fine, but I'd expect NSXMLDocumentTidyHTML to be able to just add some quotes around cellpadding values. It's a fairly basic cleanup operation, and I'm not keen to add another dependency to my code base.

Is there something I'm missing with the way Cocoa cleans up XHTML? Or do I just need to bite the bullet and use HTML Tidy instead in my code?

+1  A: 

XHTML documents are treated as XML, so you may have better luck with the NSXMLDocumentTidyXML flag.

Ben Stiglitz
Worth noting that they're not mutually exclusive. You can use NSXMLDocumentTidyHTML | NSXMLDocumentTidyXML to get both behaviors together. TidyXML fixes up invalid XML to be valid; TidyHTML makes the document's string values easier to read.
Peter Hosey
Thanks a lot, chaps. Really helpful. I tried NSXMLDocumentTidyHTML and NSXMLDocumentTidyXML separately, but in my frustration forgot to try them together. This did the trick! No more reliance on HTML Tidy for me. Marvellous.
John Gallagher
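
For anyone landing here later, the fix the comments above arrive at (passing both tidy flags together) might look roughly like the sketch below. The helper name PlainTextFromHTML is made up for illustration and is not from the original post; the option flags, the NSXMLDocument calls and the stylesheet are the ones from the thread. It assumes ARC; under manual reference counting the allocated objects would need releasing.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original post): returns the text content
// of an HTML/XHTML string, or nil if parsing or the transform fails.
// Assumes ARC.
static NSString *PlainTextFromHTML(NSString *source, NSError **outError) {
    // Combine both tidy flags, as suggested in the comments above.
    NSUInteger options = NSXMLDocumentTidyHTML | NSXMLDocumentTidyXML;
    NSXMLDocument *document = [[NSXMLDocument alloc] initWithXMLString:source
                                                               options:options
                                                                 error:outError];
    if (document == nil) {
        return nil; // *outError, if provided, describes what went wrong
    }

    // Same stylesheet as in the question: text output, skipping head and script.
    NSString *xslt =
        @"<?xml version='1.0' encoding='utf-8'?>\n"
        @"<xsl:stylesheet version='1.0'"
        @" xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        @" xmlns:xhtml='http://www.w3.org/1999/xhtml'>\n"
        @"<xsl:output method='text'/>\n"
        @"<xsl:template match='xhtml:head'></xsl:template>\n"
        @"<xsl:template match='xhtml:script'></xsl:template>\n"
        @"</xsl:stylesheet>";

    // With method='text' the transform is expected to hand back NSData;
    // anything else (including nil on failure) is treated as an error here.
    id result = [document objectByApplyingXSLTString:xslt arguments:nil error:outError];
    if (![result isKindOfClass:[NSData class]]) {
        return nil;
    }
    return [[NSString alloc] initWithData:result encoding:NSUTF8StringEncoding];
}
```

From the delegate method in the question, this would be called with the documentSource string, and a nil result checked before using the text. Whether the combined flags cope with every malformed page is not something the thread establishes, so logging the returned error is probably still worthwhile.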