How do you get a DOMDocument from a given HTML string using WebKit? In other words, what would an implementation of DOMDocumentFromHTML: look like that makes the following work?

NSString * htmlString = @"<html><body><p>Test</body></html>";
DOMDocument * document = [self DOMDocumentFromHTML: htmlString];

DOMNode * bodyNode = [[document getElementsByTagName: @"body"] item: 0];
// ... etc.

This seems like it should be straightforward to do, yet I'm still having trouble figuring out how :( ...

A: 

As far as I can tell from another answer on this site, WebKit offers no synchronous method like the DOMDocumentFromHTML: I asked for.

So far, the best I've been able to do is the following asynchronous combination of giveDOMDocumentFromHTML:usingBaseURL: and takeDOMDocument:.

- (void) giveDOMDocumentFromHTML: (NSString *) htmlString
         usingBaseURL: (NSURL *) baseURL
{
    WebView * webView = [[WebView alloc] init];
    [webView setFrameLoadDelegate: self];
    [[webView mainFrame] loadHTMLString: htmlString
                         baseURL: baseURL];
}

- (void) takeDOMDocument: (DOMDocument *) document
{
    DOMHTMLElement * bodyNode =
        (DOMHTMLElement *) [[document getElementsByTagName: @"body"] item: 0];
    NSLog(@"Body is: %@", [bodyNode innerHTML]);
}

They are hooked together through the following delegate method:

- (void) webView: (WebView *) webView
         didFinishLoadForFrame: (WebFrame *) frame
{
    if (frame == [webView mainFrame]) {
        [self takeDOMDocument: [frame DOMDocument]];
    }
}

The above works, but has at least the following remaining issues:

  • I'm not sure where the allocated WebView should be sent a release or autorelease message.
  • I would prefer/need the application to remain blocked until the HTML page has been processed. In the above scheme the application will be processing any user input while the WebView is loading/parsing the HTML. (Note that the WebView will never be shown on screen.)

So this still leaves a lot of room for improvement. Can anyone provide a synchronous implementation of DOMDocumentFromHTML:, as outlined in the original question?
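
For what it's worth, one approach that is sometimes suggested for the blocking requirement is to spin the current run loop until the WebView reports that it has finished loading. The following is only a sketch under several assumptions, not a confirmed solution: HTMLLoader is a hypothetical helper class, the code assumes manual reference counting and the main thread, and it assumes that isLoading already returns YES right after loadHTMLString:baseURL: is called. It is also an assumption that the returned DOMDocument stays usable once the WebView has been released; if it does not, the WebView would have to be kept alive alongside the document.

```objective-c
#import <Cocoa/Cocoa.h>
#import <WebKit/WebKit.h>

// HTMLLoader is a hypothetical helper class, not part of WebKit.
@interface HTMLLoader : NSObject
- (DOMDocument *) DOMDocumentFromHTML: (NSString *) htmlString
                  usingBaseURL: (NSURL *) baseURL;
@end

@implementation HTMLLoader

// Call on the main thread only. Assumes manual reference counting.
- (DOMDocument *) DOMDocumentFromHTML: (NSString *) htmlString
                  usingBaseURL: (NSURL *) baseURL
{
    WebView * webView = [[WebView alloc] init];
    [[webView mainFrame] loadHTMLString: htmlString
                         baseURL: baseURL];

    // Run the run loop in short slices until the WebView says it is done.
    while ([webView isLoading]) {
        [[NSRunLoop currentRunLoop]
            runMode: NSDefaultRunLoopMode
            beforeDate: [NSDate dateWithTimeIntervalSinceNow: 0.05]];
    }

    // Retain the document before letting go of the WebView (see the
    // assumption about its lifetime in the text above).
    DOMDocument * document =
        [[[[webView mainFrame] DOMDocument] retain] autorelease];
    [webView release];
    return document;
}

@end
```

With this shape the WebView is allocated and released inside a single method, which would also settle the question of where to send release. Timers and other run loop sources still fire while the loop is spinning, so the method blocks the caller rather than freezing the whole application.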

Rinzwind
+2  A: 

Not an actual answer to the question, but I've now concluded that WebKit and DOMDocument are probably not the most appropriate tools for what I want to do, which is to process an HTML document that is never shown to the user. The class NSXMLDocument straightforwardly and synchronously turns an HTML document into a manipulable object structure:

NSError * error = nil;
NSString * htmlString = @"<html><body><p>Test</body></html>";

NSXMLDocument * doc =
  [[NSXMLDocument alloc]
     initWithXMLString: htmlString
     options: NSXMLDocumentTidyHTML
     error: &error];
NSLog(@"Error is: %@", error);
NSLog(@"Doc is: %@", doc);
NSLog(@"Root element is: %@", [doc rootElement]);
NSLog(@"Root element's children are: %@", [[doc rootElement] children]);
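
As a rough counterpart to the body lookup in the original question, the sketch below fetches the body element from the parsed document with elementsForName:. It assumes manual reference counting and that the tidied document has html as its root element with body as a direct child. It tests the returned document for nil instead of the error object, since the error parameter may, as far as I know, also carry parser warnings when parsing succeeds.

```objective-c
#import <Foundation/Foundation.h>

int main(void)
{
    NSAutoreleasePool * pool = [[NSAutoreleasePool alloc] init];

    NSError * error = nil;
    NSString * htmlString = @"<html><body><p>Test</body></html>";
    NSXMLDocument * doc =
      [[[NSXMLDocument alloc]
         initWithXMLString: htmlString
         options: NSXMLDocumentTidyHTML
         error: &error] autorelease];

    // A nil document means parsing failed outright.
    if (doc == nil) {
        NSLog(@"Parsing failed: %@", error);
        [pool drain];
        return 1;
    }

    // Look up the direct <body> child of the root element, if any.
    NSArray * bodies = [[doc rootElement] elementsForName: @"body"];
    if ([bodies count] > 0) {
        NSXMLElement * bodyNode = [bodies objectAtIndex: 0];
        NSLog(@"Body is: %@", [bodyNode XMLString]);
    }

    [pool drain];
    return 0;
}
```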
Rinzwind