Hello,

I am trying to extract the summary of a Wikipedia article from the downloaded page and store it as a string. This works well for some articles, but the markup on Wikipedia is inconsistent, so my NSScanner code fails fairly often.

Here's my NSScanner implementation:

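// Marks the start of the table of contents, which follows the summary.
NSString *separatorString = @"<table id=\"toc\" class=\"toc\">";
NSString *muString = @"</table>";  // end of the first table on the page
NSString *container = nil;

// string holds the downloaded HTML of the article.
NSScanner *aScanner = [NSScanner scannerWithString:string];

// Skip everything up to and including the first closing </table> tag.
[aScanner scanUpToString:muString intoString:nil];
[aScanner scanString:muString intoString:nil];

// Capture everything between that point and the table of contents.
[aScanner scanUpToString:separatorString intoString:&container];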

How could this be improved? Or is there another way of getting this?

To visualize which bit of the article I want, here's an example:

http://en.wikipedia.org/wiki/Indigo

From this page, I'd want everything from "Indigo is the color on the electromagnetic spectrum" to "in English was in 1289".

Thanks!

+1  A: 

You could use WebKit's DOM API to walk the actual structure, rather than trying to parse the text blindly.

Joshua Nozzi
That's not a good idea, because the wiki pages are way too inconsistent.
David Schiefer
First, they're consistent enough that there are a half-dozen apps out there that parse them and present them beautifully on the iPhone and iPad. Second, if using the document's DOM is a bad idea because it's inconsistent, then using NSScanner is at least as bad. At any rate, they look pretty consistent to me: the summary is the first p element in the "bodyContent" div. I've spot-checked several articles and they all follow that form. That is easy to get with the DOM.
Joshua Nozzi
David Schiefer: The DOM is a much more reliable way to examine these “inconsistent” pages. Consider that with the DOM, you can get the #toc element *wherever and however* it exists. You simply cannot do that with NSScanner.
Peter Hosey
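
For illustration, the idea from the comments above (take the first p element inside the "bodyContent" div) could be sketched roughly as follows. This is only a sketch under a few assumptions: it uses Foundation's NSXMLDocument with the NSXMLDocumentTidyHTML option and an XPath query instead of WebKit's DOM classes, so that it runs on the Mac without a loaded WebView. The helper name FirstBodyParagraph is hypothetical, and the "bodyContent" id is taken from the comment above and may not match every page.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the thread): returns the text of the first
// <p> found inside the <div> whose id is "bodyContent", or nil if there is none.
static NSString *FirstBodyParagraph(NSString *html) {
    NSError *error = nil;

    // NSXMLDocumentTidyHTML asks Foundation to repair the HTML into
    // well-formed XML before building the document tree.
    NSXMLDocument *doc = [[NSXMLDocument alloc] initWithXMLString:html
                                                          options:NSXMLDocumentTidyHTML
                                                            error:&error];
    if (doc == nil) {
        return nil;
    }

    // local-name() matches the elements whether or not the tidied document
    // places them in the XHTML namespace.
    NSString *xpath =
        @"//*[local-name()='div'][@id='bodyContent']//*[local-name()='p']";
    NSArray *paragraphs = [doc nodesForXPath:xpath error:&error];
    if (paragraphs.count == 0) {
        return nil;
    }

    // stringValue concatenates the text of the node and its descendants,
    // so inline markup such as links and bold text is dropped.
    return [[paragraphs objectAtIndex:0] stringValue];
}
```

With the page HTML in the string variable from the question, a call such as NSString *summary = FirstBodyParagraph(string); would return the first paragraph only. Collecting every paragraph up to the #toc element would need a loop over the matched nodes.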