I'm doing multiple levels of parsing of web pages, where I use information from one page to drill down and grab a "lower" page to parse. When I get to the lowest level of my hierarchy, I no longer fetch a new page; I basically hit the same one over and over (with different parameters) and write SQL database entries.
If I don't slow things down (by putting a sleep(1) before that inner loop), initWithContentsOfURL eventually starts returning a kind of stub HTML page. Here's the code I use to get my HTML nodes:
NSError *err = nil;

// Percent-escape the raw URL string before building the NSURL from it.
NSString *webStringURL = [sURL stringByAddingPercentEscapesUsingEncoding: NSUTF8StringEncoding];

// Synchronous fetch of the page body (note that err is never checked afterwards).
NSData *contentData = [[[NSData alloc] initWithContentsOfURL: [NSURL URLWithString: webStringURL]
                                                     options: 0
                                                       error: &err] autorelease];

// The pages come down as Latin-1, so round-trip through NSString to hand Hpple UTF-8.
NSString *dataString = [[[NSString alloc] initWithData: contentData
                                              encoding: NSISOLatin1StringEncoding] autorelease];
NSData *data = [dataString dataUsingEncoding: NSUTF8StringEncoding];

TFHpple *xPathDoc = [[[TFHpple alloc] initWithHTMLData: data] autorelease];
It works fine with four levels of looping. In fact, it can run 24/7 with no real memory-leak problems, and it only dies when I have a connection issue. That is, as long as I put the sleep(1) in before the inner-most loop.
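Here's roughly where that pause sits relative to the loops (the loop structure and helper names are illustrative, not my exact code):

for (NSString *detailURL in detailPageURLs) {     // one of the upper levels (illustrative)
    TFHpple *detailDoc = [self hppleDocForURLString: detailURL];      // wraps the fetch shown above
    NSArray *parameterSets = [self parameterSetsFromDoc: detailDoc];  // hypothetical helper

    sleep(1);  // without this pause, the fetches below start returning stub HTML

    for (NSDictionary *params in parameterSets) { // inner-most loop: same page, new parameters
        NSString *sURL = [self urlStringWithParameters: params];      // hypothetical helper
        TFHpple *xPathDoc = [self hppleDocForURLString: sURL];
        // ... pull out the nodes and write the SQL rows ...
    }
}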
It's like the loop is too fast and initWithContentsOfURL can't keep up. I suppose I could try to do something asynchronous, but this is not for user consumption, and the direct synchronous looping works just fine... almost. I've tried different ways of slowing things down; pausing for one second on a regular basis works, but if I take that out, it starts getting bogus data after about 10 times through the inner loop. Is there a way to handle this properly?
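For what it's worth, the only alternative I've thought of short of going asynchronous is to detect the stub response and retry with a short backoff, something like this (untested sketch; isLikelyStubHTML: is a hypothetical check I'd still have to write, e.g. on length or a marker string):

NSData *contentData = nil;
int attempts = 0;
do {
    if (attempts > 0) sleep(attempts);   // back off a little longer on each retry
    NSError *err = nil;
    contentData = [NSData dataWithContentsOfURL: [NSURL URLWithString: webStringURL]
                                        options: 0
                                          error: &err];
    attempts++;
} while (attempts < 5 && (contentData == nil || [self isLikelyStubHTML: contentData]));

But that just papers over whatever is actually going wrong, which is why I'm asking.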