Hi.
I'm in the process of doing the high-level design for a targeted crawler/parser. The app will be used to extract data from specific websites. Furthermore, it's being designed to run in a master/slave setup, where the master/server side prepares the packets (pages) to be parsed, and the child nodes (client servers) in the system fetch batches of those packets to parse. (XPath is used in the parsing process to extract the data from each page of the parsed site.)
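To make that a bit more concrete, here's roughly what each child node does with a batch it pulls from the master. This is just a sketch; the field names and XPath expressions are placeholders, not the real ones:

```python
# Rough sketch of what a worker/child node does with a batch from the master.
# Field names and XPath expressions here are made-up placeholders.
import requests
from lxml import html

def parse_page(url):
    resp = requests.get(url, timeout=30)
    tree = html.fromstring(resp.content)
    # Example: pull a couple of fields out of the page via XPath.
    return {
        "url": url,
        "title": tree.xpath("string(//h1[@class='job-title'])").strip(),
        "company": tree.xpath("string(//span[@class='company'])").strip(),
    }

def process_batch(batch_of_urls):
    # Each client node fetches its batch from the master and parses it.
    return [parse_page(url) for url in batch_of_urls]
```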
I'm posting here because I'm wondering about efficient ways to ensure that the data the app fetches is correct. I'm considering a process where I do at least two runs across the targeted sites; if the results differ, I do a third run and use whichever two runs match, throwing an error if the app gets a different result for all 3 runs...
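In pseudocode terms, the "2 out of 3 runs must agree" check I have in mind looks something like the sketch below (assuming each run produces a comparable result object per page):

```python
# Sketch of the reconciliation logic: two runs that agree win, otherwise a
# third run breaks the tie, otherwise it's an error.
class NeedThirdRun(Exception):
    pass

class AllRunsDisagree(Exception):
    pass

def reconcile(run_a, run_b, run_c=None):
    if run_a == run_b:
        return run_a                # first two runs agree, done
    if run_c is None:
        raise NeedThirdRun()        # caller should schedule the extra pass
    if run_c == run_a:
        return run_a
    if run_c == run_b:
        return run_b
    raise AllRunsDisagree()         # all three differ -> flag an error
```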
However, this gets really inefficient, and seriously cranks up the bandwidth/processing...
The reason I believe I need to do multiple runs is that the underlying data/site will change from day to day. But I want to be able to "stop" the subsequent run as soon as possible, if the app can determine that the underlying data on the page hasn't changed.
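One idea I've been toying with for that "has it changed?" short-circuit is to lean on HTTP conditional requests (ETag / If-None-Match) where the target site supports them, and fall back to hashing the response body otherwise. Rough sketch below; where the previous ETag/hash come from (some store the master keeps) is assumed, not designed yet:

```python
# Sketch of a "skip if unchanged" fetch: use a conditional GET when possible,
# fall back to comparing a hash of the body against the previous run's hash.
import hashlib
import requests

def fetch_if_changed(url, prev_etag=None, prev_hash=None):
    headers = {}
    if prev_etag:
        headers["If-None-Match"] = prev_etag      # let the server answer "not modified"
    resp = requests.get(url, headers=headers, timeout=30)
    if resp.status_code == 304:
        return None                               # unchanged, skip re-parsing
    body_hash = hashlib.sha256(resp.content).hexdigest()
    if body_hash == prev_hash:
        return None                               # same bytes as last run, skip
    return resp.content, resp.headers.get("ETag"), body_hash
```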
So.. I'm basically asking if anyone has pointers to any kind of docs/articles/thoughts/etc. on how this issue can be (or has been) solved. I'm thinking there are people/apps who've already solved this. IE, a site like SimplyHired/Indeed, which needs to scrape underlying job sites and ensure that the data it gets is correct, has probably dealt with this kind of thing...
Hope this all makes sense! (I've got more, but tried to keep it short here..)
Thanks
Tom