Hi.

A number of sites do aggregation (indeed.com, simplyhired.com, expedia.com...). I'm trying to figure out a good/efficient way of determining that the data I get from parsing a page is valid. In particular, if I parse a page multiple times (say, once a day), how do I 'know' that the data I get on any given pass is valid?

I'm considering an approach where a child process does two separate parse passes over the target page. Each pass would run on a separate box (or on the same box at different times). The process would then compare the results: if they match, the app uses them; if they differ, the app repeats the process.

For my situation, I can't just take a snapshot of the HTML and compare, because the time/datestamp/etc. might change between fetches. So I do a data fetch, and the parse/crawl app compares the resulting extracted data instead.
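To make that concrete, here's a minimal sketch (the page markup, `page_html`, and `extract_jobs` are all hypothetical stand-ins, not anything from a real site) showing why a raw-HTML comparison fails when the page embeds a changing value, while comparing only the extracted fields succeeds:

```python
import itertools
import re

_serial = itertools.count()

def page_html():
    # Simulated page: stable job listings plus a value that changes
    # on every fetch (standing in for a timestamp/datestamp).
    return f"<p>rendered #{next(_serial)}</p><li>job A</li><li>job B</li>"

def extract_jobs(html):
    # Pull out only the stable fields we care about.
    return re.findall(r"<li>(.*?)</li>", html)

a, b = page_html(), page_html()
print(a == b)                          # False: raw HTML differs every fetch
print(extract_jobs(a) == extract_jobs(b))  # True: extracted data is identical
```

The comparison is only as good as the extractor: if `extract_jobs` accidentally captures a volatile field, the passes will never agree.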

If the results are the same, the app assumes the parse is valid, and the results become the valid data in the main data trunk of the aggregation process. The bulk of this process could/would occur on the client side, which would reduce the amount of data being transferred between client and server.

I've been searching for tech papers/abstracts/etc. on this kind of processing, with no real luck.

Thoughts/comments are appreciated.

Or, if someone has a pointer to someone who actually works as an architect at one of the companies that does this kind of thing, that I could talk to, that would be solid!

thanks

tom..