Hi.

I'm in the process of doing high-level design for a targeted crawler/parser. The app will be used to extract data from specific websites. Furthermore, the app is designed to run in a master/slave arrangement, where the master/server side prepares the batches of pages (packets) to be parsed, and the child nodes (client servers) fetch those batches and do the parsing. (XPath is used in the parsing process to extract the data from each page of the parsed site.)

I'm posting here because I'm wondering about efficient ways to ensure that the data the app fetches is correct. I'm considering a process where I do at least two runs across the targeted sites; if the results differ, I do a third run and use whichever two runs match, throwing an error if the app gets a different result on all three runs.
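
In rough Python, the voting idea would look something like this (just a sketch; fetch_and_parse is a stand-in for my fetch-plus-XPath pipeline, and I'm assuming its results compare directly):

    def fetch_with_consensus(url, fetch_and_parse):
        # Two runs that agree are accepted as correct.
        first = fetch_and_parse(url)
        second = fetch_and_parse(url)
        if first == second:
            return first
        # Otherwise a third run breaks the tie, if it matches either earlier run.
        third = fetch_and_parse(url)
        if third == first or third == second:
            return third
        raise ValueError("all three runs of %s disagree" % url)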

However, this gets really inefficient and seriously cranks up the bandwidth/processing costs...

The reason I believe I need multiple runs is that the underlying data/site will change from day to day. But I want to be able to stop a subsequent run as soon as possible if the app can determine that the underlying data on the page hasn't changed.
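
The short-circuit I have in mind, sketched in Python (seen_digests stands in for the master's real cache, and body is the raw response bytes; I realize per-request dynamic markup would defeat a naive byte comparison):

    import hashlib

    seen_digests = {}  # url -> digest of the last fetched body

    def page_changed(url, body):
        digest = hashlib.sha1(body).hexdigest()
        if seen_digests.get(url) == digest:
            return False   # same bytes as last run: reuse the cached parse
        seen_digests[url] = digest
        return True        # new or changed page: parse it again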

So I'm basically asking whether anyone has pointers to docs/articles/thoughts on how this issue can be (or has been) solved. I'm thinking there are people/apps who've solved this already; e.g., sites like SimplyHired/Indeed, which have to scrape underlying job sites and ensure the data they get is correct, must have dealt with this kind of thing.

Hope this all makes sense! (I've got more, but tried to keep it short here..)

Thanks

Tom

+1  A: 

I don't see the point in doing multiple runs for the same site.

TCP/IP GUARANTEES correct transfer of the data. If there is an error, you will get it from your TCP/IP stack, and then retrying makes sense. But if the server itself sends wrong data, there is no real hope that calling it three times will somehow improve the situation.
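
For example (a minimal sketch; the names and backoff policy are just illustrative), retrying only on transport-level failures might look like:

    import time
    import urllib.error
    import urllib.request

    def fetch_with_retry(url, attempts=3, delay=2.0):
        for attempt in range(attempts):
            try:
                with urllib.request.urlopen(url) as resp:
                    return resp.read()
            except urllib.error.HTTPError:
                raise                              # the server answered; retrying won't help
            except urllib.error.URLError:
                if attempt == attempts - 1:
                    raise                          # out of retries: give up
                time.sleep(delay * (attempt + 1))  # simple linear backoff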

Also, most sites may be dynamic, so it is virtually impossible that you get exactly the same result twice.

Foxfire
Hi... the rationale for multiple runs: I'm parsing college class schedules, going from registrar, to school, to department, to class. Any/all of these pages could be changed by the college, so the app needs to continually reparse. However, if I determine that a page is the same as the already cached/fetched version, the app can use that cached version of the page to generate the subsequent packets for parsing the rest of the site.
tom smith
I was replying to the "I'm posting here because I'm wondering about efficient ways to ensure that the data the app fetches is correct. I'm considering a process where I do at least two runs across the targeted sites; if the results differ, I do a third run and use whichever two runs match, throwing an error if the app gets a different result on all three runs." part. It should be enough to just use the latest result.
Foxfire
A: 

Well, the first step is to rely on the HTTP caching headers (Last-Modified/ETag). Those tell you whether the page has changed at all.

Not all sites are cache friendly, but many are.
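
For example, a conditional GET with If-None-Match might look like this (a sketch; the etags dict stands in for whatever persistent state you keep between runs, and If-Modified-Since/Last-Modified works the same way):

    import urllib.error
    import urllib.request

    etags = {}  # url -> ETag from the last successful fetch

    def fetch_if_changed(url):
        request = urllib.request.Request(url)
        if url in etags:
            request.add_header("If-None-Match", etags[url])
        try:
            with urllib.request.urlopen(request) as resp:
                etag = resp.headers.get("ETag")
                if etag:
                    etags[url] = etag
                return resp.read()   # 200 OK: the page is new or has changed
        except urllib.error.HTTPError as err:
            if err.code == 304:
                return None          # 304 Not Modified: reuse the cached copy
            raise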

Once past that, you're kind of out of luck, as you need to parse the page just to get the data and see whether it has changed. You can skip any post-processing at that point, but you still have to eat the fetching and parsing phases, which are likely the costliest parts anyway.

Will Hartung
A: 

Why build yet another crawler? There are plenty of very good implementations that have already worked out how:

  • not to overload servers and get you banned
  • to retry according to the different failure modes
  • to maximize bandwidth
  • to avoid infinite loops in the fetching
  • and a lot of other considerations (see the sketch below for a taste of why this is non-trivial)
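
Here is a toy sketch of just two items from that list, per-host politeness delays and a visited set (illustrative only; real crawlers layer much more on top, e.g. robots.txt handling, retry budgets, and queue prioritization):

    import time
    from urllib.parse import urlparse

    visited = set()          # URLs already crawled, to avoid refetch loops
    last_hit = {}            # host -> timestamp of the last request to it
    POLITENESS_DELAY = 2.0   # seconds to wait between requests to one host

    def should_fetch(url):
        if url in visited:
            return False
        host = urlparse(url).netloc
        wait = POLITENESS_DELAY - (time.time() - last_hit.get(host, 0.0))
        if wait > 0:
            time.sleep(wait)             # throttle the per-host request rate
        last_hit[host] = time.time()
        visited.add(url)
        return True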

You can integrate your software with these existing crawlers and be happy. Or, if you want to do more work, you can probably embed them into your app (it may be harder than it looks; great crawlers are very complex beasts).

Some of these are:

Vinko Vrsalovic