Detecting CacheBuster querystrings when crawling a page

I've put together a fairly simple crawling engine that works quite well and for the most part avoids getting stuck in circular loop traps. (Ie, Page A links to Page B and Page B links to Page A).

The only time it gets stuck in this loop is when both pages link to each other with a cachebuster querystring, basically it is a unique querystring on each and every link per refresh.

This causes the pages to always look like new pages to the crawler, and the crawler gets stuck moving between the two pages.

Aside from just breaking out after N amount of bounces between two pages with the only difference being the querystring (Which I don't think is a very good approach), is there any other way to detect and break out of these traps...?

Maybe they are just session ids, not "cache busters" --- cookies will help

A few years ago, I had to deal with a similar problem and we found a simple solution: enable cookies in your web client. Here is an explanation why this helped us:

It is true that some URLs (in particular advertisement images) are intended to "bust caches". However, we did not find circular references with them.

Many URLs add a random-looking string to URLs in order to identify visitors. There is no intend to bust caches --- this is just a side effect of their method to get a unique "session identifier" for each visitor.

Of course, it is much better to identify visitors with cookies. In fact, most of the dynamic sites that use session identifiers first try cookies. Only when they find that the web client doesn't support them, the site falls back to adding the session ids to URLs.

Bottom line:

By enabling cookies, we can keep most dynamic sites happy and avoid random strings (session identifiers) in URLs.
Advertisers do use cache busters --- but mostly without circular references.

For us, this solved the problem.

ansaurus

tags:

views:

answers:

Detecting CacheBuster querystrings when crawling a page

related questions