For a search bot, I am working on a design to:
* compare URIs and
* determine which URIs actually refer to the same page
I need to deal with redirects and aliases (see the normalization sketch after this list):
Case 1: Redirects
Case 2: Aliases e.g. www
Case 3: URL parameters and fragments, e.g. sukshma.net/node#parameter (the part after # is a fragment, which never reaches the server)
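As a rough illustration of cases #2 and #3, here is a minimal normalization sketch (I am assuming a Python-based bot here; `urlsplit`/`urlunsplit` are from the standard library, and `normalize` is just a hypothetical helper name). Note that the www-stripping rule is exactly the hard-coded alias heuristic I describe as brittle below:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    """Normalize a URL for duplicate detection (cases #2 and #3).

    Lowercases the host, drops the fragment, and strips a leading
    "www." -- the last rule is a hard-coded heuristic, not spec-backed.
    """
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]  # brittle hard-coded alias rule
    # Passing "" as the last element drops the #fragment, which is
    # client-side only and never reaches the server.
    return urlunsplit((parts.scheme.lower(), host, parts.path or "/",
                       parts.query, ""))

# normalize("http://www.sukshma.net/node#parameter")
#   -> "http://sukshma.net/node"
```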
I have two approaches I could follow. One is to explicitly check for redirects, which catches case #1. The other is to "hard-code" aliases such as www, which handles case #2. The second approach (hard-coding aliases) is brittle: neither the HTTP specification (RFC 2616) nor the generic URI syntax (RFC 3986) treats www as an alias; it is only a naming convention.
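For the redirect-checking approach, explicitly resolving a URL to its final destination could look something like the sketch below. This assumes the third-party `requests` library is available, and `resolve_redirects` is a hypothetical helper name:

```python
import requests  # third-party library, assumed available

def resolve_redirects(url: str, timeout: float = 10.0) -> str:
    """Follow HTTP redirects and return the final URL (case #1)."""
    # A HEAD request keeps this cheap; some servers mishandle HEAD,
    # so a production bot would fall back to GET on failure.
    resp = requests.head(url, allow_redirects=True, timeout=timeout)
    return resp.url

# resolve_redirects("http://example.com/old") would return the URL at
# the end of any 301/302 chain, which can then be normalized and
# compared against already-seen URLs.
```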
I also intend to use the canonical link element (rel="canonical", available both as an HTML &lt;link&gt; tag and as an HTTP Link header), but if I understand it correctly, I cannot rely on it being present in all cases.
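When the tag is present, extracting it needs nothing beyond the Python standard library. A minimal sketch, where `CanonicalFinder` and `find_canonical` are hypothetical names of my own:

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Pull the href out of the first <link rel="canonical" ...> tag."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag == "link" and self.canonical is None:
            d = dict(attrs)
            if (d.get("rel") or "").lower() == "canonical":
                self.canonical = d.get("href")

def find_canonical(html: str):
    parser = CanonicalFinder()
    parser.feed(html)
    return parser.canonical  # None when the page omits the tag

# find_canonical('<link rel="canonical" href="http://sukshma.net/node">')
#   -> "http://sukshma.net/node"
```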
Do share your own experience. Do you know of a white paper or reference implementation for detecting duplicates in search bots?