For a search bot, I am working on a design to:
* compare URIs and
* determine which URIs are really the same page

Dealing with redirects and aliases:
Case 1: Redirects
Case 2: Aliases, e.g. www
Case 3: URL parameters and fragments, e.g. sukshma.net/node#parameter

I have two approaches I could follow. One is to explicitly check for redirects, which catches case #1. The other is to "hard code" aliases such as www, which works for case #2. The second approach (hard-coding aliases) is brittle: the HTTP specification (RFC 2616) does not define www as an alias.

I also intend to use the Canonical Meta-tag (HTTP/HTML), but if I understand it correctly, I cannot rely on the tag being present in all cases.
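For reference, this is roughly how I picture reading the tag when it is present (a Python sketch on my side; the parser class and sample markup below are only illustrative):

```python
from html.parser import HTMLParser

class CanonicalLinkParser(HTMLParser):
    """Collects the href of <link rel="canonical"> if the page declares one."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag == "link" and self.canonical is None:
            attrs = dict(attrs)
            if (attrs.get("rel") or "").lower() == "canonical":
                self.canonical = attrs.get("href")

parser = CanonicalLinkParser()
parser.feed('<head><link rel="canonical" href="http://sukshma.net/node"/></head>')
print(parser.canonical)  # http://sukshma.net/node -- stays None when the tag is absent
```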

Do share your own experience. Do you know of a reference white paper or implementation for detecting duplicates in search bots?

A: 

Building your own web crawler is a lot of work. Consider checking out some of the open source spiders already available, like JSpider, OpenWebSpider or many others.

cxfx
I get where you are going; however, I need the technology and know-how for duplicate detection (and not just for crawls). Do you know whether these projects have solved that successfully?
Santosh
Despite my own advice, I've built my own crawler and stored a checksum for every crawled page. If a page was a potential duplicate of another, based on its URL or other criteria, I compared the checksums to confirm.
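Roughly along these lines (a simplified sketch, not my actual code; the hash choice and the in-memory index are just for illustration):

```python
import hashlib

# checksum -> first URL that produced that content (in-memory for the sketch)
seen = {}

def page_checksum(body: bytes) -> str:
    """Digest of the raw response body; identical bodies give identical digests."""
    return hashlib.sha1(body).hexdigest()

def is_duplicate(url: str, body: bytes) -> bool:
    """True if another URL already yielded byte-identical content."""
    digest = page_checksum(body)
    if digest in seen:
        return True
    seen[digest] = url
    return False
```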
cxfx
A: 

The first case would be solved by simply checking the HTTP status code.
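Something like this, for instance (a sketch that assumes the requests library; any HTTP client that exposes the status code and Location header will do):

```python
import requests

def resolve(url: str):
    """Return (status_code, target): for 3xx responses the target is the
    Location header, otherwise the URL itself."""
    resp = requests.get(url, allow_redirects=False, timeout=10)
    if 300 <= resp.status_code < 400:
        return resp.status_code, resp.headers.get("Location")
    return resp.status_code, url

# Map any 3xx result to its Location target before treating two URLs as distinct.
```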

For the 2nd and 3rd cases, Wikipedia explains it very well: URL Normalization / Canonicalization.
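For example, a few of the "safe" normalizations from that article with Python's urllib.parse (just a sketch; whether you also strip www or honour the canonical tag is a policy decision on your side):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize(url: str) -> str:
    """Lowercase scheme/host, drop default ports, drop the fragment, sort the query."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    if parts.port and (scheme, parts.port) not in (("http", 80), ("https", 443)):
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunsplit((scheme, host, path, query, ""))  # "" drops the fragment

print(normalize("HTTP://WWW.Sukshma.net:80/node?b=2&a=1#parameter"))
# http://www.sukshma.net/node?a=1&b=2
```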

Alix Axel