For a search bot, I am working on a design to:
* compare URIs and
* determine which URIs actually refer to the same page
I need to deal with redirects and aliases (see the normalization sketch after this list):
Case 1: Redirects
Case 2: Aliases e.g. www
Case 3: URL parameters and fragments, e.g. sukshma.net/node#parameter (the part after # is a fragment, which never reaches the server)
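As a rough illustration of cases #2 and #3, here is a minimal normalization sketch (I am assuming a Python-based bot here; `urlsplit`/`urlunsplit` are from the standard library, and `normalize` is just a hypothetical helper name). Note that the www-stripping rule is exactly the hard-coded alias heuristic I describe as brittle below:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    """Normalize a URL for duplicate detection (cases #2 and #3).

    Lowercases the host, drops the fragment, and strips a leading
    "www." -- the last rule is a hard-coded heuristic, not spec-backed.
    """
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]  # brittle hard-coded alias rule
    # Passing "" as the last element drops the #fragment, which is
    # client-side only and never reaches the server.
    return urlunsplit((parts.scheme.lower(), host, parts.path or "/",
                       parts.query, ""))

# normalize("http://www.sukshma.net/node#parameter")
#   -> "http://sukshma.net/node"
```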
I have two approaches I could follow. One is to explicitly check for redirects, which catches case #1. The other is to "hard-code" aliases such as www, which handles case #2. The second approach (hard-coding aliases) is brittle: neither the HTTP specification (RFC 2616) nor the generic URI syntax (RFC 3986) treats www as an alias; it is only a naming convention.
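For the redirect-checking approach, explicitly resolving a URL to its final destination could look something like the sketch below. This assumes the third-party `requests` library is available, and `resolve_redirects` is a hypothetical helper name:

```python
import requests  # third-party library, assumed available

def resolve_redirects(url: str, timeout: float = 10.0) -> str:
    """Follow HTTP redirects and return the final URL (case #1)."""
    # A HEAD request keeps this cheap; some servers mishandle HEAD,
    # so a production bot would fall back to GET on failure.
    resp = requests.head(url, allow_redirects=True, timeout=timeout)
    return resp.url

# resolve_redirects("http://example.com/old") would return the URL at
# the end of any 301/302 chain, which can then be normalized and
# compared against already-seen URLs.
```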
I also intend to use the canonical link element (rel="canonical", available both as an HTML &lt;link&gt; tag and as an HTTP Link header), but if I understand it correctly, I cannot rely on it being present in all cases.
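When the tag is present, extracting it needs nothing beyond the Python standard library. A minimal sketch, where `CanonicalFinder` and `find_canonical` are hypothetical names of my own:

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Pull the href out of the first <link rel="canonical" ...> tag."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag == "link" and self.canonical is None:
            d = dict(attrs)
            if (d.get("rel") or "").lower() == "canonical":
                self.canonical = d.get("href")

def find_canonical(html: str):
    parser = CanonicalFinder()
    parser.feed(html)
    return parser.canonical  # None when the page omits the tag

# find_canonical('<link rel="canonical" href="http://sukshma.net/node">')
#   -> "http://sukshma.net/node"
```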
Do share your own experience. Do you know of a white paper or reference implementation for detecting duplicates in search bots?