First off, I'm doing this for a web crawler (aka spider aka worm...)

Given two strings (base url and relative url), I need to determine the absolute url. It is especially confusing when it comes to "SEO friendly" crap, such as:

Base url: http://aaa.com/january/15/test Found url: /test.php?aaa

How would I know whether the path segments above are folders or not? E.g., would the absolute URL be:

http://aaa.com/january/15/test/test.php?aaa

Or:

http://aaa.com/january/15/test.php?aaa

?

The confusion stems from whether there is an index in action or not. "/test/index.php" or "/index.php"?

+1  A: 

You can't solve this problem by examining only the URL.

You say you need the absolute URL given a base URL and relative URL. The full URL comes from resolving the relative URL against the base URL. As you've seen, knowing this doesn't tell you which resource the result actually refers to.

http://example.com/directory/index.php and http://example.com/directory/ can legitimately refer to two different resources.

http://example.com/directory/index.php and http://example.com/directory/foo/bar/baz.php can legitimately refer to the same ultimate resource.

In the second example above, which is the canonical URL? This cannot necessarily be determined computationally. The canonical URL is the one you choose to be the canonical URL.

You're actually facing two problems here:

  1. When do two different URLs refer to the same resource?
  2. Which URL is the canonical URL?

1. When do two different URLs refer to the same resource?

This can't be determined by comparing URLs alone. It can only be determined by comparing the resources themselves, i.e. the content and the HTTP headers.

ETag - http://en.wikipedia.org/wiki/HTTP_ETag

In short, an ETag is an HTTP response header carrying an opaque identifier for a specific version of a resource. It is intended for cache validation, i.e. answering the question: is the content I have in my cache the same as the content at http://example.com/content?

Two identical resources, at least from the same host, will often have the same ETag header value, though this is not guaranteed. Use this if possible (not all web servers return an ETag header).
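As a rough illustration of the ETag check, here is a minimal Python sketch (Python rather than PHP, purely for illustration). The function names `opaque_tag` and `etags_match` are made up for this example, and it assumes you have already fetched the `ETag` header values yourself. Treat a match as a hint that two URLs serve the same thing, not as proof.

```python
from typing import Optional


def opaque_tag(etag: Optional[str]) -> Optional[str]:
    """Strip surrounding whitespace and the weak-validator prefix 'W/'."""
    if not etag:
        return None
    etag = etag.strip()
    if etag.startswith("W/"):
        etag = etag[2:]
    return etag or None


def etags_match(etag_a: Optional[str], etag_b: Optional[str]) -> bool:
    """Weak comparison: True only if both headers exist and the tags agree."""
    a, b = opaque_tag(etag_a), opaque_tag(etag_b)
    return a is not None and a == b
```

A missing header on either side yields `False`, so the caller can fall back to comparing headers and content.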

HTTP header and content comparison

When are two resources identical? When the content type and content are the same.

Compare the content type using the Content-Type header. Comparing the content itself is a simple case of string comparison.

If you're storing properties of previously-found resources and comparing these to newly-found resources, you don't need to keep the full text of each resource for the purposes of comparison; a hash will do.

As far as PHP is concerned, the HTTP extension will give you all you need with a very convenient OO API for examining the HTTP headers and full content of a resource. The md5() function is one option for generating a hash of the content. There are others.
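The content-type-plus-hash comparison could look something like the sketch below. It is Python rather than PHP and uses SHA-256 instead of md5(); both are substitutions made for this example. The names `media_type`, `fingerprint` and `first_seen_url` are hypothetical, and the sketch assumes the crawler has already downloaded the `Content-Type` header and the body.

```python
import hashlib
from typing import Dict, Tuple

Fingerprint = Tuple[str, str]


def media_type(content_type: str) -> str:
    """Reduce 'text/html; charset=utf-8' to 'text/html' for comparison."""
    return content_type.split(";", 1)[0].strip().lower()


def fingerprint(content_type: str, body: bytes) -> Fingerprint:
    """Identify a resource by its media type and a hash of its body."""
    return (media_type(content_type), hashlib.sha256(body).hexdigest())


def first_seen_url(seen: Dict[Fingerprint, str], url: str,
                   content_type: str, body: bytes) -> str:
    """Return the first URL recorded with this fingerprint (or url itself)."""
    return seen.setdefault(fingerprint(content_type, body), url)
```

If `first_seen_url` returns something other than the URL you passed in, the crawler has already stored an identical resource under that other URL.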

2. Which URL is the canonical URL?

Pick one and stick with it. By default one URL is no more canonical than another for the same resource. For simplicity, you might consider the shortest of two URLs to be the canonical form.
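If you go with the shortest-URL rule, a sketch might be as small as this (Python, with a hypothetical `pick_canonical` helper). It assumes the URLs passed in have already been confirmed to refer to the same resource; ties on length are broken alphabetically so the choice is stable between runs.

```python
from typing import Iterable


def pick_canonical(urls: Iterable[str]) -> str:
    """Choose the shortest URL; break ties alphabetically."""
    return min(urls, key=lambda u: (len(u), u))
```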

Jon Cram
Very helpful post, however, how do I concatenate 2 parts correctly without generating a lot of 404s first?
Christian Sciberras
@Christian: I see, I didn't quite get your question. The found URL starts with a slash, so it is resolved against the host root rather than the current path. Base url: http://aaa.com/january/15/test Found url: /test.php?aaa => Absolute url: http://aaa.com/test.php?aaa
Jon Cram
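For what it's worth, the resolution in the last comment is the standard reference-resolution behaviour described in RFC 3986, and most languages ship a helper for it. Below is a small Python sketch using `urllib.parse.urljoin` from the standard library; the `resolve` wrapper name is made up for this example. It needs no knowledge of whether "test" is a folder or a file: only the trailing slash on the base URL and the leading slash on the found URL matter.

```python
from urllib.parse import urljoin


def resolve(base_url: str, found_url: str) -> str:
    """Resolve a link found on a page against that page's URL."""
    return urljoin(base_url, found_url)


base = "http://aaa.com/january/15/test"

# Leading slash: relative to the host root.
print(resolve(base, "/test.php?aaa"))       # http://aaa.com/test.php?aaa

# No leading slash: replaces the last path segment of the base.
print(resolve(base, "test.php?aaa"))        # http://aaa.com/january/15/test.php?aaa

# Base ends with a slash: the link is appended under it.
print(resolve(base + "/", "test.php?aaa"))  # http://aaa.com/january/15/test/test.php?aaa
```

Since no guessing is involved, this should not produce extra 404s beyond links that are genuinely broken on the page.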