You can't solve this problem by examining only the URL.
You say you need the absolute URL given a base URL and a relative URL. Resolving the relative URL against the base URL gives you that absolute URL, but as you've seen, knowing this doesn't help one bit.
http://example.com/directory/index.php
and
http://example.com/directory/
can legitimately refer to two different resources.
http://example.com/directory/index.php
and
http://example.com/directory/foo/bar/baz.php
can legitimately refer to the same ultimate resource.
In the second example above, which is the canonical URL? This is not necessarily something that can be determined computationally. The canonical URL is the one you choose to be the canonical URL.
You're actually facing two problems here:
- When do two different URLs refer to the same resource?
- Which URL is the canonical URL?
1. When do two different URLs refer to the same resource?
This can't be determined by comparing the URLs alone. It can only be determined by comparing the resources themselves, i.e. the content and the HTTP headers.
ETag - http://en.wikipedia.org/wiki/HTTP_ETag
In short, the ETag is an HTTP response header whose value identifies a specific version of a resource. Its intent is cache validation, i.e. is the content I have in my cache the same as the content at http://example.com/content?
Two identical resources, at least from the same host, will often have the same ETag header value, although this depends on how the server generates its ETags. Use this if possible (not all web servers return an ETag header).
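As a rough sketch of the ETag check, assuming you already have each response's headers as an associative array (the shape that get_headers($url, true) returns): the function names extractEtag() and sameEtag() are made up for illustration, and dropping the weak-validator prefix (W/) before comparing is my own choice, not something your server requires.

```php
<?php
// Hypothetical helper: pulls the ETag out of a header array shaped like the
// result of get_headers($url, true). Returns null when there is no ETag.
function extractEtag(array $headers): ?string
{
    foreach ($headers as $name => $value) {
        // Header names are case-insensitive; index 0 holds the status line.
        if (is_string($name) && strcasecmp($name, 'ETag') === 0) {
            // After redirects the value may be an array of ETags;
            // the last one belongs to the final response.
            $etag = is_array($value) ? end($value) : $value;
            // Drop the weak-validator prefix so W/"abc" matches "abc".
            return preg_replace('/^W\//', '', trim((string) $etag));
        }
    }
    return null;
}

// Returns true or false when both ETags are known, and null when at least
// one response carries no ETag, meaning the question cannot be decided here.
function sameEtag(array $headersA, array $headersB): ?bool
{
    $a = extractEtag($headersA);
    $b = extractEtag($headersB);
    if ($a === null || $b === null) {
        return null; // fall back to comparing headers and content
    }
    return $a === $b;
}
```

You would call it as sameEtag(get_headers($urlA, true), get_headers($urlB, true)). A null result means you have to fall back to comparing the content, and even a true result is only a strong hint, given that ETag generation is server-specific.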
HTTP header and content comparison
When are two resources identical? When the content type and content are the same.
Compare the content type using the Content-Type header. Comparing the content itself is a simple case of string comparison.
If you're storing properties of previously-found resources and comparing these to newly-found resources, you don't need to keep the full text of each resource for the purposes of comparison; a hash will do.
As far as PHP is concerned, the HTTP extension (pecl_http) will give you all you need, with a very convenient OO API for examining the HTTP headers and full content of a resource. The md5() function is one option for generating a hash that is unique enough for this purpose. There are others, such as sha1() or hash('sha256', ...).
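To make the comparison concrete, here is a minimal sketch that assumes you have already fetched the Content-Type header value and the body of each resource as strings. The names resourceFingerprint() and sameResource() are hypothetical, and ignoring Content-Type parameters such as charset is an assumption you may not want to make.

```php
<?php
// Hypothetical helper: reduces a response to a short fingerprint built from
// its media type and an MD5 hash of its body, suitable for storing.
function resourceFingerprint(string $contentType, string $body): string
{
    // Keep only the media type, so "text/html; charset=UTF-8" and
    // "text/html" are treated alike (an assumption; keep the parameters
    // if charset differences matter to you).
    $mediaType = strtolower(trim(explode(';', $contentType)[0]));
    return $mediaType . ':' . md5($body);
}

// Two resources are considered identical when their fingerprints match.
function sameResource(string $typeA, string $bodyA, string $typeB, string $bodyB): bool
{
    return resourceFingerprint($typeA, $bodyA) === resourceFingerprint($typeB, $bodyB);
}
```

Storing the fingerprint rather than the body keeps the record for each previously-found resource small, and a newly-found resource only needs one string comparison against each stored value.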
2. Which URL is the canonical URL?
Pick one and stick with it. By default one URL is no more canonical than another for the same resource. For simplicity, you might consider the shortest of two URLs to be the canonical form.
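If you go with the shortest-URL rule, it could look something like this. It assumes you have already established that every URL in the list refers to the same resource; pickCanonical() is a made-up name, and the alphabetical tie-break is my own addition so that the choice stays stable between runs.

```php
<?php
// Hypothetical helper: picks the canonical URL from a list of URLs that are
// already known to refer to the same resource. The shortest URL wins; ties
// are broken alphabetically so the result does not depend on input order.
function pickCanonical(array $urls): string
{
    if ($urls === []) {
        throw new InvalidArgumentException('Need at least one URL.');
    }
    usort($urls, function (string $a, string $b): int {
        // Compare by length first, then by the string itself.
        return [strlen($a), $a] <=> [strlen($b), $b];
    });
    return $urls[0];
}
```

Given http://example.com/directory/index.php and http://example.com/directory/foo/bar/baz.php, this picks the index.php URL. Whatever rule you choose, apply the same one every time.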