If I have URL A, say http://www.example.com/, and another, say http://www.example.com, what would be the safest way to determine whether the two are the same, without fetching the web pages and doing a diff?

EXAMPLES:

  1. http://www.example.com/ VS http://www.example.com (Mentioned above)
  2. http://www.example.com/aa/../ VS http://www.example.com

EDIT: Clarification: I just want to know whether the URLs are the same in the sense of being equivalent according to the RFC 1738 standard.

+12  A: 

In .NET, you can use the System.Uri class.

open System;;

let u1 = new Uri("http://www.google.com/");;

val u1 : Uri = http://www.google.com/

let u2 = new Uri("http://www.google.com");;

val u2 : Uri = http://www.google.com/

u1.Equals(u2);;

val it : bool = true

For a more fine-grained comparison, you can use the Uri.Compare method, which lets you choose which components of the URIs to compare and how. There are also static methods to deal with various forms of escaping and encoding of characters in the URI string, which will no doubt prove useful when dealing with the subject thoroughly.
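As a rough sketch of that finer-grained comparison, the C# snippet below contrasts Uri.Equals with Uri.Compare. The class name UriComparisonSketch and its method names are made up for illustration, and the choice of components (host, path, and query, compared without regard to case) is only one assumption about what "same" might mean.

```csharp
using System;

public static class UriComparisonSketch
{
    // Strict check: Uri.Equals compares the parsed, normalized URIs
    // (it ignores the fragment and user info).
    public static bool StrictEquals(string a, string b)
    {
        return new Uri(a).Equals(new Uri(b));
    }

    // Looser check (an assumption about what "same" means): compare only
    // host, path, and query, ignoring the scheme and letter case.
    public static bool SameHostAndPath(string a, string b)
    {
        return Uri.Compare(
            new Uri(a),
            new Uri(b),
            UriComponents.Host | UriComponents.PathAndQuery,
            UriFormat.SafeUnescaped,
            StringComparison.OrdinalIgnoreCase) == 0;
    }
}
```

With this sketch, StrictEquals should accept both examples from the question, since Uri collapses the ../ segment when it parses the string. A pair such as http://www.example.com/Page and https://www.example.com/page passes only SameHostAndPath. Ignoring case in the path is a loose choice, because many servers treat paths as case-sensitive.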

codekaizen
What language is that? IronPython, I guess?
Thomas Levesque
@Thomas Levesque: it looks more like an interactive F# session
dtb
Right, F# interactive... I was lazy and didn't want to create a C# project just to make sure the answer was correct. I thought it was clear enough and might even help get eyes a bit more accustomed to seeing F#.
codekaizen
@codekaizen LINQPad is your friend http://www.linqpad.net/
Graphain
+1  A: 

There is very little you can do without requesting the URL. But you can define several heuristics:

  1. Remove trailing slashes
  2. Consider .htm and .html the same
  3. Assume /base/ and /base/index.html are the same
  4. Remove query string parameters (maybe, maybe not, depends on your needs)
  5. Consider url.com and www.url.com the same.

It all depends on exactly what you mean by URLs being the "same".
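The heuristics listed above could be sketched in C# roughly as follows. UrlHeuristics, LooseKey, and LooselySame are hypothetical names, and every rule here is a guess about server behavior rather than anything a standard guarantees. This version also ignores the scheme, which is a further assumption.

```csharp
using System;

public static class UrlHeuristics
{
    // Builds a comparison key by applying the heuristics from the list.
    public static string LooseKey(string url, bool ignoreQuery = true)
    {
        var uri = new Uri(url);

        // Heuristic 5: treat url.com and www.url.com as the same host.
        string host = uri.Host.ToLowerInvariant();
        if (host.StartsWith("www.", StringComparison.Ordinal))
            host = host.Substring(4);

        string path = uri.AbsolutePath;

        // Heuristic 2: treat .htm and .html as the same extension.
        if (path.EndsWith(".htm", StringComparison.OrdinalIgnoreCase))
            path += "l";

        // Heuristic 3: treat /base/index.html as /base/.
        const string index = "/index.html";
        if (path.EndsWith(index, StringComparison.OrdinalIgnoreCase))
            path = path.Substring(0, path.Length - index.Length);

        // Heuristic 1: remove trailing slashes.
        path = path.TrimEnd('/');

        // Heuristic 4 (optional): drop the query string.
        string query = ignoreQuery ? "" : uri.Query;

        return host + path + query;
    }

    public static bool LooselySame(string a, string b, bool ignoreQuery = true)
    {
        return LooseKey(a, ignoreQuery) == LooseKey(b, ignoreQuery);
    }
}
```

Under these rules, http://www.example.com/base/ and http://example.com/base/index.htm produce the same key. Whether that is the answer you want depends on the definition of "same" you settle on.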

Yuval A
There are also issues with host names and more, like www.example.com versus example.com, or upper case versus lower case (which does not matter on Windows but does on other platforms), default documents, trailing slashes, etc. So my answer would be: it really depends on what "same" means, and if your concern is the content, then you need to request the pages. If you want to find duplicates, the IIS SEO Toolkit can help: http://www.iis.net/download/SEOToolkit
CarlosAg
None of these heuristics are correct. They may often prove true but in general different path components mean different web pages.
George Phillips
@George - again, it is imperative to define what exactly "same" URLs means. Of course, in different cases these heuristics might not be useful.
Yuval A
A: 

There are a few things to add to Yuval A's answer:

  • www.google.com and http://www.google.com may point to the same target
  • www.google.com and google.com point to the same page (but this is implemented with a redirect)
  • The URL may be encoded (see the HttpUtility.UrlEncode / UrlDecode methods)
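A minimal C# sketch of the first and third points might look like this. UrlInputSketch, Parse, and SameTarget are hypothetical names, the default of http for input without a scheme is an assumption, and the sketch uses Uri.UnescapeDataString in place of HttpUtility so that it needs no reference to System.Web.

```csharp
using System;

public static class UrlInputSketch
{
    // Assumption: input with no scheme (e.g. "www.google.com") means http.
    public static Uri Parse(string url)
    {
        if (!url.Contains("://"))
            url = "http://" + url;
        return new Uri(url);
    }

    // Compares scheme, host, port, and the percent-decoded path.
    // The query string and fragment are ignored in this sketch.
    public static bool SameTarget(string a, string b)
    {
        Uri u1 = Parse(a);
        Uri u2 = Parse(b);
        return u1.Scheme == u2.Scheme
            && u1.Port == u2.Port
            && string.Equals(u1.Host, u2.Host, StringComparison.OrdinalIgnoreCase)
            && Uri.UnescapeDataString(u1.AbsolutePath) == Uri.UnescapeDataString(u2.AbsolutePath);
    }
}
```

Decoding the whole path also turns %2F into a slash, which can make two distinct paths look alike, so treat that step with care. The redirect case (www.google.com versus google.com) cannot be detected this way: the two still compare as different unless you actually request them.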
STO
+1  A: 

For the benefit of those of you who don't know F#, here's a quick and dirty but complete C# console app that demonstrates the use of the Uri class to tell whether two URLs are the same. When you run this code, you should see two lines: "True", followed by "False":

using System;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            // The two URLs differ only by a trailing slash on an empty path: prints "True".
            Console.WriteLine(IsSameUrl("http://stackoverflow.com/", "http://stackoverflow.com"));
            // Different hosts: prints "False".
            Console.WriteLine(IsSameUrl("http://stackoverflow.com/", "http://codinghorror.com"));
            Console.ReadKey();
        }

        // Parses both strings as absolute URIs and compares the parsed forms.
        // Throws UriFormatException if either string is not a valid absolute URI.
        static bool IsSameUrl(string url1, string url2)
        {
            Uri u1 = new Uri(url1);
            Uri u2 = new Uri(url2);
            return u1.Equals(u2);
        }
    }
}
Joey deVilla