I am using PHP.

Given two URLs like http://soccernet.com and http://soccernet.espn.go.com/index?cc=4716, how can I tell that they are actually the same?

Also consider the case where the only difference is HTTPS, like https://gmail.com and http://gmail.com.

Please advise. I am struggling with regex because it is not very good at recognising cases such as the soccernet example.

I am open to any good ideas and am not limiting myself to regex.

Edit: Thanks for all the comments and answers below. What would be a good way to reach a level of certainty? What factors should I look at, and how do I go about it most efficiently?

+4  A: 

I really don't think this is possible, given your soccernet example, without actually comparing the output you get from each page.

chrissr
A: 

You cannot determine this in the general case. http://server1/page.aspx and http://server2/page.aspx could be the same page if server1 and server2 both map to the same IP address, or even just to the same server farm.

In fact, even if they were the same page, they could have completely different contents, if the page renders differently based on the URL used to request it.

John Saunders
In general though, like you said, you can't determine it. Simply having the same path and IP doesn't mean they are the same file, and similarly, having different IPs and paths doesn't mean they are different.
Matthew Scharley
+1  A: 

The only way is to download each page and compare them.

Really, this shouldn't be too much trouble, since your average HTML file is fairly small (normally well under 100 KB). You don't need to download all the referenced files.
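Since the asker wants a level of certainty rather than a yes/no, one possible way to compare two downloaded pages is a rough similarity score instead of strict equality. The sketch below is only an illustration and is not something proposed in this thread: `text_similarity()` is a hypothetical helper that strips scripts, styles and tags and then uses PHP's built-in `similar_text()`. That function can be slow on large inputs, and any cut-off you apply to the score is an assumption you would have to tune yourself.

```php
<?php
// Hypothetical helper, not from the thread: a rough similarity score for two
// HTML documents that have already been downloaded.

/** Return a similarity percentage (0 to 100) for the visible text of two pages. */
function text_similarity(string $htmlA, string $htmlB): float
{
    $normalize = function (string $html): string {
        // Drop <script> and <style> blocks, which often differ between requests.
        $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html) ?? $html;
        // Remove the remaining tags and collapse runs of whitespace.
        $text = strip_tags($html);
        return trim(preg_replace('/\s+/', ' ', $text) ?? $text);
    };

    $a = $normalize($htmlA);
    $b = $normalize($htmlB);
    if ($a === '' && $b === '') {
        return 100.0; // two empty texts count as identical
    }

    similar_text($a, $b, $percent); // built-in; fills $percent by reference
    return (float) $percent;
}
```

You would call it on the two response bodies, e.g. `text_similarity(file_get_contents($urlA), file_get_contents($urlB))`, and treat a high score as "probably the same page".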

Matthew Scharley
That may not help, if the page contents depend at all on the URL.
John Saunders
Offsite pages will have to be referred to offsite on both (one would assume), and relative links should be the same. In a well-crafted site, relative links shouldn't include the domain of the site, but I'll admit there are sites like that around... There is no perfect solution to this though.
Matthew Scharley
Also, if the OP is looking for identical *content*, you *would* have to execute all the JavaScript on the page and download all the referenced files.
Imagist
However, I think that this is probably the closest thing to a solution the OP is likely to find.
Imagist
No, I mean if the page generates different content based on the URL. On one site I dealt with, I believe the login page had a different layout depending on the URL used to retrieve it.
John Saunders
But John, in that case, then the pages **aren't** the same are they? They have different content...
Matthew Scharley
Again, the OP needs to define what he means by "the same page"
John Saunders
A: 

You could possibly get some level of certainty that they are the same by comparing file sizes after issuing a HEAD request, although that doesn't give you exactly what you want.

If the file sizes match after the HEAD requests, you could then fetch the contents and compare them.

Here is some info on doing a HEAD request:

http://www.eggheadcafe.com/tutorials/aspnet/2c13cafc-be1c-4dd8-9129-f82f59991517/the-lowly-http-head-reque.aspx
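In PHP, the size-first check described above might look roughly like the sketch below. The function names are hypothetical, and the `$head` and `$get` parameters are only there so the logic can be tried without network access. It assumes the server sends a Content-Length header at all; when the size is unknown, the sketch falls back to comparing the bodies.

```php
<?php
// Hypothetical helpers for a two-stage comparison: sizes first, bodies second.

/** Content-Length reported for a HEAD request, or null if unavailable. */
function head_content_length(string $url): ?int
{
    $context = stream_context_create(['http' => ['method' => 'HEAD']]);
    $headers = @get_headers($url, 1, $context); // 1 => associative array
    if ($headers === false) {
        return null;
    }
    $headers = array_change_key_case($headers, CASE_LOWER);
    $length = $headers['content-length'] ?? null;
    if (is_array($length)) {
        $length = end($length); // after redirects, the last value is the final response
    }
    return is_numeric($length) ? (int) $length : null;
}

/** True only if both URLs return byte-identical bodies. */
function same_by_size_then_content(string $a, string $b, ?callable $head = null, ?callable $get = null): bool
{
    $head = $head ?? 'head_content_length';
    $get  = $get ?? 'file_get_contents';

    $sizeA = $head($a);
    $sizeB = $head($b);
    // Differing sizes rule the pair out without downloading either body.
    if ($sizeA !== null && $sizeB !== null && $sizeA !== $sizeB) {
        return false;
    }

    // Same size (or size unknown): only now fetch and compare the bodies.
    $bodyA = $get($a);
    $bodyB = $get($b);
    return $bodyA !== false && $bodyB !== false && $bodyA === $bodyB;
}
```

Matching sizes alone never count as a match here; they only decide whether the GET requests are worth making.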

John Boker
"Hello world!" and "Goodbye Bob!" have the same content-length. In reality, this alone isn't a very good measure; there's just far too much room for false positives.
Matthew Scharley
If you've got the contents, why would you compare the filesizes instead of comparing the contents?
Imagist
I think that was a typo. HEAD requests don't send content, but they do (if the server behaves itself) send a Content-Length header.
Matthew Scharley
I didn't explain very well: you first compare the file sizes, then if they are the same you compare the content (after sending the GET request). That way you don't have to get the content for every page, just for the ones that have the same file sizes.
John Boker
It's a good optimisation, and if you can take advantage of keep-alives, then it's probably worth it. If you need to make two connections though, then it's probably not.
Matthew Scharley
Hi Matthew, thanks for your comments. What about the HTTP HEAD request from ACoolie below? Is that a good idea? I also edited my question to ask how to reach a level of certainty that I am comfortable with. What factors should I look at?
keisimone
A: 

soccernet.com and soccernet.espn.go.com are completely different URLs. It's a very specific case in which the program would need to actually request soccernet.com over HTTP to notice that it redirects to soccernet.espn.go.com. Is that viable in your case?

Havenard
A: 

You can do an HTTP HEAD request to determine if the page is being redirected somewhere else. You could compare the actual response file, but with a website like ESPN even the same url will rarely respond with the same contents, due to tracking javascript and ads.

Use the get_headers() function and recursively follow the 'Location' key. So 'soccernet.com' redirects to 'http://soccernet.espn.go.com/archive/' which redirects to 'http://soccernet.espn.go.com/index'. Ignoring the query string, this url and the other url you have are equivalent.

print_r(get_headers('http://soccernet.espn.go.com/archive/', 1));
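A minimal sketch of following the 'Location' header hop by hop and then comparing the final URLs while ignoring the scheme and the query string. All function names here are hypothetical. Because `get_headers()` follows redirects on its own by default, the sketch turns that off through a stream context so each hop is visible. It only handles absolute, protocol-relative and root-relative Location values, and `$fetchHeaders` is injectable purely so the logic can be tried offline.

```php
<?php
/** Follow 'Location' headers starting at $url and return the last URL reached. */
function resolve_final_url(string $url, ?callable $fetchHeaders = null, int $maxHops = 10): string
{
    $fetchHeaders = $fetchHeaders ?? function (string $u) {
        $context = stream_context_create([
            'http' => ['method' => 'HEAD', 'follow_location' => 0],
        ]);
        return @get_headers($u, 1, $context); // 1 => associative array
    };

    for ($hop = 0; $hop < $maxHops; $hop++) {
        $headers = $fetchHeaders($url);
        if (!is_array($headers)) {
            break; // request failed; keep the last URL we had
        }
        $headers = array_change_key_case($headers, CASE_LOWER);
        $location = $headers['location'] ?? '';
        if (is_array($location)) {
            $location = end($location);
        }
        if (!is_string($location) || $location === '') {
            break; // no redirect: this is the final URL
        }

        if (is_string(parse_url($location, PHP_URL_SCHEME))) {
            $url = $location; // absolute URL
        } elseif (strncmp($location, '//', 2) === 0) {
            $url = parse_url($url, PHP_URL_SCHEME) . ':' . $location; // protocol-relative
        } elseif ($location[0] === '/') {
            $base = parse_url($url); // root-relative: reuse scheme, host and port
            $port = isset($base['port']) ? ':' . $base['port'] : '';
            $url = $base['scheme'] . '://' . $base['host'] . $port . $location;
        } else {
            break; // other relative forms are not handled in this sketch
        }
    }
    return $url;
}

/** Lower-cased host plus path, ignoring scheme, query string and trailing slash. */
function comparison_key(string $url): string
{
    $parts = parse_url($url) ?: [];
    return strtolower($parts['host'] ?? '') . rtrim($parts['path'] ?? '', '/');
}

/** True if both URLs end up at the same host and path after redirects. */
function probably_same_page(string $a, string $b, ?callable $fetchHeaders = null): bool
{
    return comparison_key(resolve_final_url($a, $fetchHeaders))
        === comparison_key(resolve_final_url($b, $fetchHeaders));
}
```

Whether dropping the query string is safe depends on the site: on some sites the query string selects the page, so treat a match as "probably the same" rather than proof.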
ACoolie
Will this also work for the HTTPS and HTTP situation?
keisimone
Yes. But, reiterating the problem, the "..." section will be different between the two url's, and even the same url checked twice.
http://gmail.com -> http://mail.google.com/mail/ -> https://www.google.com/accounts/...
https://gmail.com -> https://mail.google.com/mail/ -> https://www.google.com/accounts/...
ACoolie
A: 

Maybe cURL is your friend. It can follow redirects like this.
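For example, something along these lines should give the URL that cURL ends up at after following redirects. It is only a sketch: it requires the cURL extension, `curl_effective_url()` is a hypothetical name, and the timeout and redirect limit are arbitrary assumptions.

```php
<?php
/** Return the URL cURL ends up at after following redirects, or null on failure. */
function curl_effective_url(string $url, int $maxRedirects = 10): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true, // HEAD-style request: headers only
        CURLOPT_FOLLOWLOCATION => true, // let cURL follow Location headers
        CURLOPT_MAXREDIRS      => $maxRedirects,
        CURLOPT_RETURNTRANSFER => true, // do not echo anything
        CURLOPT_TIMEOUT        => 10,
    ]);

    if (curl_exec($ch) === false) {
        return null; // connection error, timeout, too many redirects, ...
    }
    $finalUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
    return is_string($finalUrl) ? $finalUrl : null;
}
```

Calling it on both URLs and comparing the results (perhaps after dropping the scheme and query string) gives the same kind of check as following the headers by hand. Some servers answer HEAD requests differently from GET, in which case you would drop `CURLOPT_NOBODY`.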

fabrik