ansaurus

Question

Given two URLs, how can I tell whether they actually refer to the same website or web page?

Answer 1

+4 A:

I really don't think this is possible, given your soccernet example, without actually comparing the output you get from each page.

chrissr 2009-08-17 03:39:50

Answer 2

A:

You cannot determine this in the general case. http://server1/page.aspx and http://server2/page.aspx could be the same page if server1 and server2 both map to the same IP address, or even if they just map to the same server farm.

In fact, even if they were the same page, they could have completely different contents, if the page renders differently based on the URL used to request it.
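As a rough illustration of the "same IP address" case, here is a minimal Python sketch that checks whether two hostnames currently resolve to at least one common address. The function names `resolve_ips` and `share_an_ip` are made up for this example. A shared address is only a hint: one IP can serve many unrelated virtual hosts, and one site can sit behind many IPs.

```python
import socket


def resolve_ips(hostname):
    """Return the set of IP addresses the hostname currently resolves to."""
    infos = socket.getaddrinfo(hostname, None)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return {info[4][0] for info in infos}


def share_an_ip(host_a, host_b, resolve=resolve_ips):
    """True if the two hostnames have at least one resolved address in common.

    `resolve` is injectable so the check can be exercised without DNS.
    """
    return bool(resolve(host_a) & resolve(host_b))
```

A call such as share_an_ip("server1", "server2") would use live DNS; a True result still says nothing about what each host actually serves.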

John Saunders 2009-08-17 03:40:49

In general though, like you said, you can't determine it. Simply having the same path and IP doesn't mean they are the same file, and similarly, having different IPs and paths doesn't mean they are different.

Matthew Scharley 2009-08-17 03:42:23

Answer 3

+1 A:

The only way is to download each page and compare them.

Really, this shouldn't be too much trouble, since your average HTML file is fairly small (normally well under 100 KB). You don't need to download all the referenced files.
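A minimal Python sketch of this download-and-compare idea, using only the standard library. The names `fetch_body`, `same_bytes` and `same_page` are made up for the example, and comparing SHA-256 digests rather than the raw bytes is just one choice (handy if you want to store a fingerprint per URL). It only compares the HTML itself, not scripts, images or other referenced files.

```python
import hashlib
import urllib.request


def fetch_body(url, timeout=10):
    """Download the raw response body (urllib follows redirects by default)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def same_bytes(body_a, body_b):
    """True when the two bodies are byte-for-byte identical (compared via digest)."""
    return hashlib.sha256(body_a).digest() == hashlib.sha256(body_b).digest()


def same_page(url_a, url_b, fetch=fetch_body):
    """Fetch both URLs and compare what comes back.

    `fetch` is injectable so the comparison can be tried without a network.
    """
    return same_bytes(fetch(url_a), fetch(url_b))
```

Pages that embed timestamps, ads or session tokens will compare as different even when a human would call them the same page.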

Matthew Scharley 2009-08-17 03:41:06

That may not help, if the page contents depend at all on the URL.

John Saunders 2009-08-17 03:41:57

Offsite pages will have to be referred to offsite on both (one would assume), and relative links should be the same. In a well-crafted site, relative links shouldn't include the domain of the site, but I'll admit there are sites like that around... There is no perfect solution to this though.

Matthew Scharley 2009-08-17 03:44:40

Also, if the OP is looking for identical *content*, you *would* have to execute all the JavaScript on the page and download all the referenced files.

Imagist 2009-08-17 03:46:21

However, I think that this is probably the closest thing to a solution the OP is likely to find.

Imagist 2009-08-17 03:47:05

No, I mean if the page generates different content based on the URL. On one site I dealt with, I believe the login page had a different layout depending on the URL used to retrieve it.

John Saunders 2009-08-17 03:47:43

But John, in that case, then the pages **aren't** the same are they? They have different content...

Matthew Scharley 2009-08-17 03:48:37

Again, the OP needs to define what he means by "the same page"

John Saunders 2009-08-17 03:52:16

Answer 4

A:

You could possibly get some level of certainty that they are the same by comparing file sizes after issuing a HEAD request, although that doesn't give you exactly what you want.

After the HEAD request, you could fetch the contents and compare them if the file sizes are the same.

Here is some info on doing a HEAD request:

http://www.eggheadcafe.com/tutorials/aspnet/2c13cafc-be1c-4dd8-9129-f82f59991517/the-lowly-http-head-reque.aspx
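For illustration, a minimal Python sketch of the HEAD-then-GET idea. The names `head_content_length`, `get_body` and `probably_same` are made up for this example, and it assumes the server sends a usable Content-Length header; when the header is missing, it falls back to downloading both bodies.

```python
import urllib.request


def head_content_length(url, timeout=10):
    """Send a HEAD request; return Content-Length as an int, or None if absent."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        value = response.headers.get("Content-Length")
    if value is None or not value.strip().isdigit():
        return None
    return int(value)


def get_body(url, timeout=10):
    """Download the raw response body with a GET request."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def probably_same(url_a, url_b, head=head_content_length, get=get_body):
    """Cheap size check first, full content comparison only if the sizes agree."""
    len_a, len_b = head(url_a), head(url_b)
    # Two known but different lengths: the bodies cannot match, skip the download.
    if len_a is not None and len_b is not None and len_a != len_b:
        return False
    # Equal (or unknown) lengths prove nothing, so compare the actual bytes.
    return get(url_a) == get(url_b)
```

The size check only ever rules pairs out; equal sizes still require the content comparison.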

John Boker 2009-08-17 03:42:33

"Hello world!" and "Goodbye Bob!" have the same Content-Length. In reality, this alone isn't a very good measure; there's just far too much room for false positives.

Matthew Scharley 2009-08-17 03:47:40

If you've got the contents, why would you compare the filesizes instead of comparing the contents?

Imagist 2009-08-17 03:48:34

I think that was a typo. HEAD requests don't send content, but they do (if the server behaves itself) send a Content-Length header.

Matthew Scharley 2009-08-17 03:49:39

I didn't explain that very well. You first compare the file sizes; then, if they are the same, you compare the content (after sending the GET request). That way you don't have to get the content for every page, just for the ones that have the same file size.

John Boker 2009-08-17 04:13:38

It's a good optimisation, and if you can take advantage of keep-alives, then it's probably worth it. If you need to make two connections though, then it's probably not.

Matthew Scharley 2009-08-17 04:25:26

Hi Matthew, thanks for your comments. What about the HTTP HEAD request from ACoolie below? Is that a good idea? I also edited my question to ask for a level of certainty that I am comfortable with. What factors should I ask for?

keisimone 2009-08-17 06:14:51

Answer 5

A:

soccernet.com and soccernet.espn.go.com are completely different URLs. It's a very specific case: the program would need to request soccernet.com over HTTP to notice that it redirects to soccernet.espn.go.com. Is that viable in your case?

Havenard 2009-08-17 03:45:52

Answer 6

A:

You can do an HTTP HEAD request to determine if the page is being redirected somewhere else. You could compare the actual response body, but with a website like ESPN even the same URL will rarely respond with the same contents, due to tracking JavaScript and ads.

Use the get_headers() function and recursively follow the 'Location' key. So 'soccernet.com' redirects to 'http://soccernet.espn.go.com/archive/' which redirects to 'http://soccernet.espn.go.com/index'. Ignoring the query string, this URL and the other URL you have are equivalent.

print_r(get_headers('http://soccernet.espn.go.com/archive/', 1));
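The same idea of following the Location header hop by hop can be sketched in Python (rather than PHP's get_headers()) with the standard library. The names `head_location` and `redirect_chain`, the list of redirect status codes, and the `max_hops` limit are choices made for this example, not part of any particular API.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_STATUSES = (301, 302, 303, 307, 308)


def head_location(url, timeout=10):
    """Send one HEAD request; return the Location header on a redirect, else None."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        if response.status in REDIRECT_STATUSES:
            return response.getheader("Location")
        return None
    finally:
        conn.close()


def redirect_chain(url, next_location=head_location, max_hops=10):
    """Follow redirects one hop at a time and return every URL visited, in order."""
    chain = [url]
    for _ in range(max_hops):
        location = next_location(chain[-1])
        if location is None:
            break
        # Location may be relative, so resolve it against the current URL.
        target = urljoin(chain[-1], location)
        if target in chain:  # redirect loop: stop rather than spin
            break
        chain.append(target)
    return chain
```

Two URLs whose chains end at the same final URL are reasonable candidates for "the same page"; redirects done in JavaScript or via meta refresh would not be seen by this approach.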

ACoolie 2009-08-17 03:46:28

Will this also work for the HTTPS versus HTTP situation?

keisimone 2009-08-17 06:12:26

Yes. But, reiterating the problem, the "..." section will be different between the two URLs, and even for the same URL checked twice.

http://gmail.com -> http://mail.google.com/mail/ -> https://www.google.com/accounts/...

https://gmail.com -> https://mail.google.com/mail/ -> https://www.google.com/accounts/...
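One way to compare final URLs while ignoring the scheme and the query string is to reduce each URL to a comparison key first. This is only a sketch in Python: the name `comparison_key` is made up, and treating http/https, query strings and a trailing slash as insignificant are assumptions that will be wrong for some sites.

```python
from urllib.parse import urlsplit


def comparison_key(url):
    """Reduce a URL to (host, path), dropping scheme, port, query and fragment."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # urlsplit already lowercases the hostname
    # Assumption for this sketch: "/mail" and "/mail/" name the same resource.
    path = parts.path.rstrip("/") or "/"
    return (host, path)
```

With this key, http://mail.google.com/mail/ and https://mail.google.com/mail/?x=1 compare as equal, while two different paths on the same host do not.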

ACoolie 2009-08-17 14:25:15

Answer 7

A:

Maybe cURL is your friend. It can follow redirects like this.

fabrik 2009-08-17 08:06:43
