views:

30

answers:

4

We are migrating web servers, and it would be nice to have an automated way to check some of the basic site structure to see if the rendered pages are the same on the new server as on the old server. Does anyone know of anything to assist with this task?

A: 

Copy the files to the same server in /tmp/directory1 and /tmp/directory2 and run the following command:

diff -r /tmp/directory1 /tmp/directory2

For all intents and purposes, you can put them in your preferred location with your preferred naming convention.
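As a minimal, runnable sketch of that recursive diff (the /tmp/old-site and /tmp/new-site paths and the sample files are hypothetical stand-ins for the two downloaded copies):

```shell
# Set up two hypothetical site copies with one extra file on the new side.
mkdir -p /tmp/old-site /tmp/new-site
echo '<h1>Hello</h1>' > /tmp/old-site/index.html
echo '<h1>Hello</h1>' > /tmp/new-site/index.html
echo '<p>Only on new</p>' > /tmp/new-site/extra.html

# -r recurses into subdirectories; -q only names the files that differ.
# diff exits 1 when differences exist, so "|| true" keeps a strict shell going.
diff -rq /tmp/old-site /tmp/new-site || true
```

With `-q` you get a quick list of mismatched or missing files; drop it to see line-by-line differences within each file.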

Edit 1

You could potentially use lynx -dump or a wget and run a diff on the results.

Warner
That would compare the files themselves would it not? I want to compare the rendered pages, after they have run through apache (and PHP). I think I am looking for a web spider or crawler of some sort.
Josh
+2  A: 

The catch is how to check the 'rendered' pages. If the pages don't have any dynamic content, the easiest way is to generate hashes for the files using the md5sum or sha1sum commands and check them against the new server.
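A small sketch of the hash check on static pages (file names here are hypothetical; in practice you would hash the same page fetched from each server):

```shell
# Two copies of the same static page, one per server.
printf 'same content\n' > /tmp/old.html
printf 'same content\n' > /tmp/new.html

# Hash the contents; identical files produce identical digests.
old=$(md5sum < /tmp/old.html)
new=$(md5sum < /tmp/new.html)
[ "$old" = "$new" ] && echo "pages match"
```

sha1sum works the same way if you prefer SHA-1 over MD5.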

If the pages have dynamic content, you will have to download the site using a tool like wget:

wget --mirror http://thewebsite/thepages

and then use diff as suggested by Warner, or do the hash thing again. I think diff may be the best way to go, since even a one-character change will completely change the hash, while diff shows you exactly what changed.
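To illustrate that last point, here is a small sketch (with hypothetical file names) showing that a one-character change produces entirely different digests, while diff pinpoints the changed line:

```shell
# Two pages that differ by a single character ("two" vs. "twp").
printf 'line one\nline two\n' > /tmp/a.html
printf 'line one\nline twp\n' > /tmp/b.html

# The digests differ completely, telling you only that something changed.
md5sum /tmp/a.html /tmp/b.html

# diff shows exactly which line changed (exit status 1 means a difference).
diff /tmp/a.html /tmp/b.html || true
```

So hashes are a fast equality check, but diff is what you want when you need to see where the pages diverge.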

I was editing before I saw your answer. You provide a good recommendation.
Warner
A: 

Short of rendering each page, taking screen captures, and comparing those screenshots, I don't think it's possible to compare the rendered pages.

However, it is certainly possible to compare the downloaded website after downloading recursively with wget.

  wget [option]... [URL]...

   -m
   --mirror
       Turn on options suitable for mirroring.  This option turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP
       directory listings.  It is currently equivalent to -r -N -l inf --no-remove-listing.

The next step would then be to do the recursive diff that Warner recommended.

Jeff McJunkin
+2  A: 

Get the formatted output of both sites (here we use w3m, but lynx also works):

w3m -dump http://google.com 2>/dev/null > /tmp/1.html
w3m -dump http://google.de 2>/dev/null > /tmp/2.html

Then use wdiff; it can report what percentage of the two texts is common:

wdiff -nis /tmp/1.html /tmp/2.html

It can also be easier to see the differences using colordiff:

wdiff -nis /tmp/1.html /tmp/2.html | colordiff

Excerpt of output:

Web Images Vidéos Maps [-Actualités-] Livres {+Traduction+} Gmail plus »
[-iGoogle |-]
Paramètres | Connexion

                           Google [hp1] [hp2]
                                  [hp3] [-Français-] {+Deutschland+}

           [                                                         ] Recherche
                                                                       avancéeOutils
                      [Recherche Google][J'ai de la chance]            linguistiques


/tmp/1.html: 43 words  39 90% common  3 6% deleted  1 2% changed
/tmp/2.html: 49 words  39 79% common  9 18% inserted  1 2% changed

(it actually served google.com in French... funny)

The common % values show how similar the two texts are. Plus, you can easily see the differences word by word (instead of line by line, which can be cluttered).

Weboide