views: 431
answers: 11
What are some good techniques for detecting whether a webpage is the same?

By "same", I don't mean char-for-char equivalent (that's easy), but something robust enough to ignore things like the current date/time on the page, etc.

E.g., go to a Yahoo! News article, load the page, then open the same page 10 minutes later in another browser. Barring rewrites, those pages will have some differences (some ads, possibly things like related stories), but a human could look at the two and say they're the same.

Note I'm not trying to fix (or rely on) URL normalization, i.e., figuring out that foo.html & foo.html?bar=bang are the same.

A: 

Without intimate knowledge of the structure of the pages you're trying to compare, this could be very tricky. That is, how is a machine supposed to tell that a page with a couple of different pictures is the same? If it's a news site with ads, then it probably is the same; but if it's a photographer's portfolio, then it's definitely different.

If you do know the structure of the page, then what I'd do is manually select portions of the page (using IDs, CSS selectors, XPath, etc) to compare. For example, only compare the #content divs between page refreshes. From there, you might need to add a tolerance level to a char-by-char comparison.
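
For example, if the interesting text always lives in something like a #content div, a few lines of Python with BeautifulSoup are enough. This is only a sketch of the idea; the selector and function names are illustrative, not taken from the question:

    from bs4 import BeautifulSoup

    def extract_content_text(html, selector="#content"):
        """Return normalized text from the selected region, or None if it's missing."""
        soup = BeautifulSoup(html, "html.parser")
        node = soup.select_one(selector)
        if node is None:
            return None
        # Collapse whitespace so formatting-only changes don't register as differences.
        return " ".join(node.get_text().split())

    def same_content(html_a, html_b, selector="#content"):
        return extract_content_text(html_a, selector) == extract_content_text(html_b, selector)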

There's a service which does a similar thing, actually. It's called Rsspect (written by Ryan North of Qwantz fame), which will detect changes to any website and create an RSS feed out of it, even if you don't control the page.

nickf
A: 

You could generate an MD5 hash of each of them, then compare those. Like you said, easy enough.

What you're looking for is a technique for comparing two pages that have arbitrary elements that can change. It's a hard problem.

  1. Identify the areas of the page that can change and that you don't care about. Careful! They will always move around.
  2. Hash or checksum the DOM of just the parts of the page you DO care about (a rough sketch follows below). Careful! These also will always be changing.
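
As a rough sketch of step 2, assuming you've already chosen the selectors that mark the parts you care about (the ones below are placeholders):

    import hashlib
    from bs4 import BeautifulSoup

    def fingerprint(html, selectors=("#content", ".article-body")):
        """Checksum only the text of the regions we care about."""
        soup = BeautifulSoup(html, "html.parser")
        digest = hashlib.sha256()
        for selector in selectors:
            for node in soup.select(selector):
                # Normalize whitespace so trivial reflows don't change the hash.
                digest.update(" ".join(node.get_text().split()).encode("utf-8"))
        return digest.hexdigest()

    # Two fetches count as "the same" if their fingerprints match exactly.
    # same = fingerprint(html_a) == fingerprint(html_b)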

You are up against the first rule of screen scraping: the page is inherently volatile. So it's a tough problem. Your solution will never be robust enough to account for the infinite variety of subtle changes your source data will be subject to, unless you also have direct control over the source pages and can design your solution against that.

Good luck! I've had experience with systems that tried to solve this problem and it's indeed a tough nut to crack.

Genericrich
Hashing will only get you so far because it's a binary test: either the hashes match or they don't. The other measures mentioned in other answers (cosine similarity, etc.) tell you more precisely *how* close the pages are. When dealing with web content, that's probably the realm you want to be in.
Ian Varley
A: 

The way to do this is not to compare the whole page, because, as you say, a human wouldn't be fooled by those differences either. Say you are interested in the news articles of a Yahoo! page; then you should look just at the news section. Then you can do whatever you like: a hash or a literal comparison between the new and old versions.

Robert Gould
+1  A: 

I use vgrep for that sort of stuff.

It's a little known tool called visual-grep which relies on advanced technology like the sapient ocular device and visual cortex for very quickly determining the sameness of pages side-by-side, and it's remarkably accurate and efficient (it ought to be since it's been under development for quite a long time).

Marking community wiki in case the humor police are out today :-).

paxdiablo
The humor police should so down-vote you for the lameness of this joke ;)
Robert Gould
+1. Too bad you community wiki-ed. =)
A. Rex
If I hadn't, I suspect I'd be -20 by now. Most SO'ers (myself included) seem to frown on humor in answers.
paxdiablo
You could even use Google's server farm to do it - http://www.google.com/technology/pigeonrank.html
Pete Kirkham
+8  A: 

It sounds like you are after a robust way to measure the similarity of two pages.

Given that the structure of the page won't change that much, we can reduce the problem to testing whether the text on the page is roughly the same. Of course, with this approach the problems alluded to by nickf regarding a photographer's page are still there, but if you are mainly concerned with Yahoo! News or the like, this should be okay.

To compare two pages, you can use a method from machine learning called "string kernels". Here's an early paper, a recent set of slides on an R package, and a video lecture.

Very roughly, a string kernel looks at how many words, pairs of words, triples of words, etc. two documents have in common. If A and B are two documents and k is a string kernel, then the higher the value of k(A, B), the more similar the two documents are.

If you set a threshold t and only say two documents are the same when k(A, B) > t, you should have a reasonably good way of doing what you want. Of course, you'll have to tune the threshold to get the best results for your application.
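
As a toy illustration (not the kernels from the paper or the R package), here is a word-level "spectrum" kernel that counts shared word n-grams up to length 3 and normalizes so identical documents score 1.0; the n-gram range and threshold are just starting points:

    from collections import Counter

    def ngrams(text, n):
        words = text.lower().split()
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

    def string_kernel(a, b, max_n=3):
        """k(A, B): sum over n of the dot products of n-gram count vectors."""
        score = 0
        for n in range(1, max_n + 1):
            ca, cb = ngrams(a, n), ngrams(b, n)
            score += sum(count * cb[gram] for gram, count in ca.items())
        return score

    def same_page(text_a, text_b, threshold=0.8):
        k_ab = string_kernel(text_a, text_b)
        norm = (string_kernel(text_a, text_a) * string_kernel(text_b, text_b)) ** 0.5
        return norm > 0 and k_ab / norm > threshold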

Mark Reid
+3  A: 

You can detect that two pages are the same by using some sort of similarity metric such as the cosine similarity. Then you would have to define a minimum threshold to decide whether the two documents are the same. For example, I would pick a value close to 1 when applying the cosine measure, since it ranges from -1 for totally different to 1 for identical.
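
A minimal version of that, using plain term-frequency vectors (with non-negative counts the value stays in [0, 1]; the 0.95 threshold is only an illustrative starting point):

    import math
    from collections import Counter

    def cosine_similarity(text_a, text_b):
        va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
        dot = sum(count * vb[word] for word, count in va.items())
        norm_a = math.sqrt(sum(c * c for c in va.values()))
        norm_b = math.sqrt(sum(c * c for c in vb.values()))
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot / (norm_a * norm_b)

    # pages_match = cosine_similarity(text_a, text_b) > 0.95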

Marcel Tjandraatmadja
A: 

The first thought that came into my head was to process the pages into XML documents with BeautifulSoup (Python), run a diff on them, and count the number of lines that differ. If the count is > X%, they are different. Not very robust and probably prone to error, but that'd be the quick hack I'd do for testing.

You might want to have a look at this page which discusses comparing two XML documents:
http://www.ibm.com/developerworks/xml/library/x-diff/index.html

An HTML document can be coerced into an XML document with Beautiful Soup, then compared using the techniques listed there.
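
A quick-hack version of the line-counting idea, using difflib on the pretty-printed markup (the 10% cutoff is as arbitrary as the "> X%" above):

    import difflib
    from bs4 import BeautifulSoup

    def roughly_same(html_a, html_b, max_changed_fraction=0.10):
        lines_a = BeautifulSoup(html_a, "html.parser").prettify().splitlines()
        lines_b = BeautifulSoup(html_b, "html.parser").prettify().splitlines()
        changed = sum(1 for line in difflib.ndiff(lines_a, lines_b)
                      if line.startswith(("+ ", "- ")))
        total = max(len(lines_a) + len(lines_b), 1)
        return changed / total <= max_changed_fraction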

Josh Smeaton
+1  A: 

You could use a web browser component to render a screenshot of the two pages, and then compare the images. Might be the simplest option.
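
A sketch of that approach with Selenium and Pillow; it assumes both pages render at the same window size, and note that any visible pixel difference (including rotating ads) will count, so this is the least tolerant option:

    from selenium import webdriver
    from PIL import Image, ImageChops

    def screenshots_match(url_a, url_b, size=(1280, 1024)):
        driver = webdriver.Firefox()
        try:
            driver.set_window_size(*size)
            driver.get(url_a)
            driver.save_screenshot("page_a.png")
            driver.get(url_b)
            driver.save_screenshot("page_b.png")
        finally:
            driver.quit()
        diff = ImageChops.difference(Image.open("page_a.png").convert("RGB"),
                                     Image.open("page_b.png").convert("RGB"))
        return diff.getbbox() is None  # None means no differing pixels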

amdfan
A: 

I had a similar problem. I was trying to devise a safe linking system for a directory of user submitted links. A user would publish a page on a blog or news site and submit the link to the index. A human would verify the link to be appropriate then add the page into the index.

The problem was to come up with a way to automate checks that ensured the link was still appropriate over time. For instance, did someone modify the page weeks later and insert racial slurs? Did the news site start telling people 'you must subscribe to read this story'?

I ended up extracting paragraph <p> elements and comparing the cached copy to the current one, word for word. In simplest terms:

    cached[]  = { "Lorem", "Ipsum", "..." };
    scanned[] = { "Lorem", "foo", "..." };

After that, a series of sorters would work on it, ignoring common words ('if', 'but', 'can', 'or', 'and') while treating other words (profanity, etc.) with a heavier weight.

This resulted in a scoring system that would all but ignore minor edits and revisions (typos, sentence structure, etc) but quickly reveal if the content needed to be examined again. A score was then returned, scores above a threshold would be put in a queue for a human to re-verify.

This also helped to account for major cosmetic changes to the site. I would not trust it to run completely on its own, but it did do its job predictably well with a little help from humans. Admittedly, the system was not as efficient as it could have been as far as the methodology goes.
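
A rough reconstruction of the scoring idea (not the original code); the stop-word and heavy-word lists and the weight are purely illustrative:

    STOP_WORDS = {"if", "but", "can", "or", "and", "the", "a", "to"}
    HEAVY_WORDS = {"subscribe"}  # e.g. profanity, paywall phrases, etc.

    def change_score(cached_words, scanned_words, heavy_weight=5):
        cached = {w.lower() for w in cached_words} - STOP_WORDS
        scanned = {w.lower() for w in scanned_words} - STOP_WORDS
        score = 0
        for word in cached.symmetric_difference(scanned):  # words added or removed
            score += heavy_weight if word in HEAVY_WORDS else 1
        return score

    # Pages whose score exceeds some threshold go into the human-review queue.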

Tim Post
+2  A: 

Depending on what you're doing, you might be interested in TemplateMaker. You give it some strings (such as web pages) and it marks out the bits that change.

In your Yahoo! News example, you'd fetch the page once and tell TemplateMaker to learn it. Then you'd fetch it again and tell it to learn that one.

When you were happy that your TemplateMaker knew what was the same every time, you could fetch another page and ask TemplateMaker whether it matched the template from the others. (It would give you the pieces that had changed, if you were interested in that.)
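
Something like the following, where the Template/learn/extract names reflect my recollection of the templatemaker interface and should be checked against the actual library:

    from templatemaker import Template

    # html_v1, html_v2, html_v3 are three fetches of the same page (as strings).
    template = Template()
    template.learn(html_v1)
    template.learn(html_v2)

    # extract() should return just the varying pieces of the third fetch; if the
    # page no longer fits the learned template, that mismatch is your signal
    # that something structural changed.
    changed_bits = template.extract(html_v3)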

RJHunter
A: 

You could try to use HTTP headers like If-Modified-Since, or some other cache-related headers. Also, it may be helpful to look at a site-map file to see how often search engines are expected to check back.
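
For the header idea, a conditional GET is cheap; note that it only tells you whether the server claims the resource changed, and many dynamic pages ignore it entirely, so treat it as a first pass rather than a real comparison:

    import requests

    def maybe_changed(url, last_modified):
        # last_modified is a previous response's Last-Modified value,
        # e.g. "Sat, 29 Oct 1994 19:43:31 GMT".
        response = requests.get(url, headers={"If-Modified-Since": last_modified})
        return response.status_code != 304  # 304 Not Modified => unchanged per the server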

My other attempt (maybe to be used in conjunction) would be to make a list of all ids and classes found in divs on the page. If these lists don't match up, it's likely that there's been a reasonably noticeable change. Otherwise, they're probably very similar.

EDIT: You might also compare the srcs of img elements.
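
A sketch of that structural fingerprint (which attributes matter is up to you; div ids/classes and img srcs are just the ones suggested above):

    from bs4 import BeautifulSoup

    def structure_fingerprint(html):
        soup = BeautifulSoup(html, "html.parser")
        ids = {div.get("id") for div in soup.find_all("div") if div.get("id")}
        classes = {cls for div in soup.find_all("div") for cls in div.get("class", [])}
        srcs = {img.get("src") for img in soup.find_all("img") if img.get("src")}
        return ids, classes, srcs

    def probably_same(html_a, html_b):
        return structure_fingerprint(html_a) == structure_fingerprint(html_b)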

stalepretzel