views: 186

answers: 5
I'm trying to identify differences between a base case and a supplied case. I'm looking for a library that can tell me the similarity as a percentage, or something like that.

For Example:

I have 10 different HTML pages. All of them are 404 responses that differ only in one or two lines of random content (such as the time or a quote of the day).

Now, when I supply a new 404 page, I want to get a result back such as "80% similar"; however, if I supply a totally different page, or one from the same website with quite different content, I should get something like "20% similar".

Basically, what I want to do is: when I get a new response, identify whether it is similar to the 10 pages I supplied before.

I'm trying to solve this in .NET; a library or algorithm recommendation would be great.

A: 

A quick and dirty way would be to compute the Levenshtein distance of the markup.

http://en.wikipedia.org/wiki/Levenshtein_distance
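As a minimal sketch of the idea (in Python here, though Levenshtein implementations for .NET exist too), the edit distance can be turned into a rough "percent similar" score by dividing by the length of the longer string:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (two-row version)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """1.0 means identical, 0.0 means nothing in common."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```

Note this runs in O(len(a) * len(b)) time, so for whole HTML pages it is usable but not fast; it is the "quick and dirty" option the answer describes.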

TraumaPony
A: 

For your task it would be enough to run a command-line diff utility and analyze the results.

Alternatively, you could implement an LCS algorithm, but to me that would be overkill.
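For illustration, Python's standard library already ships an LCS-style similarity measure in difflib.SequenceMatcher (a Ratcliff/Obershelp-style matcher); the same approach could be ported to .NET. A small sketch with made-up page contents:

```python
from difflib import SequenceMatcher

# Hypothetical 404 pages that differ only in a "quote of the day" line.
base = "<html><body><h1>404 Not Found</h1><p>Quote: time flies</p></body></html>"
new  = "<html><body><h1>404 Not Found</h1><p>Quote: carpe diem</p></body></html>"

# ratio() returns a float in [0, 1]; multiply by 100 for a percentage.
score = SequenceMatcher(None, base, new).ratio() * 100
print(f"{score:.0f}% similar")
```

Because only the quote differs, the score comes out high; two unrelated pages would score much lower.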

Ilya Kochetov
A: 

For your task it would be enough to run a command-line diff utility and analyze the results.

This is not a one-time job, really; I need a solution integrated into an application.

And diff has its own problems here, because I cannot tell diff to process 5 pages and ignore the bits that constantly change.

These parts can be big; there can be 2 KB of standard text that keeps changing. From diff's point of view that is a big change, but from my point of view it is just a change to one section (which is known to change in all the other 9 files and should therefore be ignored entirely).

Maybe a diff library can do that but I'm not aware of such a library.
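One way to get that "ignore the constantly changing section" behaviour is to first reduce the baseline pages to the text they all share, and only then compare new responses against that stable skeleton. A hedged sketch assuming Python's difflib; the helper names stable_parts and similarity_to_baseline are made up for illustration:

```python
from difflib import SequenceMatcher
from functools import reduce

def stable_parts(pages):
    """Reduce the baseline pages to the text they all share,
    dropping the sections that vary from page to page."""
    def common(a, b):
        sm = SequenceMatcher(None, a, b, autojunk=False)
        # Keep only the blocks present in both pages.
        return "".join(a[m.a:m.a + m.size] for m in sm.get_matching_blocks())
    return reduce(common, pages)

def similarity_to_baseline(pages, candidate):
    """Score a new response against the stable skeleton only,
    so the known-to-change sections no longer count against it."""
    skeleton = stable_parts(pages)
    return SequenceMatcher(None, skeleton, candidate, autojunk=False).ratio()
```

This is O(n^2) per comparison and fairly naive, but it captures the requirement: the 2 KB block that changes in all 10 baseline pages never makes it into the skeleton, so it cannot drag the score down.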

+1  A: 

Rather than using a diff tool, you could use a copy/paste detector (CPD). Then you can configure a threshold for how alike you want the files to be.

As an aside, I have used these in the past to track down cheaters in school.

Sam Reynolds
A: 

The basic algorithm I would use:

1. Parse the text content of the pages on both sides, the old and the new. As you parse, keep track of how many bytes you have processed; you will use this later to determine what percentage has changed.
2. Now that you have the complete story on each side, build up anchor points of sameness.
3. For every anchor point of sameness, try to expand it forward and backward.
4. Identify any gaps between your sameness anchor points as differences.
5. Loop through every difference gap you have identified and sum up their byte counts.
6. Calculate your percentage of difference from the total difference byte count and the total byte count of the story (the one you calculated earlier).
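The steps above can be sketched with Python's difflib, whose SequenceMatcher already finds the "anchor points of sameness" as matching blocks; the percentage then falls out of summing the non-equal regions between them. A sketch, not a full implementation of the hand-rolled anchor expansion:

```python
from difflib import SequenceMatcher

def percent_changed(old: str, new: str) -> float:
    """Sum the characters in the gap regions between matching anchors
    and divide by the total size, as the steps above describe."""
    sm = SequenceMatcher(None, old, new, autojunk=False)
    diff_bytes = sum(max(i2 - i1, j2 - j1)
                     for tag, i1, i2, j1, j2 in sm.get_opcodes()
                     if tag != "equal")
    total = max(len(old), len(new))
    return 100.0 * diff_bytes / total if total else 0.0
```

Identical inputs score 0% changed; replacing one character in four scores 25%.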

RWendi