I have a lot of HTML files (tens of thousands, GBs worth) scraped from a server, and I want to check that the server produces the same results after some modifications, while ignoring the kinds of differences that don't matter, e.g. whitespace, missing newlines, timestamps, small changes in some kinds of numbers, etc.

Does anyone know of a tool for doing this? I'd really rather not do more filtering than I have to.

(Oh, and it needs to run under Linux.)
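To make the requirements concrete: the filtering I'm hoping a tool will do for me looks roughly like the minimal Python sketch below, which normalizes each file (collapsing whitespace, masking timestamps and volatile numbers) and then hash-compares the two trees. The mirrored directory layout and the specific regexes are placeholder assumptions, not something I've validated against the real data.

    #!/usr/bin/env python3
    # Minimal sketch, not a polished tool: normalize each HTML file so the
    # irrelevant differences disappear, then hash-compare the two trees.
    # The normalization rules (whitespace, ISO-style timestamps, long
    # decimals) and the mirrored directory layout are assumptions.
    import hashlib
    import re
    import sys
    from pathlib import Path

    WS = re.compile(rb"\s+")                                        # whitespace, dropped newlines
    TS = re.compile(rb"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}(:\d{2})?")  # timestamps
    NUM = re.compile(rb"\d+\.\d{3,}")                               # long decimals that drift

    def fingerprint(path: Path) -> str:
        data = path.read_bytes()
        data = WS.sub(b" ", data)       # collapse all whitespace to one space
        data = TS.sub(b"<TS>", data)    # mask timestamps
        data = NUM.sub(b"<NUM>", data)  # mask volatile numbers
        return hashlib.sha256(data).hexdigest()

    def main(old_root: str, new_root: str) -> None:
        old, new = Path(old_root), Path(new_root)
        matched, mismatched = 0, []
        for old_file in old.rglob("*.html"):
            new_file = new / old_file.relative_to(old)
            if new_file.exists() and fingerprint(old_file) == fingerprint(new_file):
                matched += 1
            else:
                mismatched.append(old_file)
        print(f"{matched} files match, {len(mismatched)} don't")
        for f in mismatched[:3]:        # "and here are three of them"
            print(f"  e.g. {f}")

    if __name__ == "__main__":
        main(sys.argv[1], sys.argv[2])

This is the kind of filtering I'd rather not write myself, since getting the masking rules right is exactly the hard part.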

A: 

I use WinMerge a lot on Windows, and from what I can see some people enjoy Meld on Linux, so perhaps that could do the trick for you: http://meld.sourceforge.net/

Other examples I saw from a quick googling were Kompare, xxdiff (xxdiff.sourceforge.net), and KDiff3 (kdiff3.sourceforge.net).

(I could only post one link, so I wrote the addresses for xxdiff and KDiff3 as text.)

Gustav Syrén
Meld seems to be a GUI (and GUIs don't work well with 1000s of files). Also, a quick glance doesn't show any special handling for irrelevant changes in HTML.
BCS
A: 

Beyond Compare is paid software that is actually worth the money (I never thought I'd hear myself typing that!). It is GUI-based but handles thousands of files very well. It lets you mark unimportant changes with regular expressions, as well as whitespace (at the beginning, middle, and end of a line). The feature set is very extensive; check out a trial download.
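The exact regex dialect Beyond Compare uses is its own, so treat the following as an illustration (in Python notation) of the kinds of "unimportant text" patterns meant here, plus a quick way to sanity-check them on a couple of sample files before pointing any tool at the full set:

    # Illustrative ignore-patterns (Python notation); Beyond Compare's own
    # regex dialect may differ. sample_old.html / sample_new.html are
    # hypothetical file names.
    import re

    IGNORE = [
        (re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}(:\d{2})?"), "<TS>"),
        (re.compile(r"\b\d+\.\d{3,}\b"), "<NUM>"),
    ]

    def scrub(text: str) -> str:
        for pattern, placeholder in IGNORE:
            text = pattern.sub(placeholder, text)
        return text

    with open("sample_old.html") as a, open("sample_new.html") as b:
        print("same after scrubbing:", scrub(a.read()) == scrub(b.read()))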

I do not work for this company, I just use Beyond Compare every day at work and enjoy it every time!

Bryan Ash
That's actually the program I'm trying to avoid using. It's not that it's a bad program (I agree it's worth the money, even if the boss won't pay for it), but I'm looking at 20k+ files and counting, and at that point I really just want an "N files match, M don't (and here are three of them)" summary. Also, I don't really want to write the regexes myself, as I'm sure to get them wrong (besides, BC seems to be strongly line-oriented, and one of the diffs I want to ignore is dropped newlines).
BCS
StackOverflow is a great place to ask for regex help.
Bryan Ash
OK, where can I post 2.5 GB of files for regex help? The problem is that the debug cycle will take several minutes per try.
BCS
+1  A: 

You might consider using a clone detector such as our CloneDR. This tool parses large sets of computer program files (HTML is a special case), builds abstract syntax trees representing the essential structure of each file, and compares programs for similarity. Because it compares essential program structure, it ignores inessential differences such as comments and whitespace, and determines that two code segments are either identical or that one can be obtained from the other by substituting other blocks of code. The latter allows the recognition of code that has been modified in various ways. You can see samples of clone detection runs on a variety of computer languages at the website.

In your case, what you would be looking for are files in system A which are essentially clones (exact or near misses) of files in system B. As a general rule, if file A is a variant of file B (e.g., with a few changes), CloneDR will report it as a clone and show the exact differences.
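To illustrate the general idea (this sketch is not CloneDR, just a stdlib-only Python demonstration of why comparing parse structure makes whitespace and formatting differences vanish):

    # Not CloneDR itself -- just a stdlib-only illustration of why comparing
    # parse structure makes whitespace and formatting differences vanish.
    from html.parser import HTMLParser

    class Shape(HTMLParser):
        """Collects the tag skeleton of a document, ignoring text and comments."""
        def __init__(self) -> None:
            super().__init__()
            self.skeleton: list[str] = []
        def handle_starttag(self, tag, attrs):
            self.skeleton.append(f"<{tag}>")
        def handle_endtag(self, tag):
            self.skeleton.append(f"</{tag}>")

    def shape_of(html: str) -> list[str]:
        parser = Shape()
        parser.feed(html)
        return parser.skeleton

    # Two renderings differing only in whitespace compare equal on structure:
    a = "<html>\n  <body><p>Hello   world</p></body>\n</html>"
    b = "<html><body><p>Hello world</p></body></html>"
    print(shape_of(a) == shape_of(b))  # True

A real clone detector also compares the leaves (text, attribute values) with a similarity threshold; the skeleton check above only demonstrates the structure-ignores-formatting half of the idea.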

At the scale of 20,000 files, I can see why you want a tool, and I can see why you want near-miss matches rather than exact matches.

It doesn't run under Linux, but I assume your problem is hard enough to solve that Linux support isn't what you are optimizing for.

Ira Baxter
That sounds like almost exactly what I'm looking for. However, I can't seem to find a price on it, and I wouldn't be able to run it anyway, as I don't have access to any Windows systems. (Really good to know about, though!)
BCS
Got to love it when a company won't publish a price for their product ... "What would you like your payment to be?"
Bryan Ash