Hi,

Imagine you have two text files (say 500 kB to 3 MB each): the first is the original, the second is an updated version of it. How can I find out what was changed (inserted, deleted) and where the changes occur in the updated file compared to the original?

  1. Is there any tool or library somewhere?
  2. Is this feature built into any well-known text editors?
  3. Does anybody know an algorithm? Or what are the common methods to solve it on the large scale?
  4. What would you do if you face this kind of problem?

Thanks for your ideas...

+1  A: 

You can try Notepad++. It is an open-source text editor that has a compare-files plugin.

Itay
+2  A: 

What you're describing sounds exactly like a diff-style tool. This sort of functionality is available in many of the more advanced text editors.

Michael Madsen
A while after your answer I found out that at least OpenOffice has this feature... Thanks...
lyborko
+1  A: 

There is an extensive list of file comparison tools on Wikipedia.

If you want to do it programmatically, I've used sed and awk on Unix systems, and there are Windows versions too. These text-processing languages let you read and compare text files line by line and then do something with the differences (for example, save them to a third file).
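As an illustration of the same line-by-line idea in a general-purpose language, here is a minimal sketch using Python's standard `difflib` module rather than sed or awk. The function name `line_diff` and the labels "original" and "update" are made up for this example.

```python
import difflib


def line_diff(old_text: str, new_text: str) -> list[str]:
    """Return unified-diff lines describing how new_text differs from old_text."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    # lineterm="" because splitlines() already removed the line endings.
    return list(
        difflib.unified_diff(
            old_lines,
            new_lines,
            fromfile="original",  # hypothetical label for the first file
            tofile="update",      # hypothetical label for the second file
            lineterm="",
        )
    )


old = "alpha\nbeta\ngamma\n"
new = "alpha\nbeta changed\ngamma\ndelta\n"
for line in line_diff(old, new):
    print(line)
```

Lines starting with `-` exist only in the original, lines starting with `+` exist only in the update, and the `@@` headers say where in each file the change sits. The resulting list could just as well be written to a third file.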

amelvin
Thanks very much for the links above. I tried to develop this small tool myself, but it seemed quite difficult to reinvent what was invented a long time ago... I am not sure now whether to implement it in my application myself or to use a comparison tool each time.
lyborko
A: 

Is there any tool or library somewhere?

There are many. Try diff, a command-line file comparison utility that works well for small diffs. If the two files differ a lot, though, its output becomes hard to read. In that case you can use a visual diff tool such as DiffMerge, Kompare, or vimdiff.

Is this feature built into any well-known text editors?

Many modern editors and IDEs, such as vim and Eclipse, have a visual diff feature.

Does anybody know an algorithm? Or what are the common methods to solve it on the large scale?

It is based on the longest common subsequence algorithm, commonly abbreviated LCS.

The LCS of the old text and the new text is the part that has remained unchanged. The parts of the old text that are not in the LCS were deleted, and the parts of the new text that are not in the LCS were inserted.
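To make the LCS idea concrete, here is a minimal sketch of a line diff built on the textbook dynamic-programming LCS table. The function name `lcs_diff` and the `' '`, `'-'`, `'+'` tags are choices made for this example, not the output format of any particular tool. The table needs time and memory proportional to the product of the two line counts, so this is fine for illustration but real diff tools use more refined variants.

```python
def lcs_diff(old, new):
    """Compare two sequences of lines via their longest common subsequence.

    Returns a list of (tag, line) pairs where tag is ' ' for unchanged,
    '-' for a line only in old, and '+' for a line only in new.
    """
    n, m = len(old), len(new)
    # table[i][j] holds the LCS length of old[i:] and new[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if old[i] == new[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the start, emitting one edit per step.
    result = []
    i = j = 0
    while i < n and j < m:
        if old[i] == new[j]:
            result.append((" ", old[i]))  # part of the LCS: unchanged
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            result.append(("-", old[i]))  # not in the LCS: deleted
            i += 1
        else:
            result.append(("+", new[j]))  # not in the LCS: inserted
            j += 1
    # Whatever remains on either side is pure deletion or insertion.
    result.extend(("-", line) for line in old[i:])
    result.extend(("+", line) for line in new[j:])
    return result


if __name__ == "__main__":
    for tag, line in lcs_diff(["a", "b", "c", "d"], ["a", "c", "d", "e"]):
        print(tag, line)
```

The index of each entry in the result tells you where the change sits, which answers the "where" part of the question as well as the "what".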

What would you do if you face this kind of problem?

I'd use one of the visual diff tools mentioned to see what and where the changes were made.

codaddict
A: 

The Unix diff tool does line-by-line differences. There is also a GNU tool called wdiff, which does word-by-word differences and should be available as a package for most Linux distributions or Cygwin.
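For a feel of what word-by-word comparison looks like, here is a rough sketch in Python using the standard `difflib.SequenceMatcher` on whitespace-separated words. This is not wdiff itself, and the function name `word_diff` is made up for the example. The `[-...-]` and `{+...+}` markers are borrowed from wdiff's default output style, and the sketch collapses all whitespace to single spaces.

```python
import difflib


def word_diff(old_text: str, new_text: str) -> str:
    """Mark deleted words as [-...-] and inserted words as {+...+}."""
    old_words = old_text.split()
    new_words = new_text.split()
    # autojunk=False turns off the popularity heuristic for long inputs.
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            parts.append(" ".join(old_words[i1:i2]))
            continue
        # 'replace' has both sides, 'delete' only old, 'insert' only new.
        if i1 < i2:
            parts.append("[-" + " ".join(old_words[i1:i2]) + "-]")
        if j1 < j2:
            parts.append("{+" + " ".join(new_words[j1:j2]) + "+}")
    return " ".join(parts)


if __name__ == "__main__":
    print(word_diff("the quick brown fox", "the slow brown fox jumps"))
```

Word granularity is usually easier to read than line granularity when the text is prose and a small edit reflows a whole paragraph.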

Classic papers on the algorithm are:

Matthew Slattery
A: 

GNU Diffutils http://www.gnu.org/software/diffutils/

Pleomax