views:

417

answers:

5

I want to implement a Word document differ. What algorithms does that require?

+8  A: 

A diff is essentially just a solution to the longest common sub-sequence problem.

The optimal solution requires knowledge of dynamic programming so it's a fairly complex problem to solve.

However, it can also be done by constructing a suffix-tree. Both algorithms are outlined here.
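
To make the dynamic programming approach concrete, here is a minimal sketch in Python. The function name `lcs` and the table layout are my own choices for illustration, not taken from the linked material. It uses the textbook O(len(a) * len(b)) table, so it is fine for short sequences but not tuned for large documents.

```python
def lcs(a, b):
    """Return one longest common subsequence of sequences a and b as a list."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the top-left corner to recover one subsequence.
    result = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return result


if __name__ == "__main__":
    old_words = "the quick brown fox".split()
    new_words = "the brown lazy fox".split()
    print(lcs(old_words, new_words))
```

Because the function only needs `==` on the items, the same code works on characters, words, or whole paragraphs, depending on how the document is tokenized.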

Ben S
That generally holds when you treat your document as a stream of characters or bytes. Here, however, the question is about Word documents. Before implementing such an algorithm you need to ask yourself: is 'Hello World' in blue 8pt Verdana the same as 'Hello World' in red 10pt Arial?
quosoo
Yes, obviously the basic algorithms will require additional logic to parse for such differences, but the core of the algorithm will still be the same.
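
One way to picture that "additional logic" is to make the comparison key configurable. The sketch below is only an illustration: the `Run` type and its fields are hypothetical (they are not Word's object model), and it leans on Python's standard `difflib.SequenceMatcher` rather than a hand-written LCS.

```python
import difflib
from typing import NamedTuple


class Run(NamedTuple):
    """Hypothetical stand-in for a run of text with uniform formatting."""
    text: str
    font: str
    size_pt: int
    color: str


def changed_runs(old, new, ignore_formatting):
    """Return the non-equal opcodes between two lists of Run objects."""
    # Compare on text alone, or on the whole (text, font, size, color) tuple.
    key = (lambda run: run.text) if ignore_formatting else (lambda run: run)
    matcher = difflib.SequenceMatcher(
        None, [key(r) for r in old], [key(r) for r in new], autojunk=False
    )
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]


if __name__ == "__main__":
    old = [Run("Hello World", "Verdana", 8, "blue")]
    new = [Run("Hello World", "Arial", 10, "red")]
    print(changed_runs(old, new, ignore_formatting=True))
    print(changed_runs(old, new, ignore_formatting=False))
```

The core matching is identical in both calls; only the definition of "equal items" changes.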
Ben S
+6  A: 

See An O(ND) Difference Algorithm for C#.

Galwegian
+2  A: 

As Ben S indicated, the differencing problem can be addressed generally by solving the longest common subsequence problem. More specifically, the Hunt-McIlroy algorithm is one of the classic algorithms that have been applied to the problem (e.g. in the implementation of the Unix diff utility).

Brandon E Taylor
+6  A: 

Generally speaking, diffing is usually done by solving the longest common subsequence problem. Also see the "Algorithm" section of the Wikipedia article on Diff:

The operation of diff is based on solving the longest common subsequence problem.

In this problem, you have two sequences of items:

   a b c d f g h j q z

   a b c d e f g i j k r x y z

and you want to find the longest sequence of items that is present in both original sequences in the same order. That is, you want to find a new sequence which can be obtained from the first sequence by deleting some items, and from the second sequence by deleting other items. You also want this sequence to be as long as possible. In this case it is

   a b c d f g j z

From the longest common subsequence it's only a small step to get diff-like output:

   e   h i   q   k r x y 
   +   - +   -   + + + +
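
The "small step" from the LCS to diff-like output can be sketched as follows. This is a minimal Python illustration with a function name (`diff`) of my own choosing, using the plain dynamic programming table rather than whatever a production diff tool does. Items in the LCS are kept, everything else in the first sequence is a deletion, and everything else in the second is an insertion.

```python
def diff(a, b):
    """Return a list of (op, item) pairs, where op is ' ', '-' or '+'."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    ops = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            ops.append((" ", a[i]))      # item is part of the LCS
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            ops.append(("-", a[i]))      # only in the first sequence
            i += 1
        else:
            ops.append(("+", b[j]))      # only in the second sequence
            j += 1
    ops.extend(("-", item) for item in a[i:])
    ops.extend(("+", item) for item in b[j:])
    return ops


if __name__ == "__main__":
    first = "a b c d f g h j q z".split()
    second = "a b c d e f g i j k r x y z".split()
    for op, item in diff(first, second):
        if op != " ":
            print(op, item)
```

On the two sequences from the quoted example, the non-matching items come out as `+ e`, `- h`, `+ i`, `- q`, `+ k`, `+ r`, `+ x`, `+ y`, in line with the output shown above.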

That said, this all works fine with text-based documents. Since Word documents are effectively in a binary format, and include lots of formatting information and data, this will be far more complex. Ideally, you could look into automating Word itself, as it has the ability to "diff" between documents, as detailed here:

Microsoft Word Tip: How to compare two documents for differences
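
If automating Word is not an option, one possible starting point is to pull the plain text out first and diff that. The sketch below assumes the files are .docx (Office 2007 and later), which are ZIP archives with the main body stored as XML in `word/document.xml`; it does not apply to the older binary .doc format. The function names are my own, and all formatting is ignored.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def paragraphs_from_xml(document_xml):
    """Return the text of each w:p paragraph element, formatting discarded."""
    root = ET.fromstring(document_xml)
    # Note: paragraphs nested inside other paragraphs (e.g. text boxes)
    # would be reported twice by this simple traversal.
    return [
        "".join(t.text or "" for t in p.iter(W + "t"))
        for p in root.iter(W + "p")
    ]


def read_docx_paragraphs(path):
    """Read word/document.xml out of a .docx archive and split it into paragraphs."""
    with zipfile.ZipFile(path) as archive:
        return paragraphs_from_xml(archive.read("word/document.xml"))
```

The resulting lists of paragraphs (or words within them) can then be fed to any sequence-based diff.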

CraigTP
There are two purposes for a diff algorithm implementation: to store only the differences between versions, or to show the differences between versions. These are vastly different (no pun intended). LCS is usually only usable for showing the differences; for optimal storage, more advanced algorithms are needed. For instance, if you cut a large portion from one section of the document and paste it into another section, a good storage algorithm would detect that and not store it as "hey, a lot of new data just appeared here".
Lasse V. Karlsen
@Lasse - Agreed. Since the original question asker was talking about Word documents, I assumed they would be more interested in the "visual" side of diffing, rather than the storage side. However, you're correct in that for the storage side, you'd be looking into Delta Encoding/Compression (http://en.wikipedia.org/wiki/Delta_encoding) etc.
CraigTP
+1  A: 

The most optimized solution to LCS is Myers' O(ND) algorithm, and here is an algorithmic approach which I used to diff Office 2007 documents. Link to algorithm paper
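
For a rough idea of how the O(ND) approach works, here is a minimal Python sketch of the greedy forward search. It is a simplification, not the implementation I used: the function name is made up for illustration, and it only returns D, the number of insertions plus deletions in a shortest edit script. Recovering the script itself would require keeping the state of each round, which is omitted here.

```python
def myers_edit_distance(a, b):
    """Return the length D of a shortest insert/delete edit script from a to b."""
    n, m = len(a), len(b)
    max_d = n + m
    # furthest[k] is the largest x reached so far on diagonal k, where k = x - y.
    furthest = {1: 0}
    for d in range(max_d + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and furthest[k - 1] < furthest[k + 1]):
                x = furthest[k + 1]        # move down: insert an item from b
            else:
                x = furthest[k - 1] + 1    # move right: delete an item from a
            y = x - k
            # Follow the diagonal for free while the items match.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            furthest[k] = x
            if x >= n and y >= m:
                return d
    return max_d


if __name__ == "__main__":
    print(myers_edit_distance("ABCABBA", "CBABAC"))
```

The running time depends on the size of the difference D rather than only on the input lengths, which is why it does well on documents that are mostly identical.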

Sunny