I would like to show differences between two blocks of text. Rather than comparing lines of text or individual characters, I would like to just compare words separated by specified characters ('\n', ' ', '\t' for example). My main reasoning for this is that the block of text that I'll be comparing generally doesn't have many line breaks in it and letter comparisons can be hard to follow.

I've come across the following O(ND) logic in C# for comparing lines and characters, but I'm sort of at a loss for how to modify it to compare words.

In addition, I would like to keep track of the separators between words and make sure they're included with the diff. So if a space is replaced by a hard return, I would like that to come up as a diff.

I'm using ASP.NET (C#) to display the entire block of text, including the deleted original text and the added new text (both will be highlighted to show that they were deleted or added). A solution that works with those technologies would be appreciated.

Any advice for how to accomplish this is appreciated.

+1  A: 

Other than a few general optimizations, if you need to include the separators in the comparison, you are essentially doing a character-by-character comparison with breaks. You could use the O(ND) code you linked, but you would have to change so much of it that you would basically be writing your own.

The main problem with difference comparison is finding the continuation (if I delete a single word, but leave the rest the same).

If you want to use their code, start with the example and do not write out the deleted characters; if there are replaced characters in the same place, do not output that result. You then need to compute the longest continuous run of "changed" words, highlight that string, and output it.
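One way to handle the continuation problem described above is a longest-common-subsequence (LCS) comparison over word tokens. The following is only a minimal sketch under stated assumptions: it is a plain dynamic-programming LCS, not the O(ND) code from the question, and the names DiffOp, TokenDiff and Compare are hypothetical. The table needs O(N*M) memory, so it is probably only suitable for moderately sized texts.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical names for illustration; not taken from the linked O(ND) code.
public enum DiffOp { Equal, Delete, Insert }

public static class TokenDiff
{
    // Compares two token arrays (words and separators) and returns the
    // tokens in order, each tagged as unchanged, deleted, or inserted.
    public static List<(DiffOp Op, string Token)> Compare(string[] oldTokens, string[] newTokens)
    {
        int n = oldTokens.Length, m = newTokens.Length;

        // lcs[i, j] = length of the longest common subsequence of
        // oldTokens[i..] and newTokens[j..].
        var lcs = new int[n + 1, m + 1];
        for (int i = n - 1; i >= 0; i--)
        {
            for (int j = m - 1; j >= 0; j--)
            {
                lcs[i, j] = oldTokens[i] == newTokens[j]
                    ? lcs[i + 1, j + 1] + 1
                    : Math.Max(lcs[i + 1, j], lcs[i, j + 1]);
            }
        }

        // Walk the table from the start, emitting one operation per token.
        var result = new List<(DiffOp Op, string Token)>();
        int a = 0, b = 0;
        while (a < n && b < m)
        {
            if (oldTokens[a] == newTokens[b])
            {
                result.Add((DiffOp.Equal, oldTokens[a]));
                a++;
                b++;
            }
            else if (lcs[a + 1, b] >= lcs[a, b + 1])
            {
                result.Add((DiffOp.Delete, oldTokens[a]));
                a++;
            }
            else
            {
                result.Add((DiffOp.Insert, newTokens[b]));
                b++;
            }
        }

        // Whatever is left on either side is a pure deletion or insertion.
        while (a < n) result.Add((DiffOp.Delete, oldTokens[a++]));
        while (b < m) result.Add((DiffOp.Insert, newTokens[b++]));
        return result;
    }
}
```

Because a deleted word only shifts the alignment instead of breaking it, the rest of the text still lines up as Equal. For display, the Delete and Insert entries could be wrapped in highlighted markup and the Equal entries written out unchanged.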

Sorry, that's not much of an answer, but for this problem the answer is basically writing and tuning the function.

GrayWizardx
A: 

Well, String.Split with '\n', ' ' and '\t' as the split characters will return an array of the words in your block of text.

You could then compare each array for differences. A simple 1:1 comparison would tell you if any word had been changed. Comparing:

hello world how are you

and:

hello there how are you

would tell you that world had changed to there.

What it wouldn't tell you is whether words had been inserted or removed, and you would still need to parse the text blocks character by character to see whether any of the separator characters had changed.
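One possible way around the separator limitation is to keep the separators as tokens of their own instead of discarding them. The sketch below assumes the three separators from the question and uses Regex.Split, which includes captured text in its result when the pattern has a capturing group. The names WordTokenizer and Tokenize are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration.
public static class WordTokenizer
{
    // Splits text into words and single separator characters, in their
    // original order. The capturing group makes Regex.Split return each
    // separator as its own array element instead of dropping it.
    // Only ' ', '\t' and '\n' are treated as separators here (an assumption
    // based on the question); '\r' would stay attached to a word.
    public static string[] Tokenize(string text)
    {
        return Regex.Split(text, @"([ \t\n])")
                    .Where(token => token.Length > 0) // drop empties between adjacent separators
                    .ToArray();
    }
}
```

With this, Tokenize("hello world\nhow") yields five tokens: hello, a space, world, a newline, and how. Concatenating the tokens reproduces the input, and a space replaced by a hard return shows up as one changed token when the two arrays are compared.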

ChrisF
I'm afraid that String.Split for large blocks of text will be inefficient.
Vadmyst
+5  A: 

Microsoft has released a diff project on CodePlex that allows you to do word, character, and line diffs. It is licensed under the Microsoft Public License (Ms-PL).

http://diffplex.codeplex.com/

Jim Geurts
DiffPlex lets you define a custom function for how to partition the text before it is diffed. You can use the method: DiffResult CreateCustomDiffs(string oldText, string newText, bool ignoreWhiteSpace, Func<string, string[]> chunker) where chunker tells DiffPlex what are the atomic units to compare against each other.
Matthew Manela