Is there a diff algorithm that does not group unrelated blocks?

For example:

hello world
lorem ipsum dolor sit amet

vs.

Hello World
Lorem Ipsum Dolor Sit Amet

Comparing these (e.g. with standard Unix diff) generally results in the following:

< hello world
< lorem ipsum dolor sit amet
---
> Hello World
> Lorem Ipsum Dolor Sit Amet

However, a line-by-line comparison like the following would seem more sensible:

< hello world
---
> Hello World

< lorem ipsum dolor sit amet
---
> Lorem Ipsum Dolor Sit Amet

The latter, IMO, makes it much easier to analyze minor changes. (Note that I'm concerned with human readability here, not machine readability.)

I understand diff'ing is a complex issue, but this often leaves me puzzled nonetheless.

+2  A: 

Although diff behaves like that intentionally, you can change the result by inserting a blank line after each line in both files. That gives you the output you want.

1:

hello world

lorem ipsum dolor sit amet

Same

2:

Hello World

Lorem Ipsum Dolor Sit Amet

Same

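The line numbers have to be corrected afterwards, though: line n of the spaced-out file corresponds to line n/2 + 1 of the original (integer division).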

1c1
< hello world
---
> Hello World
3c3
< lorem ipsum dolor sit amet
---
> Lorem Ipsum Dolor Sit Amet
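The workaround and the line-number correction can be sketched in a few lines of Python. This is only an illustration: the helper names `interleave_blank_lines` and `original_line_number` are made up for this example, and the sketch appends a blank line after every line, including the last one.

```python
def interleave_blank_lines(lines):
    """Return a copy of `lines` with an empty line after every entry."""
    spaced = []
    for line in lines:
        spaced.append(line)
        spaced.append("")  # separator that diff can match on both sides
    return spaced


def original_line_number(n):
    """Map a 1-based line number in the spaced-out file back to the original.

    Content lines sit on odd numbers (1, 3, 5, ...), so integer
    division gives n // 2 + 1, as described above.
    """
    return n // 2 + 1


spaced = interleave_blank_lines(["hello world", "lorem ipsum dolor sit amet"])
# The "3c3" hunk in the output above therefore refers to original line 2.
print(original_line_number(3))
```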

If the number of lines changes (here, one line replaces several), the output may still not be what you want:

1,3c1
< hello world
<
< lorem ipsum dolor sit amet
---
> Hello World
Thomas Jung
Thanks - I have used this workaround before, but it's not a viable generic solution (see my response to mizipzor). I suppose the LCS problem explains why it is the way it is, so I'll just have to live with it...
AnC
Don't live with it; every software breakthrough starts with an annoyed programmer ;)
mizipzor
Hehe - sadly, last time I delved into diff algorithms they made my head spin...
AnC
+1  A: 

The diff algorithm is a solution to the longest common subsequence problem. However, it seems you're not really after a different algorithm: related or not, both lines have changed, so what you are talking about is how the difference is presented as text.
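To see why the two lines end up in one block, here is a small sketch using Python's standard `difflib` module. Note the assumption: `difflib.SequenceMatcher` uses its own matching heuristic rather than the exact algorithm of Unix diff, but it illustrates the same effect. Since the two files share no identical line, there is nothing to anchor on between the changed lines, and the whole range is reported as a single replacement.

```python
import difflib

old = ["hello world", "lorem ipsum dolor sit amet"]
new = ["Hello World", "Lorem Ipsum Dolor Sit Amet"]

# Compare the files as sequences of whole lines.
opcodes = difflib.SequenceMatcher(None, old, new).get_opcodes()

# No line is common to both sides, so there is one opcode covering
# old[0:2] and new[0:2]: a single two-line "replace" block.
print(opcodes)  # [('replace', 0, 2, 0, 2)]
```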

Thomas Jung showed the original format. Wikipedia shows a few variations, and it is worth taking the time to experiment with them.

diff original new

Will produce the original format.

diff -c original new

Will produce the context format.

diff -u original new

Will produce the unified format. As a bit of trivia, this is the most commonly used one; patches to open source projects are more often than not requested in this format.

Of course, if the way the difference is presented to you is crucial, I think you will find any of the diff viewers vastly superior.

mizipzor
Thanks - I know about the different formats (I generally work with `git diff`), but they all present the same issue. This applies to both code and non-code (e.g. wikis) scenarios; minor changes - like indentation or typo corrections - can appear dramatic because it's not clear that each individual line just differs slightly from what it was before.
AnC
Did you check the graphical viewers? Some of them highlight not only the changed line but also the changed characters within that line. I like that when the lines are a little too long; it might help you as well. Also note that in most graphical viewers the lines are not "grouped together" in any way. They don't need to be, since a change is usually indicated by the line's background color.
mizipzor
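For illustration, character-level highlighting of the kind described above can be approximated with Python's standard `difflib`. This is a rough sketch, not how any particular viewer works: the function name `mark_changes` and the square-bracket markers are invented for this example, and deleted characters are simply not shown.

```python
import difflib


def mark_changes(old, new):
    """Return `new` with changed or inserted character runs wrapped in [ ]."""
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    pieces = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        segment = new[j1:j2]
        if tag == "equal":
            pieces.append(segment)
        elif segment:  # "replace" or "insert"; a pure "delete" adds nothing
            pieces.append("[" + segment + "]")
    return "".join(pieces)


print(mark_changes("hello world", "Hello World"))  # [H]ello [W]orld
```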
I have checked various different options - but take GitHub's diff visualization, for example; while that highlights inline changes, it only works if such changes are not on subsequent lines (i.e. blocks take precedence).
AnC
Have you considered writing your own? A strictly line-by-line comparison should be very simple to implement... it could be something as simple as piping git diff output to a script you wrote!
mizipzor
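A minimal sketch of such a line-by-line presentation, under some assumptions: it takes two lists of lines instead of parsing git diff output, it relies on Python's standard `difflib` to find the changed blocks, and the function name `line_by_line_diff` is made up here. Inside each changed block, the k-th old line is simply paired with the k-th new line; leftover lines are printed on their own.

```python
import difflib
from itertools import zip_longest


def line_by_line_diff(old_lines, new_lines):
    """Format changes as one small '<' / '---' / '>' hunk per line pair."""
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    hunks = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Pair lines by position within the changed block; zip_longest
        # pads the shorter side with None when the block sizes differ.
        for old, new in zip_longest(old_lines[i1:i2], new_lines[j1:j2]):
            parts = []
            if old is not None:
                parts.append("< " + old)
            if old is not None and new is not None:
                parts.append("---")
            if new is not None:
                parts.append("> " + new)
            hunks.append("\n".join(parts))
    return "\n\n".join(hunks)


result = line_by_line_diff(
    ["hello world", "lorem ipsum dolor sit amet"],
    ["Hello World", "Lorem Ipsum Dolor Sit Amet"],
)
print(result)
```

For the example from the question this prints two separate hunks, one per line. Pairing by position is a naive choice: when lines are inserted or removed inside a changed block, the pairs may no longer be the lines that actually belong together.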
I've considered this - but that would mean it's limited to my local setup, and I've come to realize the issues I've mentioned are mainly of concern in a collaborative context (e.g. GitHub)...
AnC