I'm looking for the appropriate algorithm to use to compare two files. I think I can do better than diff due to some added constraints.

What I have are two text files each containing a list of files. They are snapshots of all the files on a system taken at two different times. I want to figure out which files have been added or deleted between the two snapshots.

I could use diff to compare these files, but I don't want to because:

  1. diff tries to group changes together, finding which chunks in a file have changed. I'm only looking for a list of lines that have changed, and that should be a much simpler problem than finding the longest-common-subsequence or some such thing.

  2. Generalized diff algorithms are O(mn) in runtime or space. I'm looking for something more like O(m+n) in time and O(1) in space.

Here are the constraints on the problem:

  1. The file listings are in the same order in both files. They are not necessarily in alphabetical order, but they are in the same relative order.

  2. Most of the time there will be no differences between the lists. If there are differences, there will usually only be a handful of new/deleted files.

  3. I don't need to group the results together, like saying "this entire directory was deleted" or "lines 100-200 are new". I can individually list each line that is different.

I'm thinking this is equivalent to the problem of having two sorted lists and trying to figure out the differences between the two lists. The hitch is the list items aren't necessarily sorted alphabetically, so you don't know if one item is "greater" than another. You just know that the files that are present in both lists will be in the same order.

For what it's worth, I previously posted this question on Ask Metafilter several years ago. Allow me to respond to several potential answers upfront.

Answer: This problem is called Longest Common Subsequence.

Response: I'm trying to avoid the longest common subsequence because simple algorithms run in O(mn) time/space and better ones are complicated and more "heuristical". My intuition tells me that there is a linear-time algorithm due to the added constraints.

Answer: Sort them alphabetically and then compare.

Response: That would be O(m log m+n log n), which is worse than O(m+n).

+8  A: 

Read one file, placing each file-name into a HashSet-like data structure with O(1) add and O(1) contains implementations.

Then read the second file, checking each file-name against the HashSet.

If the first file has length m and the second has length n, the total algorithm is O(m+n), as required.

Note: To be fast, this algorithm assumes the data set fits comfortably in physical memory.

If the data set cannot easily fit in memory, the lookup could be implemented using some variation of a B-Tree with disk paging. The complexity would then be O(m log m) to set up initially and O(n log m) for each subsequent file comparison.
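
A minimal sketch of this approach in Python; the function name and file handling are my own illustration, not part of the answer, and it assumes each snapshot is a text file with one file name per line:

def added_files(old_snapshot, new_snapshot):
    # Build a set from the first listing: O(m) time, with O(1) average lookups later.
    with open(old_snapshot) as f:
        old_names = set(line.rstrip('\n') for line in f)
    # Scan the second listing, reporting names not seen in the first: O(n).
    added = []
    with open(new_snapshot) as f:
        for line in f:
            name = line.rstrip('\n')
            if name not in old_names:
                added.append(name)
    return added

Running it again with the two snapshot files swapped yields the deleted names instead of the added ones.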

Ben S
+1, I love how you're _not_ needing the tricky "same ordering" bit!-)
Alex Martelli
There are usually more processor cycles than memory.
ilya n.
Good one. I've edited the question to specify an O(1) space requirement. Reading either file into memory seems like overkill, an oversight of my original specification. :-)
John Kugelman
I added a new answer with a better memory constraint.
Ben S
OP does say that O(m log m + n log n) isn't good enough, though... so O((m + n) log m) probably isn't either. :(
ephemient
Yes, he edited his question due to my answer of throwing memory at it. I've started a new answer with a more memory sensitive solution.
Ben S
+2  A: 

From a theoretical point of view, computing the edit distance between two strings (because here you have strings in a funny language where a 'character' is a file name) cannot be done in O(m+n). But here we have simplifications.

An implementation of an algorithm for your case (it probably contains mistakes):

# i[0], i[1] are undoable iterables; at the end they both return Null

while (a = i[0].next()) && (b = i[1].next()) :    # read one item from each stream
    if a != b:                 # skip if they are identical
        c = [[a],[b]]          # otherwise, prepare two fast arrays to store differences
        for (w = 1; ; w = 1-w):                   # and read from one stream at a time
            nxi = Null
            nx = i[1-w].next()                    # read the next item from the other stream
            if nx is Null:                        # end of that stream:
                nxi = len(c[w])                   #   everything stored so far is a difference
            elif nx in c[w]:                      # the new item matches one stored earlier
                nxi = c[w].index(nx)
            else:
                c[1-w].append(nx)                 # no match yet: remember it and keep reading
            if nxi is not Null:                   # we found where the streams re-sync
                for cc in c[1-w]: yield cc               # output the items stored from one stream
                for cc in c[w][0:nxi]: yield cc          # and the ones stored before the match
                for cc in c[w][nxi+1:]: i[w].undo(cc)    # put the remainder back on its stream
                break                             # and return to the normal cycle

# one of the streams finished
if a: yield a
if b: yield b
for ci in i:
    while (cc = ci.next()): yield cc

There are data structures that I call fast arrays -- they are probably HashSet-like things, but ones that remember ordering. Addition and lookup in them should be O(log N), and the memory use is O(N).

This doesn't use any memory or cycles beyond O(m+n) outside of finding differences. For every 'difference block' -- the operation that can be described as taking away M consecutive items and adding N new ones -- this takes O(M+N) memory and O(M log N + N log M) instructions. The memory is released after a block is done, so this isn't much of an issue if you indeed only have small changes. Of course, the worst-case performance is as bad as with the generic method.
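
The pseudocode above also assumes 'undoable iterables'. A minimal sketch of such a wrapper in Python (the class name and details are illustrative, not part of the answer; pushed-back items are re-read in the order they were pushed, which is how the pseudocode uses undo):

from collections import deque

class UndoableIterator(object):
    """Wraps an iterable, returns None (the pseudocode's Null) at the end,
    and lets the caller push items back to be re-read later."""
    def __init__(self, iterable):
        self._it = iter(iterable)
        self._pushed_back = deque()                  # items waiting to be re-read

    def next(self):
        if self._pushed_back:
            return self._pushed_back.popleft()       # re-read pushed-back items first
        try:
            return next(self._it)
        except StopIteration:
            return None                              # end of stream

    def undo(self, item):
        self._pushed_back.append(item)               # queue the item for re-reading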

ilya n.
The important thing to realize is that we don't actually need to compare the Strings. The problem is different.
Ben S
He does need to compare two strings over an alphabet where a character is a possible file name. But we can do it more efficiently if we know some constraints.
ilya n.
Interesting. The quadratic cost could be improved to linear by throwing some memory at the problem a la Ben S's answer. In particular, add the lines in `A` to `setA` and the lines in `B` to `setB`, checking each successive line in each file against the opposite file's set. Once you find a line that is in the opposite set (or you reach EOF in both files) you have re-synchronized yourself. That seems like a darn fast algorithm!
John Kugelman
It can actually be done with less memory than my original answer. I added a new solution.
Ben S
Also, if you're not familiar with iterators, change yield cc -> print "Found difference: ", cc
ilya n.
@ilya n: No, you don't need to compare it to everything, you just need to hash it, and see if there's a collision. That's constant time, not quadratic.
Ben S
@Ben S. Thanks! I changed it. Note though it's not constant; I'll reply in a comment to your answer.
ilya n.
+8  A: 

This isn't quite O(1) memory (the memory requirement is on the order of the number of changes), but it's O(m+n) runtime.

It's essentially a buffered streaming algorithm that at any given line knows the difference of all previous lines.

// Pseudo-code:
initialize HashMap<Line, SourceFile> changes = new empty HashMap
while (lines left in A and B) {
    read in lineA from file A
    read in lineB from file B

    if (lineA.equals(lineB)) continue

    if (changes.contains(lineA) && changes.get(lineA).SourceFile != A) {
         changes.remove(lineA)
    } else {
         changes.add(lineA, A)
    }

    if (changes.contains(lineB) && changes.get(lineB).SourceFile != B) {
         changes.remove(lineB)
    } else {
         changes.add(lineB, B)
    }
}

for each (line in longerFile) {
    if (changes.contains(line) && changes.get(line).SourceFile != longerFile) {
         changes.remove(line)
    } else {
         changes.add(line, longerFile)
    }
}

Lines in the HashMap from SourceFile == A have been removed
Lines in the HashMap from SourceFile == B have been added

This heavily relies on the fact that the files are listed in the same relative order. Otherwise, the memory requirement would be much larger than the number of changes. However, due to that ordering this algorithm shouldn't use much more memory than 2 * numChanges.
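
For concreteness, here is one way the pseudocode might look as runnable Python. It's a sketch under the assumption that the two snapshots are in-memory lists of lines (a real streaming version would read the files lazily); the function name is mine, not part of the answer:

def snapshot_diff(a_lines, b_lines):
    changes = {}                      # line -> 'A' or 'B' (the file it came from)

    def note(line, source):
        # A line seen once in each file cancels out; otherwise record its source.
        if line in changes and changes[line] != source:
            del changes[line]
        else:
            changes[line] = source

    # Walk both listings in lock-step.
    for line_a, line_b in zip(a_lines, b_lines):
        if line_a == line_b:
            continue
        note(line_a, 'A')
        note(line_b, 'B')

    # Handle the tail of whichever listing is longer.
    shorter = min(len(a_lines), len(b_lines))
    for line in a_lines[shorter:]:
        note(line, 'A')
    for line in b_lines[shorter:]:
        note(line, 'B')

    deleted = [line for line, src in changes.items() if src == 'A']
    added = [line for line, src in changes.items() if src == 'B']
    return deleted, added

For example, snapshot_diff(['a', 'b', 'c'], ['a', 'c', 'd']) returns (['b'], ['d']).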

Ben S
I'm not sure that's right... on inputs (1, 2, 3, 4) and (1, 3, 4) it will 'remove 1A and 1B', then 'add 2A and 3B' and 'add 3A and 4B' and outside the read loop 'add 4A' ... when clearly '2A' is the only difference.
jerryjvl
Yeah, I noticed, it's fixed now.
Ben S
The hashing is done only on the line, and maps to the source file, so my old algorithm would actually have 3 and 4 as additions.
Ben S
Yes, your algorithm, if you expand it, will be essentially the same as mine. The crucial thing is to note that when you hash a very small number of elements, you might as well just do an array :)
ilya n.
That looks a lot better... I'd recommend making the operations on your hash-set a little more explicit though ('add (line, A) from HashSet', etc.) it'll make the solution more readable and understandable :)
jerryjvl
I wanted to write a more detailed explanation of how my solution works, but your pseudocode nailed it perfectly, thanks! Still, you don't need HashSets at all; you can be perfectly fine with an ordered set (which is an array).
ilya n.
Well, mine is still somewhat different... you use memory O(total changes); I go into a clean state after each 'block', so I use O(maximum block of changes).
ilya n.
Ordered set = log n insert and log n lookup; hashes are required for constant-time additions, removals and lookups.
Ben S
Yes, you were partly right on that. But that's not constant, that's logarithmic. It's impossible to have a data structure with constant lookup time. Of course, we're talking here about N, the size of a block; we're promised it's small, so log N is **essentially** constant.
ilya n.
But theoretically it's not a constant! Because otherwise we would suddenly have a generic O(m+n) algorithm.
ilya n.
@ilya n: "It's impossible to have a data structure with constant lookup time." Sure it's possible, that's the purpose of a hashset/hashtable. See: http://en.wikipedia.org/wiki/Perfect_hash_function
Ben S
Really close. I think the edge case logic when handling the longer file is not quite right, though. Consider diffing the lists [2] and [1, 2]. It looks like "2" will get added to the change list if I'm interpreting "while (lines left in A and B)" correctly. Also, a quick escape at the top of the loop would be dandy, e.g. "read line from A; read line from B; if A = B: continue". That'd save on a lot of adds followed by immediate removes.
John Kugelman
Alright, should be pretty much correct now.
Ben S
No, the hash function needs log N bits and then it needs to compute these log N bits. This means if you have N data items and want to compute the hash function, it's log N. Otherwise you would have an O(m+n) solution to the string subsequence problem, which would be quite unexpected.
ilya n.
Hashing doesn't add to the big-O efficiency. You could calculate the hash code for each file name as you read them in. The _m_ and _n_ in _O(m+n)_ are the number of file names in each list, not the lengths of those file names. You can pretend the problem involves lists of integers as per ephemient's answer, if you wish, that is equivalent.
John Kugelman
Well, I think here we are "arguing" about whether there is 1 or log N somewhere. Let me remind you what N is: N is a typical diff block size. So N is a small number. It's so small that it's obviously smaller than filename_length (otherwise filenames in your set/my array collide, which is bad). I can reformulate this with integers: then we will have integers with *more than log N bits*.
ilya n.
*If* N is large, comparing integers of log N bit length actually does take log N time, even though they are integers! However, if you assume the input contains fewer files than an int can represent, then the processor actually compares ints in fixed time. In other words we can assume log N = const, and so we're both right.
ilya n.
The only time it's relevant is the worst case. And we can't have N too big anyway, so the effects of log N won't be very visible.
ilya n.
@ilya: Hashtables provide O(1) lookup time *provided* that (a) the size of the hashtable is bounded by a constant and (b) certain pathological sequences of keys have not been added. (Of course, *any* data structure provides O(1) lookup time if the size of the data structure is bounded...) Although assumption (a) is often stretched to breaking point by people claiming fast algorithms for unbounded N, in this case you might reasonably claim that the number of file changes between backups is bounded by a constant independent of N.
j_random_hacker
@j_random_hacker: of course, it's theoretically bound by log(N) which is practically bound by log(10^6) which is a constant :)
ilya n.
+1  A: 

If you accept that dictionaries (hash maps) are O(n) space and O(1) insert/lookup, this solution ought to be O(m+n) in both time and space.

from collections import defaultdict
def diff(left, right):
    left_map, right_map = defaultdict(list), defaultdict(list)
    for index, object in enumerate(left): left_map[object] += [index]
    for index, object in enumerate(right): right_map[object] += [index]
    i, j = 0, 0
    while i < len(left) and j < len(right):
        if left_map[right[j]]:
            i2 = left_map[right[j]].pop(0)
            if i2 < i: continue
            del right_map[right[j]][0]
            for i in range(i, i2): print '<', left[i]
            print '=', left[i2], right[j]
            i, j = i2 + 1, j + 1
        elif right_map[left[i]]:
            j2 = right_map[left[i]].pop(0)
            if j2 < j: continue
            del left_map[left[i]][0]
            for j in range(j, j2): print '>', right[j]
            print '=', left[i], right[j2]
            i, j = i + 1, j2 + 1
        else:
            print '<', left[i]
            i = i + 1
    for i in range(i, len(left)): print '<', left[i]    # flush any remaining lines in left
    for j in range(j, len(right)): print '>', right[j]  # flush any remaining lines in right
>>> diff([1, 2, 1, 1, 3,    5, 2,    9],
...      [   2, 1,    3, 6, 5, 2, 8, 9])
< 1
= 2 2
= 1 1
< 1
= 3 3
> 6
= 5 5
= 2 2
> 8
= 9 9

Okay, slight cheating, as list.pop(0) and del list[0] are only O(1) if the lists are linked lists, which Python lists aren't... but that's the idea, anyhow.
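
For what it's worth, a hypothetical tweak (mine, not ephemient's) that would make the front removals genuinely O(1) is to store each line's index list as a collections.deque:

from collections import defaultdict, deque

left = ['a', 'b', 'a', 'c']              # example listing
left_map = defaultdict(deque)
for index, name in enumerate(left):
    left_map[name].append(index)         # O(1) append at the right end

print left_map['a'].popleft()            # O(1) removal from the left end; prints 0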

ephemient
Yes, I've already suggested something similar to this. The O(m+n) space is not acceptable; the set is too large.
Ben S
I can shrink memory usage in the average case, but avoiding a worst-case O(m+n) space while maintaining O(m+n) time is challenging. I'll sleep on it and see if I come up with something else tomorrow...
ephemient
Worst-case space will be O(m+n), I don't think it's possible to sidestep that.
ilya n.
A: 

A refinement of ephemient's answer, this only uses extra memory when there are changes.

def diff(left, right):
    i, j = 0, 0

    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            print '=', left[i], right[j]
            i, j = i+1, j+1
            continue

        # Mismatch: scan ahead in both lists, remembering what we've seen,
        # until we find a line that re-synchronizes them.
        old_i, old_j = i, j
        left_set, right_set = set(), set()

        while i < len(left) or j < len(right):
            if i < len(left) and j < len(right) and left[i] == right[j]:
                # Both sides re-synchronized at the same spot.
                for i2 in range(old_i, i): print '<', left[i2]
                for j2 in range(old_j, j): print '>', right[j2]
                break

            elif i < len(left) and left[i] in right_set:
                # left[i] appeared earlier in right: the skipped left lines were deleted.
                for i2 in range(old_i, i): print '<', left[i2]
                j = old_j
                break

            elif j < len(right) and right[j] in left_set:
                # right[j] appeared earlier in left: the skipped right lines were added.
                for j2 in range(old_j, j): print '>', right[j2]
                i = old_i
                break

            else:
                if i < len(left):  left_set.add(left[i])
                if j < len(right): right_set.add(right[j])
                i, j = i+1, j+1
        else:
            # Ran off the end of both lists without re-synchronizing.
            for i2 in range(old_i, min(i, len(left))): print '<', left[i2]
            for j2 in range(old_j, min(j, len(right))): print '>', right[j2]

    while i < len(left):
        print '<', left[i]
        i = i+1

    while j < len(right):
        print '>', right[j]
        j = j+1

Comments? Improvements?

John Kugelman
Yes, it looks like the same thing I did, but you did it cleaner.
ilya n.
Note that you would have to rewrite it with iterators, because holding the input data in arrays is unrealistic -- it's already agreed that the inputs can be big.
ilya n.
Actually, I'm not sure you're exiting correctly when left[i] is in right_set.
ilya n.
Storing a small set of lines in memory while you're trying to figure out what's changed is fine by me, if you want to incorporate that into your solution. Or perhaps we can get two solutions written up, one that is slower but uses no extra memory (yours) and one that is faster but uses a bit of extra memory figuring out what's changed (mine, Ben S's).
John Kugelman
No, no, I'm pointing to a slightly different thing. First, our solutions are the same at this point; mine is not slower; well, I *said* it's N log N, where N is a typical diff size, but this exists in your solution too -- HashSet is NOT constant lookup! HashSet is log lookup. Second, I want to point out that you can't have arrays left, right like you do, since we agreed not to use O(m+n) of memory and those were files in the problem. So, you'll have to read streams/iterators. Third, Ben S's solution is now slower than yours (+mine), but can be fixed, after which we'll have 3 identical solutions.
ilya n.
It's just that N is so small, essentially a constant, that log N is "even more certainly a constant". Of course, in the worst case log N = log n = original problem.
ilya n.
According to Wikipedia: "In a well-dimensioned hash table, the average cost (number of instructions) for each lookup is independent of the number of elements stored in the table. Many hash table designs also allow arbitrary insertions and deletions of key-value pairs, at constant average cost per operation."
John Kugelman
More precisely, hash table lookup time is amortized constant-time. First you calculate an item's hash code and find its bucket. Then you search through that bucket for the item. If buckets are linked lists then the time it takes to do this is proportional to the average number of items in each bucket. In a "well-dimensioned" hash table the number of buckets is based on the number of items in the hash table, and when this passes a certain ratio the hash table is enlarged with more buckets to maintain a low average # of items per bucket.
John Kugelman
When you search the bucket, comparison is NOT constant; comparison is log N. To compare two objects of which you only know that they can be represented as numbers from 1 to 2^N, you need to query for N bits of information. The misunderstanding comes from the fact that we always speak of comparison being constant, but it is so when we compare ints or bools or ulongs or whatever the processor compares with CMP AX, BX or similar. Here, if you had to put N objects into your set, you MUST spend at least log N on average to look them up.
ilya n.
I posted comments above. I think if we still don't agree on 1 = log N, we should open a separate question.
ilya n.
It's not important how long it takes to compare file names to each other. It's the number of comparisons that is important, not the cost of the comparisons. A hash table does an (amortized) constant NUMBER of comparisons. Or put it this way: if you consider algorithmic complexity in terms of the number of bits per item B and number of items N, then a hash table lookup is O(B) and a binary tree lookup is O(B log N).
John Kugelman
It's not really good that we are talking about numbers, since those would have to be ridiculously large for this to matter. Let's talk about string of length K (or other objects). Let's abstract from our problem. I'd say a *reasonable* definition is that comparing two strings of length K is O(K). You say it's 1 comparison. We're both right in that in a sense. When you define algorithmic complexity, you could certainly adopt different definitions. For the original problem this doesn't matter. A hash table lookup is O(B*comparison_cost). Does this settle things or shall we open a question?
ilya n.
A: 

In practice, a log factor difference in sorting times is probably insignificant -- sort can sort hundreds of thousands of lines in a few seconds. So you don't actually need to write any code:

sort filelist1 > filelist1.sorted
sort filelist2 > filelist2.sorted
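# comm -3 suppresses the lines common to both files, leaving only the names unique to each snapshot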
comm -3 filelist1.sorted filelist2.sorted > changes

I'm not claiming that this is necessarily the fastest solution -- I think Ben S's accepted answer will be, at least above some value of N. But it's definitely the simplest, it will scale to any number of files, and (unless you are the guy in charge of Google's backup operation) it will be more than fast enough for the number of files you have.

j_random_hacker
Indeed. In reality, I changed the client program that generated the file listings to simply generate them in alphabetical order, a simple enough change. That made it trivial to compare the two listings.
John Kugelman