I'm looking for the appropriate algorithm to use to compare two files. I think I can do better than diff
due to some added constraints.
What I have are two text files each containing a list of files. They are snapshots of all the files on a system taken at two different times. I want to figure out which files have been added or deleted between the two snapshots.
I could use diff
to compare these files, but I don't want to because:
diff
tries to group changes together, finding which chunks in a file have changed. I'm only looking for a list of lines that have changed, and that should be a much simpler problem than finding the longest-common-subsequence or some such thing.Generalized diff algorithms are O(mn) in runtime or space. I'm looking for something more like O(m+n) in time and O(1) in space.
Here are the constraints on the problem:
The file listings are in the same order in both files. They are not necessarily in alphabetical order, but they are in the same relative order.
Most of the time there will be no differences between the lists. If there are differences, there will usually only be a handful of new/deleted files.
I don't need to group the results together, like saying "this entire directory was deleted" or "lines 100-200 are new". I can individually list each line that is different.
I'm thinking this is equivalent to the problem of having two sorted lists and trying to figure out the differences between the two lists. The hitch is the list items aren't necessarily sorted alphabetically, so you don't know if one item is "greater" than another. You just know that the files that are present in both lists will be in the same order.
For what it's worth, I previously posted this question on Ask Metafilter several years ago. Allow me to respond to several potential answers upfront.
Answer: This problem is called Longest Common Subsequence.
Response: I'm trying to avoid the longest common subsequence because simple algorithms run in O(mn) time/space and better ones are complicated and more "heuristical". My intuition tells me that there is a linear-time algorithm due to the added constraints.
Answer: Sort them alphabetically and then compare.
Response: That would be O(m log m+n log n), which is worse than O(m+n).