Hi,

I am using difflib to compare files in two directories (versions from consecutive years). First, I use filecmp to find the files that have changed, and then I iterate over them with difflib.SequenceMatcher to compare each pair and generate an HTML diff, as explained here.
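For reference, the filecmp step can be sketched roughly as follows. This is only an illustration, not the exact code from the program: the helper name `find_changed_files` is made up, it only looks at the top level of the two directories, and it compares file contents (`shallow=False`) rather than relying on size and modification time.

```python
import filecmp


def find_changed_files(old_dir, new_dir):
    """Return the names of files that exist in both directories but differ.

    Hypothetical helper for illustration; it does not recurse into
    subdirectories.
    """
    # Names of regular files present in both directories.
    common = filecmp.dircmp(old_dir, new_dir).common_files
    # shallow=False forces a byte-by-byte comparison of the contents.
    _match, mismatch, _errors = filecmp.cmpfiles(
        old_dir, new_dir, common, shallow=False
    )
    return sorted(mismatch)
```

The returned list would play the role of `changed_set` in the loop further down.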

However, I find that the program takes too long to run, and Python is using 100% CPU. After profiling, I found that the seqm.get_opcodes() call is taking almost all of the time.

Any insight would be appreciated. Thanks !

Code:

import difflib
import os
import string

# changed_set contains the names of the files to be compared
for name in changed_set:
    with open(os.path.join(old_dir, name)) as f:
        old_text = f.read()
    with open(os.path.join(new_dir, name)) as f:
        new_text = f.read()
    # First argument is isjunk: whitespace characters are treated as junk
    seqm = difflib.SequenceMatcher(lambda x: x in string.whitespace, old_text, new_text)
    opcodes = seqm.get_opcodes()  # XXX: Lots of time spent in this!
    produceDiffs(seqm, opcodes)
+2  A: 

My answer is a different approach to the problem altogether: Try using a version-control system like git to investigate how the directory changed over the years.

Make a repository out of the first directory, then replace the contents with the next year's directory and commit that as a change (or move the .git directory to the next year's directory, to save on copying and deleting). Repeat for each following year.

Then run gitk, and you'll be able to see what changed between any two revisions of the tree: for binary files, just the fact that they changed; for text files, a diff.

Peter Cordes
Why not just use GNU diff instead?
ChristopheD
@ChristopheD, Git uses diff to display differences. However, it does a lot for you: it figures out which files haven't changed, and gives a diff on just the ones that have changed. Then, gitk wraps all this in a friendly GUI where you can easily browse through different revisions. This answer makes sense to me.
steveha
@PeterCordes: That is a nice solution, using Git's metadata to find out where the change was. However, it won't help me, as all the previous years' data is currently backed up on a file system and I don't have access to the CVS directly. @ChristopheD: Actually, I was using the diff command from a shell script before, but then you only get details at the line level (add/delete). With Python's difflib, you get precise information from an API about which characters were inserted, deleted, or replaced. So I switched over to difflib.
src
There are other diff programs, e.g. wdiff, which reports word differences rather than being line-oriented. You're probably close to having your Python version working, though, so maybe you should stick with that.
Peter Cordes
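
To illustrate the word-level idea from the last comment while staying within difflib, here is a rough sketch that feeds lists of words to SequenceMatcher instead of whole strings. It is not what wdiff does internally, and the function name `word_diff` is invented for this example. Since the matcher then works on far fewer elements than a character-by-character comparison, it may also run faster, but that is an assumption worth profiling on the real files.

```python
import difflib


def word_diff(old_text, new_text):
    """Return (tag, old_words, new_words) for every non-equal opcode.

    Hypothetical helper: compares whitespace-separated words rather than
    individual characters, so the original whitespace is not preserved.
    """
    old_words = old_text.split()
    new_words = new_text.split()
    # autojunk=False turns off the heuristic that treats very frequent
    # elements as junk in long sequences.
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append(
                (tag, " ".join(old_words[i1:i2]), " ".join(new_words[j1:j2]))
            )
    return changes
```

Each tuple uses the same tags as get_opcodes() ("replace", "delete", "insert"), so a routine like produceDiffs could in principle be adapted to consume it.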