Hi,
I am using difflib to compare files in two directories (versions from consecutive years). First, I use filecmp to find the files that have changed, and then I loop over them, using difflib.SequenceMatcher to compare each pair and generate an HTML diff as explained here.
However, the program takes too long to run, and Python sits at 100% CPU the whole time. Profiling shows that the seqm.get_opcodes() call is where all of the time goes.
Any insight would be appreciated. Thanks!
Code:
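import difflib
import os
import string

# changed_set contains the names of the files to be compared
for i in changed_set:
    # Each file is read into a single string, so the matcher works on characters
    with open(os.path.join(old_dir, i)) as f:
        oldLines = f.read()
    with open(os.path.join(new_dir, i)) as f:
        newLines = f.read()
    # Treat whitespace characters as junk when matching
    seqm = difflib.SequenceMatcher(lambda x: x in string.whitespace, oldLines, newLines)
    opcodes = seqm.get_opcodes()  # XXX: Lots of time spent in this!
    produceDiffs(seqm, opcodes)
    del seqm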
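
For context, the filecmp step mentioned above can be sketched roughly as follows. This is a minimal sketch rather than the exact code in use: find_changed_files is a hypothetical helper name, and it assumes that only files at the top level of both directories need to be compared.

```python
import filecmp


def find_changed_files(old_dir, new_dir):
    """Return the names of files present in both directories whose contents differ.

    Hypothetical helper: only the top level of each directory is examined.
    """
    # common_files lists the file names found in both directories.
    common = filecmp.dircmp(old_dir, new_dir).common_files
    # shallow=False compares file contents instead of relying on os.stat() signatures.
    match, mismatch, errors = filecmp.cmpfiles(old_dir, new_dir, common, shallow=False)
    return set(mismatch)
```

With this sketch, changed_set = find_changed_files(old_dir, new_dir) would give the set of names that the loop iterates over.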