ansaurus

Question

Python SequenceMatcher Overhead - 100% CPU utilization and very slow processing

Answer 1

+2 A:

My answer is a different approach to the problem altogether: try using a version-control system such as git to investigate how the directory changed over the years.

Make a repository out of the first year's directory, then replace its contents with the next year's directory and commit that as a change. (Alternatively, move the .git directory into the next year's directory to save on copying and deleting.) Repeat for each year.

Then run gitk, and you'll be able to see what changed between any two revisions of the tree. For binary files it only reports that the file changed; for text files it shows a diff.

Peter Cordes 2009-12-08 23:43:18
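A minimal sketch of the snapshot-per-year idea above, in Python. It is not from the answer itself: the function names `snapshot_commands` and `run_all`, the repository name `history.git`, and the `backup/2007`-style paths are all hypothetical. Instead of moving the .git directory around, this variant keeps one bare repository and points git's `--work-tree` option at each year's directory in turn, which should have the same effect without copying. It assumes git is installed and that a commit identity (user.name, user.email) is already configured.

```python
import os
import subprocess


def snapshot_commands(git_dir, year_dirs):
    """Build the git commands that commit each directory as one revision.

    git_dir   -- where the (bare) repository should live; hypothetical name
    year_dirs -- the yearly backup directories, oldest first
    """
    git_dir = os.path.abspath(git_dir)
    commands = [["git", "init", "--bare", git_dir]]
    for year_dir in year_dirs:
        year_dir = os.path.abspath(year_dir)
        # Same repository every time, but a different working tree per year.
        base = ["git", f"--git-dir={git_dir}", f"--work-tree={year_dir}"]
        # Stage additions, modifications and deletions relative to last year.
        commands.append(base + ["add", "-A"])
        # --allow-empty keeps the history complete even if nothing changed.
        message = f"Snapshot of {os.path.basename(year_dir)}"
        commands.append(base + ["commit", "--allow-empty", "-m", message])
    return commands


def run_all(commands):
    """Execute the commands in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Calling `run_all(snapshot_commands("history.git", ["backup/2007", "backup/2008"]))` would then leave one commit per year, which can be browsed as described above.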

Why not just use GNU diff instead?

ChristopheD 2009-12-08 23:44:35

@ChristopheD, Git uses diff to display differences. However, it does a lot for you: it figures out which files haven't changed, and gives a diff on just the ones that have changed. Then, gitk wraps all this in a friendly GUI where you can easily browse through different revisions. This answer makes sense to me.

steveha 2009-12-08 23:51:58

@PeterCordes: That is a nice solution, using git's metadata to get info about where the change was. However, it won't help me, as currently all the previous years' data is backed up in a file system and I don't have access to the CVS directly.

@ChristopheD: Actually, I was using the diff command from a shell script before, but then you only get details at the line level (add/delete). With Python's difflib, you get precise info from an API about which characters were inserted, deleted, or replaced. So I switched over to Python's difflib.

src 2009-12-09 12:54:32
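To illustrate the character-level detail mentioned in the comment above, here is a small sketch using `difflib.SequenceMatcher.get_opcodes()`. The helper name `char_changes` is hypothetical, and passing `autojunk=False` is a choice made for this example (it assumes Python 3.2 or later), not something taken from the thread.

```python
from difflib import SequenceMatcher


def char_changes(old, new):
    """Return (tag, old_text, new_text) for every non-equal opcode.

    tag is one of 'insert', 'delete' or 'replace', as reported by difflib.
    """
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # old[i1:i2] is what was removed, new[j1:j2] is what replaced it.
            changes.append((tag, old[i1:i2], new[j1:j2]))
    return changes


print(char_changes("hello", "hallo"))
```

Each tuple names the operation and the exact characters involved, which is the kind of information a line-oriented diff does not give directly.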

There are other diff programs, e.g. wdiff, which reports word differences rather than being line-oriented. You're probably close to having your Python version working, though, so maybe you should stick with that.

Peter Cordes 2009-12-09 18:43:13
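For comparison, a word-oriented diff can also be approximated while staying with difflib, by feeding it lists of words instead of strings. This is only a sketch of that idea, not a reimplementation of wdiff: the helper name `word_changes` is hypothetical, and splitting on whitespace discards the original spacing.

```python
from difflib import SequenceMatcher


def word_changes(old, new):
    """Return (tag, old_words, new_words) for every changed run of words."""
    a, b = old.split(), new.split()
    # SequenceMatcher accepts any sequences of hashable items, not just strings.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


print(word_changes("the quick brown fox", "the slow brown fox"))
```

Because a text has fewer words than characters, the matcher has fewer elements to compare in this mode.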
