The problem is as follows. Suppose one version contains the files A, B, C, D, and the next version contains A, E, F, G. The contents of the two versions overlap. Each file in the later version derives from a file in the earlier version through renaming, content addition, deletion, partial modification, or no change at all (for example, A is unchanged).

I take a block of text from a file in the second version (E) and check which files in the first version contain it. Suppose B, C, and D all contain the fragment. I want to determine which of them (B, C, or D) the block actually comes from. (I assume E is a file that was renamed in the second version.)

Since content may be changed, added, or deleted in the later version, I use the LCS (longest common subsequence) algorithm to measure similarity. But I still cannot map a file to its previous version. One possible approach is to use the location of the matched text blocks, but this heuristic does not always work. Is there any research or an existing algorithm for this? Any direction would be helpful. Thanks in advance.
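To make the LCS-based comparison concrete, here is a minimal sketch of one way it could be set up: score each candidate file from the first version against the whole of E, line by line, and take the highest-scoring candidate as the likely predecessor. The function names, the normalization `2 * LCS / (len(old) + len(new))`, and the sample file contents are all assumptions made for illustration, not part of the original setup.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(old_text, new_text):
    """Line-based LCS similarity in [0, 1] (assumed normalization)."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    total = len(old_lines) + len(new_lines)
    if total == 0:
        return 1.0  # two empty files are treated as identical
    return 2.0 * lcs_length(old_lines, new_lines) / total


def best_predecessor(new_text, old_files):
    """Return the old file name with the highest similarity, plus all scores."""
    scores = {name: similarity(text, new_text) for name, text in old_files.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical contents: every old file shares at least one line with E.
old_files = {
    "B": "alpha\nbeta\ngamma\ndelta",
    "C": "beta\nzeta",
    "D": "gamma\nomega\nsigma",
}
new_e = "alpha\nbeta\ngamma\nepsilon"

best, scores = best_predecessor(new_e, old_files)
print(best, scores)
```

Comparing whole files rather than a single block is what separates the candidates here: a block that occurs in B, C, and D cannot distinguish them on its own, but the rest of E usually resembles only one of them. The dynamic program is quadratic in the number of lines, so for large files a preliminary filter would probably be needed.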

A: 

I think it may be helpful to take a look at Subversion and its ability to track file renames between versions: http://svnbook.red-bean.com/

It's tried and tested, since so many developers use it. The rename has to be done with Subversion tools, though; there are many of them (command line, file explorer integration for different OSes, GUIs, IDEs, you name it). It also covers moving files between directories and merging several lines of changes (branches).

Chris Lercher