How to determine which file in the previous version a text block in the current version came from? | ansaurus

tags:

similarity

Q:

How to determine which file in the previous version a text block in the current version came from?

The problem is as follows. Suppose one version contains a list of files (say A, B, C, D), and the next version contains the files A, E, F, G. Their contents overlap in places. Each file in the later version derives from the previous version through renaming, content addition, deletion, or partial modification, or is carried over unchanged (for example, A is unchanged).

I take a block of text from a file (E, in the second version) and check which files in the first version contain it. I found that B, C, and D all contain the text fragment. I want to determine which of these files (B, C, or D) the block actually comes from. (I assume that E is a file that was renamed in the second version.)

Since content may be changed, added, or deleted in the later version, I use the LCS (longest common subsequence) algorithm to measure similarity. But I still cannot map a file to its previous version. One possible approach might be to use the location of the matched text blocks, but this heuristic does not always work. Is there any research or an existing algorithm for this? Any direction would be helpful. Thanks in advance.
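To make the similarity idea concrete, here is a minimal sketch of one possible heuristic: instead of matching only the shared block, score the whole of E against each candidate (B, C, D) and rank them. It uses Python's `difflib.SequenceMatcher`, which finds longest matching blocks rather than a strict LCS, so treat it as a stand-in. The helper names (`similarity`, `likely_origin`) and the file contents are made up for illustration.

```python
from __future__ import annotations

from difflib import SequenceMatcher


def similarity(old_text: str, new_text: str) -> float:
    """Return a 0..1 line-level similarity score between two file contents."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    # autojunk=False stops difflib from discarding very frequent lines
    # in large inputs, which could otherwise distort the score.
    return SequenceMatcher(None, old_lines, new_lines, autojunk=False).ratio()


def likely_origin(new_text: str, candidates: dict[str, str]) -> list[tuple[str, float]]:
    """Rank candidate old files by similarity to the new file, best first."""
    scores = [(name, similarity(text, new_text)) for name, text in candidates.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical contents: all three old files contain the same shared block.
shared = "def helper():\n    return 42\n"
old_files = {
    "B": "import os\n" + shared + "print(os.getcwd())\n",
    "C": shared + "x = 1\ny = 2\nz = 3\nw = 4\n",
    "D": "a = 0\nb = 0\nc = 0\nd = 0\ne = 0\n" + shared,
}
# E is B plus one appended line, i.e. a renamed and lightly edited copy of B.
new_e = "import os\n" + shared + "print(os.getcwd())\nprint('done')\n"

ranking = likely_origin(new_e, old_files)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy data B ranks first, because the context around the shared block also matches E. When several candidates score almost the same, whole-file similarity alone cannot settle the question and extra signals (such as block position or file names) would be needed.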

A:

I think it may be helpful to take a look at Subversion and its ability to track file renames between versions: http://svnbook.red-bean.com/

It's tried and tested, since so many developers use it. The rename has to be done with Subversion tools, though, and there are many of them (command line, file explorer integration for different operating systems, GUIs, IDEs, you name it). It also covers moving files between directories and merging several lines of change (branches).

Chris Lercher 2010-03-20 21:26:44
