Does anyone have a tool or a recommended practice for finding a piece of code that is similar to some other code?

I often write a function or a code fragment and remember that I have already written something like it before. I would like to reuse the previous implementation, but a plain text search finds nothing, because I did not use exactly the same variable names.

Similar code fragments lead to unnecessary code duplication, but with a large code base it is impossible to keep all of the code in memory. Are there any tools that analyze the code and mark fragments or functions that are "similar" in terms of functionality?

Consider the following examples:

  float xDistance = 0, zDistance = 0;
  if (camPos.X()<xgMin) xDistance = xgMin-camPos.X();
  if (camPos.X()>xgMax) xDistance = camPos.X()-xgMax;
  if (camPos.Z()<zgMin) zDistance = zgMin-camPos.Z();
  if (camPos.Z()>zgMax) zDistance = camPos.Z()-zgMax;
  float dist = sqrt(xDistance*xDistance+zDistance*zDistance);

and

  float distX = 0, distZ = 0;
  if (cPos.X()<xgMin) distX = xgMin-cPos.X();
  if (cPos.X()>xgMax) distX = cPos.X()-xgMax;
  if (cPos.Z()<zgMin) distZ = zgMin-cPos.Z();
  if (cPos.Z()>zgMax) distZ = cPos.Z()-zgMax;
  float dist = sqrt(distX*distX +distZ*distZ);
A: 

You can use regex searches, which are available in every good text editor and modern IDE.

fbinder
What regex would you suggest to match the two codes I have provided?
Suma
It depends on the editor that you are using. Usually it's something like: float <letter or digit group> = 0, <letter or digit group> = 0; <new line>, etc.
fbinder
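
As a rough illustration of the pattern described in the comment above, here is a minimal sketch using std::regex. The function name isZeroFloatPairDeclaration is hypothetical, and the pattern only covers the first line of the two fragments from the question; matching a whole fragment would need one such pattern per line, which becomes brittle quickly.

```cpp
#include <regex>
#include <string>

// Returns true if the line declares two floats initialised to 0,
// whatever the two variables are called.
// (Hypothetical helper, written only to illustrate the idea.)
bool isZeroFloatPairDeclaration(const std::string& line) {
    // [A-Za-z_]\w* stands for "any identifier"; \s* tolerates spacing differences.
    static const std::regex pattern(
        R"(^\s*float\s+[A-Za-z_]\w*\s*=\s*0\s*,\s*[A-Za-z_]\w*\s*=\s*0\s*;\s*$)");
    return std::regex_match(line, pattern);
}
```

Called on "float xDistance = 0, zDistance = 0;" and on "float distX = 0, distZ = 0;" it accepts both, because the identifiers are matched by shape rather than by name.
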
A: 

I can't offer any help with searching through your existing code, but to change your practice you could use one of the many 'Code Snippets' repositories already available.

Have a look at snippets.dzone.com, for instance. If you don't want an online one, there are also desktop applications available.

willcodejavaforfood
A: 

It is possible to detect similar pieces of code automatically, although I don't know of any products that do this. One could definitely write an algorithm to do it.

It would have to break the code up into its elemental objects and compare the resulting structures.
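
A minimal sketch of that idea, assuming a very crude notion of "elemental objects": every identifier is replaced by a numbered placeholder in order of first appearance and whitespace is dropped, so fragments that differ only in naming and spacing normalise to the same string. The function name normalizeIdentifiers is hypothetical, and a real tool would use a proper lexer or parser for the language instead of this character scan.

```cpp
#include <cctype>
#include <string>
#include <unordered_map>

// Rewrites each identifier as $0, $1, ... in order of first appearance and
// removes whitespace. Keywords are treated like any other identifier, which
// is harmless here because they are renamed consistently in both fragments.
std::string normalizeIdentifiers(const std::string& code) {
    std::unordered_map<std::string, std::string> placeholders;
    std::string out;
    std::size_t i = 0;
    while (i < code.size()) {
        unsigned char c = static_cast<unsigned char>(code[i]);
        if (std::isalpha(c) || c == '_') {
            // Consume one identifier and map it to its placeholder.
            std::size_t start = i;
            while (i < code.size() &&
                   (std::isalnum(static_cast<unsigned char>(code[i])) || code[i] == '_')) {
                ++i;
            }
            std::string name = code.substr(start, i - start);
            auto it = placeholders.find(name);
            if (it == placeholders.end()) {
                std::string tag = "$" + std::to_string(placeholders.size());
                it = placeholders.emplace(name, tag).first;
            }
            out += it->second;
        } else if (std::isdigit(c)) {
            // Copy numeric literals unchanged.
            while (i < code.size() &&
                   (std::isalnum(static_cast<unsigned char>(code[i])) || code[i] == '.')) {
                out += code[i++];
            }
        } else if (std::isspace(c)) {
            ++i;  // Spacing and line breaks do not affect the result.
        } else {
            out += code[i++];  // Operators and punctuation are kept as-is.
        }
    }
    return out;
}
```

Two fragments can then be compared with normalizeIdentifiers(a) == normalizeIdentifiers(b). This only catches copies that differ in names and layout; reordered statements or inserted lines would need a structural comparison.
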

Jonathan van de Veen
A: 

It seems to me this has been already asked and answered several times:

http://stackoverflow.com/questions/204177/what-tool-to-find-code-duplicates-in-c-projects

http://stackoverflow.com/questions/191614/how-to-detect-code-duplication-during-development

I suggest closing as duplicate here.


Actually I think it is a more general search problem, like: How do I search if the question was already asked on StackOverflow?

Suma
You have the same question listed twice, there.
Sebastian Krog
@Sebastian Removed!
Shoban
So this question, "How to find similar code", is an example of the meta question, "How to find similar questions" :-?
Ira Baxter
+3  A: 

You can use Simian. It is a tool that detects duplicate code in Java, C#, C++, XML, and many more (even plain text files). It also integrates nicely with tools like CruiseControl.

Razzie
+1  A: 

CloneDR finds duplicate code, both exact copies and near-misses, across large source systems, parameterized by language syntax. It supports Java, C#, COBOL, C++, PHP, Python and many other languages.

It accepts a number of parameters to define "What is a clone?", including:
a) the similarity threshold, controlling how similar two blocks of code must be to be declared clones (typically 95% is good)
b) the minimum clone size in lines (3 tends to be a good choice)
c) the number of parameters, i.e. distinct changes to the text (5 tends to be a good choice)
With these settings, it tends to find 10-15% redundant code in virtually everything it processes.
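
To make the notion of a similarity threshold concrete, here is a minimal sketch. It is not how CloneDR computes similarity; it is an assumed stand-in that scores two token sequences by their longest common subsequence (LCS) and compares the score with a threshold. The names splitTokens, similarity and isClone are hypothetical.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

// Splits on whitespace; a real tool would use a language-aware lexer.
std::vector<std::string> splitTokens(const std::string& code) {
    std::istringstream in(code);
    std::vector<std::string> tokens;
    for (std::string t; in >> t;) tokens.push_back(t);
    return tokens;
}

// Similarity in [0, 1], computed as 2 * LCS(a, b) / (|a| + |b|).
double similarity(const std::vector<std::string>& a,
                  const std::vector<std::string>& b) {
    if (a.empty() && b.empty()) return 1.0;
    // lcs[i][j] = LCS length of the first i tokens of a and first j tokens of b.
    std::vector<std::vector<std::size_t>> lcs(
        a.size() + 1, std::vector<std::size_t>(b.size() + 1, 0));
    for (std::size_t i = 1; i <= a.size(); ++i) {
        for (std::size_t j = 1; j <= b.size(); ++j) {
            lcs[i][j] = (a[i - 1] == b[j - 1])
                            ? lcs[i - 1][j - 1] + 1
                            : std::max(lcs[i - 1][j], lcs[i][j - 1]);
        }
    }
    return 2.0 * lcs[a.size()][b.size()] / (a.size() + b.size());
}

// Declares two fragments clones when their similarity reaches the threshold.
// The 0.95 default mirrors the "typically 95%" figure quoted above.
bool isClone(const std::string& x, const std::string& y, double threshold = 0.95) {
    return similarity(splitTokens(x), splitTokens(y)) >= threshold;
}
```

Lowering the threshold reports more near-misses at the cost of more false positives, which is the trade-off the threshold parameter controls.
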

Line-oriented clone detection tools such as Simian can't find cloned code that has been reformatted, but CloneDR will. They may tell you that two blocks of code match, but they usually don't show you exactly how they match or where the differences are; CloneDR will. They don't suggest how to abstract the cloned code; CloneDR will.

By virtue of having weaker matching algorithms, they tend to produce more false positives; when you get 5000 clones reported across a million lines, the number of false positives matters a lot.

Based on your example, I'd expect it to find those two fragments (you don't have to point it to either one) and note that they are similar if you abstract away the variable names.

Ira Baxter
A: 

The two instances of similar code you gave as an example are typical of what happens during "cut'n paste" development. If the statements are also split across multiple lines, with some code inserted in between and maybe some additional comments too, the similarities may be difficult even for a human reader to spot. However, some tools can cope with this. I recommend the Source Code Duplication Detector (SolidSDD).

Lucian Voinea