I have been reading and tracking some questions on code reuse, and I have this question:

Are there any tools to identify duplicate or similar code?

I googled this a while ago and found nothing good.

+6  A: 

Simian

Simian - Similarity Analyser

Purpose

Simian (Similarity Analyser) identifies duplication in Java, C#, C, C++, COBOL, Ruby, JSP, ASP, HTML, XML, Visual Basic, Groovy source code and even plain text files. In fact, Simian can be used on any human-readable files such as ini files, deployment descriptors, you name it.

Especially on large enterprise projects, it can be difficult for any one developer to keep track of all the features (classes, methods, etc.) of the system.

Marc Gravell
holy cow they want a lot of money!
Malfist
A: 

For Java, there is JTest which can do code duplication detection.

Elie
+3  A: 

TeamCity's code quality features include Code Coverage, Inspections and Duplicates Search.

I use TeamCity personally and I really like it. It supports .NET and Java.

Tomh
A: 

There is a tool for Python and Java: http://clonedigger.sourceforge.net/

bialix
+11  A: 

For .NET, you can get CloneDetective; it's a free plugin for VS. C# only, but the underlying technology supports various languages.

leppie
Thanks Shog :) Corporate internet too slow to Google!
leppie
Not a problem. Thanks for posting this, ConQAT looks interesting...
Shog9
+2  A: 

See our clone detector (http://www.semdesigns.com/Products/Clone/index.html), which works for C, C++, C#, Java, COBOL, VB6, PHP and many other languages. It finds exact and near-miss clones, so it will detect clones that have been parameterized by editing.

It works by matching language structures, not text lines or tokens, so the reported clones look like code structures. Line-based clone detection can't match clones that have been reformatted, have white space changes, or in which the comments have changed. Token-based detectors often find clones which make no sense, such as

    }  {

which occur huge numbers of times in the text, but are clones only in the dumbest sense of the word.

See an example of detected clones. There are several other clone detector reports for various languages there.

EDIT 3/25/2010: ... now does Python ...

EDIT 8/5/2010: ... now does EGL ...

EDIT 10/22/2010: ... now does VBScript and VB.net ...

Ira Baxter
+1  A: 

Same (http://sourceforge.net/projects/same/) is extremely plain, but it works on text lines instead of tokens, which is useful if you're using a language that isn't supported by one of the fancier clone finders.
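To give a feel for what matching on text lines means, here is a minimal sketch in Python. It is not Same's actual implementation; the function name `find_duplicate_windows`, the `window` parameter and the sample text are all made up for illustration. It reports every run of three consecutive non-blank lines that appears more than once, comparing lines after stripping surrounding whitespace.

```python
from collections import defaultdict


def find_duplicate_windows(text, window=3):
    """Map each repeated run of `window` non-blank lines to its start line numbers."""
    # Keep 1-based line numbers; strip surrounding whitespace and skip blank lines.
    numbered = [(no, line.strip()) for no, line in enumerate(text.splitlines(), start=1)]
    numbered = [(no, line) for no, line in numbered if line]

    seen = defaultdict(list)
    for i in range(len(numbered) - window + 1):
        chunk = tuple(line for _, line in numbered[i:i + window])
        seen[chunk].append(numbered[i][0])

    # Only runs that occur in more than one place count as duplicates.
    return {chunk: starts for chunk, starts in seen.items() if len(starts) > 1}


# Hypothetical input: the same three-line loop pasted twice.
sample = """\
total = 0
for x in items:
    total += x
print(total)

count = 0
total = 0
for x in items:
    total += x
print(count)
"""

for chunk, starts in find_duplicate_windows(sample).items():
    print("duplicate run starting at lines", starts)
    for line in chunk:
        print("    " + line)
```

Because it only compares text, a sketch like this works on any language, but a renamed variable or a reflowed statement is enough to hide a clone from it.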

Sean McMillan
True, but this will only find at best exact clones. Most of the interesting ones are those that have been edited ("copy-paste-edit").
Ira Baxter
Well, sure. But in my experience, copy-paste-edit is copy-paste 20 lines, and edit 1. Same will point you in the right direction there. Certainly, use a tokenizing tool if you've got one, but it's a heck of a lot better than moaning that pmd/cpd doesn't parse Visual Basic.
Sean McMillan
When you run clone detection on a million lines, you end up with 5000 detected sets of parameterized (near-miss) clones with a small number of "edits" in the middle. For the sake of argument, let's agree to limit this to one. Then each clone detected by a parameterized detector will show up in an exact detector as two identical clones very near each other. You'll end up looking at 10,000 clones. So yes, an exact matcher is better than none, but if you want to minimize the time spent analyzing them, find one that does parameters.
Ira Baxter
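The arithmetic in the comment above can be checked on a toy case. This is only a sketch using Python's standard `difflib`, not any of the tools discussed here, and the seven code lines are invented. A block copied with one edited line in the middle is seen by an exact matcher as two separate identical runs, which is where the doubling from 5000 to 10,000 comes from.

```python
from difflib import SequenceMatcher

# Hypothetical original block (contents are made up for illustration).
original = [
    "conn = open_connection(host)",
    "cursor = conn.cursor()",
    "cursor.execute(query)",
    "rows = cursor.fetchall()",
    "log_rows(rows)",
    "cursor.close()",
    "conn.close()",
]

# Copy-paste-edit: only the fourth line is changed in the copy.
edited = list(original)
edited[3] = "rows = cursor.fetchmany(100)"

matcher = SequenceMatcher(None, original, edited, autojunk=False)
# get_matching_blocks() ends with a zero-length sentinel, so filter it out.
exact_runs = [m for m in matcher.get_matching_blocks() if m.size > 0]

for m in exact_runs:
    print(f"exact run of {m.size} lines at original[{m.a}] / edited[{m.b}]")

# With one edit per near-miss clone, 5000 clone sets become this many exact ones.
print(5000 * len(exact_runs))
```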
How is this better than duploc?
Brian Carlton
duploc is not mentioned here, and Googling it doesn't seem to give a sensible result. Do you have a URL for it?
Sean McMillan
+1  A: 

I have written a duplication detector. It is written in Python and based on the Pygments lexers, so it works on all languages supported by Pygments. Check the Thinking Craftsman Toolkit. A setup/install package is not available yet; you have to get the source from svn. See if it works for you.

Nitin Bhide
A: 

Good complements to PMD are CheckStyle and JDepend.

Kelly French
+1  A: 

Check out CCFinder. It has an interesting graphical user interface. It shows you your duplicate code in an interactive scatter plot.

Kurt W. Leucht
A: 

If you need a good tool, you have to look for something that detects similar code and not just perfect (i.e., identical) matches. Such a tool should:

  • cope with different formatting of the source code (white spaces and comments);
  • recognize identifier renaming (e.g., adjusting variable names);
  • allow insertion/deletion of lines of code across the duplicated code snippets.
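As a rough illustration of the first two requirements, here is a sketch using Python's standard `tokenize` module. It says nothing about how SolidSDD or any other tool works internally; `normalize` and the sample snippets are hypothetical. Comments and layout are dropped and every identifier is replaced by a placeholder, so two fragments that differ only in formatting and naming produce the same token sequence. Tolerating inserted or deleted lines, the third requirement, needs approximate sequence matching on top of this and is not shown.

```python
import io
import keyword
import tokenize

# Token types that carry only comments or layout, not code structure.
DROPPED = {tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER}
# Token types whose text varies with formatting; keep the type name instead.
STRUCTURAL = {tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT}


def normalize(source):
    """Return a formatting- and naming-independent token list for Python source."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in DROPPED:
            continue
        if tok.type in STRUCTURAL:
            out.append(tokenize.tok_name[tok.type])
        elif tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")  # every identifier looks the same after renaming
        else:
            out.append(tok.string)
    return out


# Hypothetical fragments: same structure, different names, spacing and comments.
first = (
    "def total(items):\n"
    "    s = 0  # running sum\n"
    "    for x in items:\n"
    "        s += x\n"
    "    return s\n"
)
second = (
    "def add_all( values ):\n"
    "  acc = 0\n"
    "\n"
    "  for v in values:\n"
    "      acc += v\n"
    "  return acc\n"
)
unrelated = "def total(items):\n    return len(items)\n"

print(normalize(first) == normalize(second))
print(normalize(first) == normalize(unrelated))
```

Replacing every identifier with the same placeholder is deliberately crude: it also matches fragments that use their variables in inconsistent ways, which a real tool would want to tell apart.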

The tool I recommend for the job is the Source Code Duplication Detector (SolidSDD). Via the included visualization and reporting features it makes the detection results relevant not only for developers, but also for architects and managers.

Lucian Voinea
A: 

Perhaps you could use MOSS to determine similar parts of your program.

ceretullis
+1  A: 

While not its primary usage, PyLint can report possible duplicated code:

Similarities checker

checks for similarities and duplicated code. This computation may be memory / CPU intensive, so you should disable it if you experience problems.

Options

min-similarity-lines:
    Minimum lines number of a similarity. Default: 4
ignore-comments:
    Ignore comments when computing similarities. Default: yes
ignore-docstrings:
    Ignore docstrings when computing similarities. Default: yes
dbr