views:

290

answers:

8

If I wanted to learn about pattern recognition in general what would be a good place to start (recommend a book)?

Also, does anybody have any experience/knowledge on how to go about applying these algorithms to find abstraction patterns in programs? (repeated code, chunks of code that do the same thing, but in slightly different ways, etc.)

Thanks

Edit: I don't mind mathematically intensive books. In fact, that would be a good thing.

A: 

I'd suggest looking at the code of some open source project (e.g. FindBugs or SIM) that does the kind of thing you're talking about.

joel.neely
A: 

If you're working in one of the supported languages, IntelliJ idea has a really smart structural search and replace that would fit your problem.

krosenvold
A: 

Other interesting projects are PMD and Eclipse.

Eclipse uses AST (abstract syntax trees) for all source code in any project. Tools can then register for certain types of ASTs (like Java source) and get a preprocessed view where they can add additional information (like links to documentation, error markers, etc).

Aaron Digulla
+1  A: 

It helps if you have access to the parse tree generated during compilation. This way you can look for pieces of the tree which are similar, ignoring the nodes which are deeper than what you are looking at, this way you can pick out e.g. nodes which multiply together two sub-expressions, ignoring the contents of the sub-expressions. You can apply the same logic to a collection of nodes, e.g. you want to find a multiplication of two sub-expressions where those two sub-expressions are additions of more sub-expressions. You first look for multiplies, then check if the two nodes underneath the multiply are additions, ignoring anything any deeper.

jheriko
"Look for pieces of the tree which are similar". Its one thing to say this, quite another to implement it effectively. See CloneDR answer, which does exactly this.
Ira Baxter
+2  A: 

If you are reasonably mathematically confident then either of Chris Bishop's books "Pattern Recognition and Machine Learning" or "Neural Networks for Pattern Recognition" are very good for learning about pattern recognition.

Ian G
I don't think these would be very effective on source code, especially large systems. How many inputs does a neural net need to inspect a million lines of code? How many output nodes would it have, and what would exatly would they recognize? See CloneDR answer for a practical tool that does this well (by not using neural nets).
Ira Baxter
Quite possibly true, the representation problem is not going to be trivial. But on the other hand using the product that you sell isn't going to help the asker learn about general pattern recognition. :-)
Ian G
The representation problem on this kind of scale seems to me frankly unsolvable. My neurons can't do this. Regarding learning how an existing tool, CloneDR, does it, he can learn that by reading the technical paper about the tool at the web site.
Ira Baxter
A: 

Another project you can look into is Duplo - it's an open-source/GPL project, so you can pore over their approach by grabbing the code from SourceForge.

mskfisher
A: 

This is specific to .Net and visual studio, but it finds duplicate code in your project. It does report some false positives I've found but it could be a good place to start.

Clone Detective

TWith2Sugars
A: 

One kind of pattern is code that has been cloned by copy and paste methods. See CloneDR for a tool that automatically finds such code in spite of variations in layout and even changes in the body of the clone, by comparing abstract syntax trees for the language in question.

CloneDR works with a variety of langauges: C, C++, C#, Java, JavaScript, PHP, COBOL, Python, ... The website shows clone detection reports for a variety of programming languages.

Ira Baxter