views:

56

answers:

3

I'm working on a group project for my University which is going to be used for plagiarism detection in Computer Science.

My group is primarily going off the hashing/fingerprinting techniques described in this journal article: Winnowing: Local Algorithms for Document Fingerprinting. This is very similar to how the MOSS plagiarism detection system works.

We are basically taking k-gram hashes of fellow students source code and looking them up in a database for relevant matches (along with lots of optimization in how we determine which hashes to select as a document's fingerprints).

The first aspect of our project is the "Front-End" portion of it, which will hold some semantic knowledge about each file format our detection system can process. This will allow us to strip some details from the document that we no longer want for the purpose of plagiarism detection. Basically we want to be able to rename all variables in various programming languages to a constant string or letter.

What is a lightweight solution (lexer generator or something similar) that we can use to aid in renaming all variables in different languages source code files to constants?

Our project is being written in Java.

Ideally I'd simply like to be able to define a grammar for each language and then our front end will be able rename all identifiers in that languages source file to some constant. We would then do this for each file format we wanted to support (java, c++, python, etc).

+3  A: 

For a lexer/parser generator, you should look at ANTLR. TXL, which is a textual transformation interpreter, is also worth a look. Ready-made grammars should be available for both.

Mick Sharpe
ANTLR is really nice and has great tooling: http://www.antlr.org/works
Fabian Steeg
A: 

Be aware that there are some languages where it's not really possible to do what you're trying to do. Specifically, those where it's not possible to determine what is or isn't a variable based on the grammar. Tcl is an example of such, but there are a number of dynamic languages that have the same issue (Lisp?).

RHSeeger
Interesting. I know that MOSS lists that it supports TCL, Lisp, Scheme, and in their paper they describe they do this front-end parsing. Perhaps they do less modification to those languages source files.
Simucal
Its possible that they just make certain assumptions about where a variable will and will not be. For example, in the code "set b 1", you can probably assume "b" is a variable name. It doesn't "have" to be, but it's unlikely that anyone has redefined the "set" command.
RHSeeger
+1  A: 

Apart from ANTLR, which was already suggested, you can also take a look at JFlex.

MAK