What is a suitable lexer generator that I can use to strip identifiers from many language source files?

I'm working on a group project for my University which is going to be used for plagiarism detection in Computer Science.

My group is primarily going off the hashing/fingerprinting techniques described in this journal article: Winnowing: Local Algorithms for Document Fingerprinting. This is very similar to how the MOSS plagiarism detection system works.

We are basically taking k-gram hashes of fellow students source code and looking them up in a database for relevant matches (along with lots of optimization in how we determine which hashes to select as a document's fingerprints).

The first aspect of our project is the "Front-End" portion of it, which will hold some semantic knowledge about each file format our detection system can process. This will allow us to strip some details from the document that we no longer want for the purpose of plagiarism detection. Basically we want to be able to rename all variables in various programming languages to a constant string or letter.

What is a lightweight solution (lexer generator or something similar) that we can use to aid in renaming all variables in different languages source code files to constants?

Our project is being written in Java.

Ideally I'd simply like to be able to define a grammar for each language and then our front end will be able rename all identifiers in that languages source file to some constant. We would then do this for each file format we wanted to support (java, c++, python, etc).

Interesting. I know that MOSS lists that it supports TCL, Lisp, Scheme, and in their paper they describe they do this front-end parsing. Perhaps they do less modification to those languages source files.

Simucal 2010-01-22 20:02:55

Its possible that they just make certain assumptions about where a variable will and will not be. For example, in the code "set b 1", you can probably assume "b" is a variable name. It doesn't "have" to be, but it's unlikely that anyone has redefined the "set" command.

RHSeeger 2010-01-23 16:10:26

ansaurus

tags:

views:

answers:

What is a suitable lexer generator that I can use to strip identifiers from many language source files?

related questions