I'm working on a group project for my University which is going to be used for plagiarism detection in Computer Science.
My group is primarily going off the hashing/fingerprinting techniques described in this journal article: Winnowing: Local Algorithms for Document Fingerprinting. This is very similar to how the MOSS plagiarism detection system works.
We are basically taking k-gram hashes of fellow students source code and looking them up in a database for relevant matches (along with lots of optimization in how we determine which hashes to select as a document's fingerprints).
The first aspect of our project is the "Front-End" portion of it, which will hold some semantic knowledge about each file format our detection system can process. This will allow us to strip some details from the document that we no longer want for the purpose of plagiarism detection. Basically we want to be able to rename all variables in various programming languages to a constant string or letter.
What is a lightweight solution (lexer generator or something similar) that we can use to aid in renaming all variables in different languages source code files to constants?
Our project is being written in Java.
Ideally I'd simply like to be able to define a grammar for each language and then our front end will be able rename all identifiers in that languages source file to some constant. We would then do this for each file format we wanted to support (java, c++, python, etc).