Hi, I am doing a project on software plagiarism detection, targeting the C language. For that I need to create a token generator and a parser, but I don't know where to start. Can anyone help me out with this?

I created a database of tokens and separated the tokens from my program. The next thing I want to do is compare two programs to find out whether one is plagiarized from the other. For that I need to create a syntax analyzer, and I don't know where to start.

That is, I want to create a parser for C programs in Python.

+2  A: 

If you want to create a parser in Python, you can look at these libraries:
PLY
pyparsing
Lepl (new, but very powerful)

rubik
These are a good idea only if the OP defines a very simple model of C, which might be OK for an academic project.
Ira Baxter
+1  A: 

Building a real C parser by yourself is a really big task.

I suggest you either find one that already exists, e.g. pycparser, or define a really simple subset of C that is easily parsed.

You'll have plenty of work to do for your plagiarism detector after you are done parsing C.
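
To give a feel for the "simple subset" route, here is a minimal sketch of a token generator using only Python's standard `re` module. The token classes, the `KEYWORDS` set, and the `tokenize` name are my own assumptions for illustration; this is not a complete C lexer (no preprocessor directives, character literals, or hex/float suffixes), just a starting point.

```python
import re

# Hypothetical keyword list for a small C subset (assumption, not full C).
KEYWORDS = {"int", "char", "float", "double", "void", "if", "else",
            "while", "for", "return", "struct"}

# One named group per token class; alternatives are tried left to right.
TOKEN_RE = re.compile(r"""
    (?P<COMMENT>/\*.*?\*/|//[^\n]*)      # block and line comments
  | (?P<NUMBER>\d+(?:\.\d+)?)            # integer or simple decimal literal
  | (?P<STRING>"(?:\\.|[^"\\])*")        # double-quoted string literal
  | (?P<IDENT>[A-Za-z_]\w*)              # identifier or keyword
  | (?P<OP>==|!=|<=|>=|&&|\|\||\+\+|--|->|[-+*/%=<>!&|^~?:;,.(){}\[\]])
  | (?P<WS>\s+)                          # whitespace, skipped below
  | (?P<ERROR>.)                         # anything this subset does not know
""", re.VERBOSE | re.DOTALL)


def tokenize(source):
    """Yield (kind, text) pairs, skipping whitespace and comments."""
    for m in TOKEN_RE.finditer(source):
        kind, text = m.lastgroup, m.group()
        if kind in ("WS", "COMMENT"):
            continue
        if kind == "ERROR":
            raise ValueError(f"unexpected character {text!r} at offset {m.start()}")
        if kind == "IDENT" and text in KEYWORDS:
            kind = "KEYWORD"
        yield kind, text


if __name__ == "__main__":
    for token in tokenize("int x = 42; /* answer */ return x;"):
        print(token)
```

Because comments and whitespace are dropped here, two programs that differ only in layout or commentary produce the same token stream, which is usually what you want as input to a comparison step.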

Ira Baxter
+1 - parsing is the easy part.
Corbin March
Ira Baxter
A: 

I'm not sure you need to parse the token stream to detect the features you're looking for. In fact, parsing will probably complicate things more than it helps.

What you're really looking for is sequences of original source code that have a very strong similarity to the suspect sample being tested. This sounds very similar to the purpose of a Bayes classifier, like those used in spam filtering and language detection.
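
As a rough illustration of comparing sequences without a parser, here is a small sketch. Note that it is not a Bayes classifier; it is a simpler stand-in that measures the overlap (Jaccard similarity) of token n-grams after replacing identifiers and numbers with placeholders. The function names, the keyword list, and the choice of n=3 are assumptions made for the example, and the crude regex tokenizer does not strip comments or handle string literals.

```python
import re

# Hypothetical keyword list; anything else that looks like a name becomes "ID".
KEYWORDS = {"int", "char", "float", "void", "if", "else", "while", "for", "return"}


def normalized_tokens(source):
    """Split source into rough tokens; map identifiers to ID and numbers to NUM."""
    out = []
    for tok in re.findall(r"[A-Za-z_]\w*|\d+|\S", source):
        if tok[0].isdigit():
            out.append("NUM")
        elif (tok[0].isalpha() or tok[0] == "_") and tok not in KEYWORDS:
            out.append("ID")
        else:
            out.append(tok)
    return out


def ngrams(tokens, n=3):
    """Return the set of all runs of n consecutive tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def similarity(src_a, src_b, n=3):
    """Jaccard similarity of token n-gram sets, between 0.0 and 1.0."""
    a = ngrams(normalized_tokens(src_a), n)
    b = ngrams(normalized_tokens(src_b), n)
    if not a or not b:
        return 0.0  # too little code to compare
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    original = "int sum(int a, int b) { return a + b; }"
    suspect = "int   add(int x,\n int y){return x+y;}"
    print(similarity(original, suspect))
```

Renaming variables and reformatting do not change the score in this scheme, but reordering statements or inserting dead code will lower it, so treat the number as a hint for manual review rather than a verdict.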

TokenMacGuy
Depends on the purpose of his detector. If you want good answers for plagiarism on C source code, you need to do this in a way that is independent of formatting. Comparing "lines of text" won't do this, so you need a unit of comparison that isn't lines. Tokens are a useful granularity for this. Better still are abstract syntax trees, which is what the OP appears to be fishing for; see my answer for a reference to a technical paper that does exactly this.
Ira Baxter