ansaurus

Question

How can I keep track of original character positions in a string across transformations?

Answer 1

+1 A:

ANTLR lexers keep track of each token's position in the source stream.

1. Move comments and whitespace to the hidden channel.
2. Set the Text property of identifier tokens to "V".
3. Run your rolling hash against a CommonTokenStream, looking at the Text property of each token.

With the tokens intact from start to end, you'll have the mapping preserved as well.

280Z28 2010-01-25 07:48:08
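
To make the idea above concrete without depending on ANTLR, here is a minimal Python sketch. It is a stand-in, not ANTLR's API: the regex tokenizer, the `KEYWORDS` set, and the helper names (`tokenize`, `normalized_with_offsets`, `kgram_source_span`) are all hypothetical. Each token carries its original start and end offsets, so any k-gram of the normalized text can be mapped back to a span of the source. The span is token-granular: it covers every token the k-gram touches.

```python
import re
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    text: str   # normalized text that gets hashed
    start: int  # offset of the first character in the original source
    end: int    # offset one past the last character in the original source


# Assumed keyword list for illustration; a real grammar would define these.
KEYWORDS = {"public", "class", "int", "static", "void", "return"}

TOKEN_RE = re.compile(r"""
    (?P<skip>\s+|//[^\n]*|/\*.*?\*/)   # whitespace and comments ("hidden channel")
  | (?P<ident>[A-Za-z_]\w*)            # identifiers and keywords
  | (?P<other>.)                       # any other single character
""", re.VERBOSE | re.DOTALL)


def tokenize(source: str) -> List[Token]:
    """Drop whitespace/comments, rename identifiers to "V", keep offsets."""
    tokens = []
    for m in TOKEN_RE.finditer(source):
        if m.lastgroup == "skip":
            continue
        text = m.group()
        if m.lastgroup == "ident" and text not in KEYWORDS:
            text = "V"
        tokens.append(Token(text, m.start(), m.end()))
    return tokens


def normalized_with_offsets(tokens: List[Token]) -> Tuple[str, List[int]]:
    """Return the normalized string and, per character, the index of its token."""
    chars, origin = [], []
    for idx, tok in enumerate(tokens):
        for ch in tok.text:
            chars.append(ch)
            origin.append(idx)
    return "".join(chars), origin


def kgram_source_span(tokens: List[Token], origin: List[int],
                      i: int, k: int) -> Tuple[int, int]:
    """Map the k-gram starting at normalized index i to (start, end) in the source."""
    first = tokens[origin[i]]
    last = tokens[origin[i + k - 1]]
    return first.start, last.end


source = "public class Foo { int x = 0; } // end"
tokens = tokenize(source)
normalized, origin = normalized_with_offsets(tokens)
start, end = kgram_source_span(tokens, origin, 10, 5)
print(normalized[10:15], "->", repr(source[start:end]))
```

A matching k-gram hash can then be reported against `source[start:end]` rather than against the normalized text.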

Answer 2

A:

Hey, why are you using this step:

This string is then broken down into k-grams of a preset size. For example say k = 5 (in reality it would be larger): publi ublic blicc liccl iccla ... =0;}}

I mean, why is this required for plagiarism detection?

Harsh Gidra 2010-02-27 08:02:40

Read the PDF link I gave above. Basically, by splitting the source code into k-grams and hashing them you can detect matches between documents despite re-ordering and whitespace.

Simucal 2010-02-27 18:17:18
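
A small Python sketch of why k-grams help, under simplifying assumptions: it hashes every k-gram with CRC32 and compares the resulting sets with a Jaccard ratio. This is a plain set comparison, not the rolling hash or the fingerprint selection described in the linked PDF, and the function names (`kgrams`, `fingerprint`, `similarity`) are made up for illustration.

```python
import zlib
from typing import List, Set


def kgrams(text: str, k: int) -> List[str]:
    """All overlapping substrings of length k."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def fingerprint(text: str, k: int = 5) -> Set[int]:
    """Strip all whitespace, then hash each k-gram into a set."""
    stripped = "".join(text.split())
    return {zlib.crc32(g.encode("utf-8")) for g in kgrams(stripped, k)}


def similarity(a: str, b: str, k: int = 5) -> float:
    """Jaccard overlap of the two k-gram hash sets (0.0 to 1.0)."""
    fa, fb = fingerprint(a, k), fingerprint(b, k)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)


original = "int x = 0; int y = 1;"
reordered = "int y=1;\nint x=0;"   # statements swapped, spacing changed
print(similarity(original, reordered))
```

Because the hashes are kept in a set, the order in which k-grams occur does not matter, and stripping whitespace first makes spacing changes invisible. Only the k-grams that straddle the swapped statements differ.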
