views: 611
answers: 6

First, please note that I'm interested in how something like this would work, and am not intending to build it for a client etc., as I'm sure there are already open source implementations.

How do the algorithms that detect plagiarism in uploaded text work? Do they use a regex to send all words to an index, strip out known words like 'the', 'a', etc. and then see how many words are the same in different essays? Do they then have a magic number of identical words that flags it as a possible duplicate? Do they use levenshtein()?

My language of choice is PHP.

UPDATE

I'm thinking of not checking for plagiarism globally, but rather within, say, 30 essays uploaded by one class, in case students have collaborated on a strictly one-person assignment.
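
To make the question concrete, here's the sort of naive approach I'm imagining for the class scenario (completely untested, just a sketch): strip common words, then score each pair of essays by how many distinct words they share. The stop-word list, the $essays array and the 0.5 threshold are all made-up placeholders.

# Sketch only: strip stop words, then score every pair of essays by word overlap.
function wordSet($text) {
  $stopWords = array('the', 'a', 'an', 'and', 'of', 'to', 'in', 'is', 'it'); # tiny example list
  $words = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
  return array_diff(array_unique($words), $stopWords);
}

function overlapScore($textA, $textB) {
  $a = wordSet($textA);
  $b = wordSet($textB);
  if (!count($a) || !count($b)) return 0;
  # shared distinct words divided by total distinct words (Jaccard-style)
  return count(array_intersect($a, $b)) / count(array_unique(array_merge($a, $b)));
}

# $essays is a hypothetical array of ~30 essay strings keyed by student name
foreach ($essays as $nameA => $textA) {
  foreach ($essays as $nameB => $textB) {
    if (strcmp($nameA, $nameB) >= 0) continue; # compare each pair only once
    $score = overlapScore($textA, $textB);
    if ($score > 0.5) { # arbitrary "magic number" threshold
      echo "$nameA and $nameB look suspiciously similar ($score)\n";
    }
  }
}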

Here is an online site that claims to do so: http://www.plagiarism.org/

+3  A: 

It really depends on "plagiarised from where". If you are talking about the context of a single site, that's vastly different from checking across the web, or the Library of Congress, or ...

http://www.copyscape.com/ pretty much proves it can be done.

The basic concept seems to be:

  • Do a Google search for some uncommon word sequences
  • For each result, do a detailed analysis

The detailed analysis portion can certainly be similar, since it is a one-to-one comparison, but locating and obtaining the source documents is the key factor.
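
As a rough sketch of the first step (just an illustration; how you pick "uncommon" sequences and how many you need is up to you), you could pull a few multi-word phrases out of the suspect document and use them as quoted search queries:

# Sketch only: grab a few 6-word sequences to use as quoted search queries.
# Picking them at random is crude; rarer phrases would give better results.
function pickSearchPhrases($text, $phraseLength = 6, $howMany = 3) {
  $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
  $phrases = array();
  if (count($words) < $phraseLength) return $phrases;
  for ($i = 0; $i < $howMany; $i++) {
    $start = rand(0, count($words) - $phraseLength);
    $phrases[] = '"' . implode(' ', array_slice($words, $start, $phraseLength)) . '"';
  }
  return $phrases;
}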

Roger Willcocks
Pretty smart using Google. I do this from time to time when I need to locate a source document and all I have is a sentence or two.
Jonathan Sampson
@Jonathan - same, or when I have some lyrics of a song and can't remember the artist or title. @Roger - Is scraping Google results legal and/or ethical? Is there a better way?
alex
If you were doing it on a "professional" basis, you can sign up for the dev application tokens, etc. and use the APIs they provide. Fairly generous terms for getting started.
Roger Willcocks
+12  A: 

Good plagiarism detection will apply heuristics based on the type of document (e.g. an essay or program code in a specific language).

However, you can also apply a general solution. Have a look at the Normalized Compression Distance (NCD). Obviously you cannot exactly calculate a text's Kolmogorov complexity, but you can approximate it by simply compressing the text.

A smaller NCD indicates that two texts are more similar. Some compression algorithms will give better results than others. Luckily PHP provides support for several compression algorithms, so you can have your NCD-driven plagiarism detection code running in no time. Below I'll give example code which uses Zlib:

PHP:

# NCD: smaller values indicate more similar texts.
function ncd($x, $y) {
  $cx = strlen(gzcompress($x));
  $cy = strlen(gzcompress($y));
  return (strlen(gzcompress($x . $y)) - min($cx, $cy)) / max($cx, $cy);
}

print(ncd('this is a test', 'this was a test') . "\n");
print(ncd('this is a test', 'this text is completely different') . "\n");

Python:

>>> from zlib import compress as c
>>> def ncd(x, y): 
...     cx, cy = len(c(x)), len(c(y))
...     return (len(c(x + y)) - min(cx, cy)) / max(cx, cy) 
... 
>>> ncd('this is a test', 'this was a test')
0.30434782608695654
>>> ncd('this is a test', 'this text is completely different')
0.74358974358974361

Note that for larger texts (read: actual files) the results will be much more pronounced. Give it a try and report your experiences!

Stephan202
+1. I read about compression being used to measure information content a long time ago but your post brought back this interesting idea.
Noufal Ibrahim
+3  A: 

I think this problem is complicated and doesn't have one best solution. You can detect exact duplication of words at the whole-document level (i.e. someone downloads an entire essay from the web) all the way down to the phrase level. Doing this at the document level is pretty easy: the most trivial solution would take the checksum of each document submitted and compare it against a list of checksums of known documents. After that you could try to detect plagiarism of ideas, or find sentences that were copied directly and then changed slightly in order to throw off software like this.
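
A trivial version of that checksum idea might look something like this (a sketch; $submittedEssay and $knownChecksums are hypothetical variables you'd supply and maintain yourself):

# Sketch of the trivial document-level check: hash the normalised text and
# compare it against a hypothetical list of hashes of known documents.
function documentChecksum($text) {
  # normalise whitespace and case so trivial edits don't change the hash
  return md5(strtolower(preg_replace('/\s+/', ' ', trim($text))));
}

if (in_array(documentChecksum($submittedEssay), $knownChecksums)) {
  echo "Submission matches a known document (after whitespace/case normalisation)\n";
}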

To get something that works at the phrase level you might need to get more sophisticated if you want any level of efficiency. For example, you could look for differences in writing style between paragraphs, and focus your attention on paragraphs that feel "out of place" compared to the rest of a paper.
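
As a very crude illustration of the "out of place" idea (a sketch only; real stylometry uses much richer features than this), you could compare something like average sentence length per paragraph and flag outliers:

# Crude sketch: flag paragraphs whose average sentence length deviates a lot
# from the document's mean. Real stylometric features would be far richer.
function averageSentenceLength($paragraph) {
  $sentences = preg_split('/[.!?]+/', $paragraph, -1, PREG_SPLIT_NO_EMPTY);
  return count($sentences) ? str_word_count($paragraph) / count($sentences) : 0;
}

$paragraphs = preg_split('/\n\s*\n/', $essayText); # $essayText is hypothetical
$lengths = array_map('averageSentenceLength', $paragraphs);
$mean = array_sum($lengths) / max(1, count($lengths));

foreach ($lengths as $i => $len) {
  if ($mean > 0 && abs($len - $mean) / $mean > 0.6) { # arbitrary threshold
    echo "Paragraph $i feels out of place (avg sentence length $len vs $mean)\n";
  }
}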

There are lots of papers on this subject out there, so I suspect there is no one perfect solution yet. For example, these two papers give introductions to some of the general issues with this kind of software, and have plenty of references that you could dig deeper into if you'd like.

http://ir.shef.ac.uk/cloughie/papers/pas_plagiarism.pdf

http://proceedings.informingscience.org/InSITE2007/IISITv4p601-614Dreh383.pdf

Peter Recore
Thanks for the links Peter - +1
alex
A: 

What you should be more worried about is finding and developing the students who care about their work. Life will filter out the lazy and the cheaters; computers can't do that.

dudez
+1  A: 

For better results on not-so-big strings:

There are problems with the direct use of the NCD formula on short strings or small texts: NCD(X,X) is not zero (!). To remove this artifact, subtract the self-comparison.

See similar_NCD_gzip() demo at http://leis.saocarlos.sp.gov.br/SIMILAR.php

function similar_NCD_gzip($sx, $sy, $prec=0, $MAXLEN=90000) {
# NCD with gzip artifact correction and percentage return.
# $sx, $sy = strings to compare.
# Use $prec=-1 for a result in the range [0-1], $prec=0 for a percentage,
#     $prec=1 or 2, 3... for more decimal places (not more reliable).
# Use $MAXLEN=-1 or an approximate compressed length.
# For the NCD definition see http://arxiv.org/abs/0809.2553
# (c) Krauss (2010).
  $x = $min = strlen(gzcompress($sx));
  $y = $max = strlen(gzcompress($sy));
  $xy= strlen(gzcompress($sx.$sy));
  $a = $sx;
  if ($x>$y) { # swap min/max
    $min = $y;
    $max = $x;
    $a = $sy;
  }
  $res = ($xy-$min)/$max; # NCD definition.

  # Optional correction (for short strings):
  if ($MAXLEN<0 || $xy<$MAXLEN) {
    $aa= strlen(gzcompress($a.$a));
    $ref = ($aa-$min)/$min;
    $res = $res - $ref; # correction
  }
  return ($prec<0)? $res: 100*round($res,2+$prec);
}
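
# Example usage (hypothetical inputs; a lower percentage means more similar texts):
echo similar_NCD_gzip('this is a test', 'this was a test'), "\n";
echo similar_NCD_gzip('this is a test', 'this text is completely different'), "\n";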
Peter Krauss
+1  A: 

Well, you first of all have to understand what you're up against.

Word-for-word plagiarism should be ridiculously easy to spot. The most naive approach would be to take word tuples of sufficient length and compare them against your corpus. The sufficient length can be incredibly low. Compare Google results:

"I think" => 454,000,000
"I think this" => 329,000,000
"I think this is" => 227,000,000
"I think this is plagiarism" => 5

So even with that approach you have a very high chance of finding a good match or two (fun fact: most criminals are really dumb).
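
In code, the tuple idea might look something like this (a sketch; the tuple length of 5 and the scoring are arbitrary choices):

# Sketch: break each text into overlapping 5-word tuples ("shingles") and see
# what fraction of the suspect's tuples appear verbatim in the source text.
function wordTuples($text, $n = 5) {
  $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
  $tuples = array();
  for ($i = 0; $i + $n <= count($words); $i++) {
    $tuples[] = implode(' ', array_slice($words, $i, $n));
  }
  return array_unique($tuples);
}

function sharedTupleRatio($suspect, $source, $n = 5) {
  $suspectTuples = wordTuples($suspect, $n);
  if (!count($suspectTuples)) return 0;
  $shared = array_intersect($suspectTuples, wordTuples($source, $n));
  return count($shared) / count($suspectTuples);
}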

If the plagiarist used synonyms, changed word ordering and so on, obviously it gets a bit more difficult. You would have to store synonyms as well and try to normalise grammatical structure a bit to keep the same approach working. The same goes for spelling, of course (i.e. try to match by normalisation or try to account for the deviations in your matching, as in the NCD approaches posted in the other answers).
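
A toy normalisation pass along those lines might look like this (the synonym map is a made-up example; in practice you'd want a real thesaurus and a stemmer):

# Toy sketch: lowercase, strip punctuation and map a few synonyms to a
# canonical form before running a tuple comparison like the one above.
function normaliseText($text) {
  $synonyms = array( # made-up examples; use a real thesaurus in practice
    'large' => 'big',
    'rapid' => 'fast',
    'demonstrate' => 'show',
  );
  $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
  foreach ($words as &$word) {
    if (isset($synonyms[$word])) $word = $synonyms[$word];
  }
  unset($word);
  return implode(' ', $words);
}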

However the biggest problem is conceptual plagiarism. That is really hard and there are no obvious solutions without parsing the semantics of each sentence (i.e. sufficiently complex AI).

The truth is, though, that you only need to find SOME kind of match. You don't need to find an exact match in order to find a relevant text in your corpus. The final assessment should always be made by a human anyway, so it's okay if you find an inexact match.

Plagiarists are mostly stupid and lazy, so their copies will be stupid and lazy, too. Some put an incredible amount of effort into their work, but those works are often non-obvious plagiarism in the first place, so it's hard to track down programmatically (i.e. if a human has trouble recognising plagiarism with both texts presented side-by-side, a computer most likely will, too). For all the other 80%-or-so, the dumb approach is good enough.

Alan