Plagiarism Analyzer (compared against Web Content)

+2 Q:

Hi everyone all over the world,

Background

I am a final year student of Computer Science. I've proposed my Final Double Module Project which is a Plagiarism Analyzer, using Java and MySQL.

The Plagiarism Analyzer will:

Scan every paragraph of an uploaded document and report what percentage of each paragraph was copied, and from which website.
Highlight, in each paragraph, only the words copied verbatim, along with the website they came from.

My main objective is to develop something like Turnitin, improved if possible.

I have less than 6 months to develop the program. I have scoped the following:

Web Crawler Implementation. I will probably use the Lucene API or develop my own crawler (which is better in terms of development time and usability?).
Hashing and Indexing. To speed up searching and analysis.

Questions

Here are my questions:

Can MySQL store that much information?
Did I miss any important topics?
What are your opinions concerning this project?
Any suggestions or techniques for performing the similarity analysis?
Can a paragraph be hashed, as well as words?

Thanks in advance for any help and advice. ^^

+2 A:

Read about MySQL's benchmarks.
What other services can you use?
Google and other search engines already index the web. Make use of them.
http://google.com/search?q=similarity+analysis+plagiarism
You can create a hash for any string of characters.
http://en.wikipedia.org/wiki/Plagiarism_detection

Dave Jarvis 2009-10-14 16:24:11

@Mr CooL, make sure you don't copy things from these websites. Your program might flag your own work as plagiarised.

Pasta 2009-10-14 16:39:44

Hi Dave Jarvis, thanks a lot for all the info you have provided here. Your help is greatly appreciated. ;)

Mr CooL 2009-11-02 09:02:49

http://meta.stackoverflow.com/search?q=lmgtfy

Robert Harvey 2009-11-06 22:19:34

+1 A:

1) Make your own web crawler? It looks like you could easily use up all your available time on this task alone. Try using a standard solution for that: it's not the heart of your program.

You will still have the opportunity to make your own or try another one afterwards (if you have time left!). Your program should work only on local files so as not to be tied to a specific crawler/API.

Maybe you'll even have to use different crawlers for different sites.

2) Hashing whole paragraphs is possible. You can hash any string. But of course that means you can only check for whole paragraphs copied exactly. Maybe sentences would be a better unit to test. You should probably "normalize" (transform) the sentences/paragraphs before hashing to sort out minor differences like uppercase/lowercase.
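To make point 2 concrete, here is a minimal Java sketch of "normalize, then hash each sentence". It assumes SHA-256 as the hash, lowercasing plus punctuation stripping as the normalization, and the JDK's BreakIterator for sentence splitting. All of these are illustrative choices, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: names and normalization rules are assumptions for illustration.
class SentenceHasher {

    // Lowercase the text and replace every run of non-letter, non-digit
    // characters with a single space, so case and punctuation differences vanish.
    static String normalize(String text) {
        return text.toLowerCase(Locale.ROOT)
                   .replaceAll("[^\\p{L}\\p{Nd}]+", " ")
                   .trim();
    }

    // Split a paragraph into sentences using the JDK's sentence BreakIterator.
    static List<String> splitSentences(String paragraph) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(paragraph);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = paragraph.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    // Hex-encoded SHA-256 digest of a string (UTF-8 bytes).
    static String sha256Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    // Hash of the normalized sentence: equal for sentences that differ
    // only in case, punctuation, or spacing.
    static String hashSentence(String sentence) {
        return sha256Hex(normalize(sentence));
    }
}
```

With this, hashSentence("The Quick, brown fox!") and hashSentence("the quick   brown FOX") produce the same value, while changing a single word produces a different one. That also shows the limit of the approach: exact hashes only catch copies that survive normalization unchanged.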

3) MySQL can store a lot of data.

The usual advice is: stick to standard SQL. If you discover you have way too much data, you will still have the possibility to use another SQL implementation.

But of course, if you have too much data, start by looking at ways to reduce it, or at least to reduce what's in MySQL. For example, you could store hashes in MySQL but the original pages (if needed) in plain files.
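A rough Java sketch of the "pages in plain files" half of that idea follows. It assumes each page is saved under a file name derived from the SHA-256 of its URL, so that only the short key (plus any sentence hashes) would need to go into MySQL. The PageStore class, its methods, and the file layout are all hypothetical, and the database side is left out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical file-backed page store; the key it returns is what a
// database row would reference instead of holding the page body itself.
class PageStore {
    private final Path root;

    PageStore(Path root) {
        this.root = root;
    }

    // Convenience factory for experiments: stores pages in a fresh temp directory.
    static PageStore inTempDirectory() {
        try {
            return new PageStore(Files.createTempDirectory("pages"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hex-encoded SHA-256 of the URL, used as a stable, filesystem-safe key.
    static String keyFor(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(url.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Write the page body to <root>/<key>.txt and return the key.
    String save(String url, String body) {
        String key = keyFor(url);
        try {
            Files.createDirectories(root);
            Files.write(root.resolve(key + ".txt"), body.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return key;
    }

    // Read a previously saved page body back by its key.
    String load(String key) {
        try {
            return new String(Files.readAllBytes(root.resolve(key + ".txt")), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the key is derived from the URL, saving the same URL twice overwrites the earlier copy rather than accumulating duplicates.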

siukurnin 2009-10-14 16:26:36

Hi Siukurnin, thanks a lot for the advice and the issues you highlighted. I will take all of it into consideration when I start developing my proposed system.

Mr CooL 2009-11-02 09:07:05

+3 A:

Have you considered another project that isn't doomed to failure on account of lack of resources available to you?

If you really want to go the "Hey, let's crawl the whole web!" route, you're going to need to break out things like HBase and Hadoop and lots of machines. MySQL will be grossly insufficient. TurnItIn claims to have crawled and indexed 12 billion pages. Google's index is more like [redacted]. MySQL, or for that matter, any RDBMS, cannot scale to that level.

The only realistic way you're going to be able to pull this off is if you do something astonishingly clever and figure out how to construct queries to Google that will reveal plagiarism of documents that are already present in Google's index. I'd recommend using a message queue and accessing the search API synchronously. The message queue will also allow you to throttle your queries down to a reasonable rate.

Avoid stop words, but you're still looking for near-exact matches, so queries should be like: "* quick brown fox jumped over * lazy dog". Don't bother running queries that end up like: "* * went * * *". And ignore results that come back with 94,000,000 hits. Those won't be plagiarism, they'll be famous quotes or overly general queries. You're looking for either under 10 hits or a few thousand hits that all have an exact match on your original sentence or some similar metric.

And even then, this should just be a heuristic: don't flag a document unless there are lots of red flags. Conversely, if everything comes back as zero hits, they're being unusually original. Book search typically needs more precise queries.

Sufficiently suspicious stuff should trigger HTTP requests for the original pages, and final decisions should always be the purview of a human being. If a document cites its sources, that's not plagiarism, and you'll want to detect that. False positives are inevitable, and will likely be common, if not constant.
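The query-building and hit-count heuristics above could be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the stop-word list is a tiny made-up sample, the 50% wildcard cutoff and the 5,000-hit ceiling are arbitrary stand-ins for "mostly wildcards" and "a few thousand hits", and no real search API is called. The message queue and throttling are not shown.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper; every name and threshold here is an assumption.
class PhraseQueryBuilder {

    // Tiny illustrative stop-word list; a real one would be much longer.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "and", "or", "of", "to", "in", "on",
            "is", "was", "it", "he", "she", "they", "we"));

    // Skip queries where more than this fraction of the words became wildcards.
    static final double MAX_WILDCARD_RATIO = 0.5;

    // Assumed ceiling for "a few thousand hits".
    static final long MAX_SUSPICIOUS_HITS = 5_000;

    enum Verdict { NO_MATCH, SUSPICIOUS, TOO_COMMON }

    // Build a quoted phrase query with stop words replaced by "*".
    // Returns empty if the sentence is blank or the query would be too generic.
    static Optional<String> build(String sentence) {
        String normalized = sentence.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{Nd}]+", " ")
                .trim();
        if (normalized.isEmpty()) {
            return Optional.empty();
        }
        String[] words = normalized.split(" ");
        int wildcards = 0;
        StringBuilder phrase = new StringBuilder();
        for (String word : words) {
            boolean stop = STOP_WORDS.contains(word);
            if (stop) {
                wildcards++;
            }
            if (phrase.length() > 0) {
                phrase.append(' ');
            }
            phrase.append(stop ? "*" : word);
        }
        if ((double) wildcards / words.length > MAX_WILDCARD_RATIO) {
            return Optional.empty();
        }
        return Optional.of("\"" + phrase + "\"");
    }

    // Interpret a hit count: zero hits means no evidence, a small number is
    // worth a closer look, and a huge number is a famous quote or generic query.
    static Verdict classify(long hits) {
        if (hits == 0) {
            return Verdict.NO_MATCH;
        }
        return hits <= MAX_SUSPICIOUS_HITS ? Verdict.SUSPICIOUS : Verdict.TOO_COMMON;
    }
}
```

For example, build("The quick brown fox jumped over the lazy dog.") yields the quoted query "* quick brown fox jumped over * lazy dog", while a sentence made up almost entirely of stop words yields nothing. A SUSPICIOUS verdict is only a reason to fetch the page and compare it properly, not a finding of plagiarism.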

Be aware that the TOS prohibit permanently storing any portion of the Google index.

Regardless, you have chosen to do something exceedingly hard, no matter how you build it, and likely very expensive and time-consuming unless you involve Google.

Bob Aman 2009-10-14 17:50:04

Also, hits for Wikipedia pages are more of a red flag than others, and at least with Wikipedia, it's reasonable to download the entire content and process it directly.

Bob Aman 2009-10-14 20:58:29

Thanks a million, Bob Aman, for all the advice and the significant issues you highlighted. Your kindness is greatly appreciated. Well Bob, there is no U-turn for me regarding the project I've proposed. I will try my best, as I've made up my mind to take on this challenge. I hope my dream will be realized. I always want to do something I'm interested in, and it helps me improve my skills. I'm going to explore the Google Search API. I'm just wondering whether Google imposes any restrictions, because I'm considering using Lucene (a Java open-source search API) instead. Thanks again! ;)

Mr CooL 2009-11-02 08:57:15

By the way, Bob Aman, over these next 6 months, if I run into any problems (ones I will have tried very hard to solve first), can I ask for your guidance here? Thanks once again for your willingness to share your knowledge. Sorry for the late reply; the past few weeks have been hectic.

Mr CooL 2009-11-02 09:00:05

Nah, Google doesn't care if you use Lucene, so long as you're not somehow copying their index into it. There is a terms of service agreement for the Google Search API. It's a rare TOS must-read. And yeah, just reply to this answer again and I should see it.

Bob Aman 2009-11-02 15:54:11

Oh I see. I've actually completed almost 60% of the documentation for this project, so soon I can fully concentrate on development. I will read the rare TOS must-read. Haha. Thank you so much for the help so far. ;)

Mr CooL 2009-11-02 18:17:44

Online code is usually distributed under open-source licenses, and most of it is just tutorials. According to your logic, copying anything from any website is plagiarism, which means you cannot accept and use any answer you get here. If you really want to finish your project, just write a system that compares code from students in the same class and previous classes. It is much more efficient. An example of such a system is MOSS (there's also a paper describing how it works). It is really efficient without any web crawlers.
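For reference, the paper associated with MOSS describes a fingerprinting technique called winnowing: hash every k-character substring of the normalized text, slide a window over those hashes, and keep the minimum hash from each window as a fingerprint. The Java sketch below is a simplified illustration of that idea, not MOSS itself. It uses String.hashCode as a stand-in hash, and the class name, the parameters k and w, and the containment score are assumptions made for this example.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Simplified winnowing sketch; not the actual MOSS implementation.
class Winnowing {

    // Fingerprints of a text: for each window of w consecutive k-gram hashes,
    // keep the minimum hash (the rightmost one on ties).
    static Set<Integer> fingerprints(String text, int k, int w) {
        // Normalize: lowercase and drop everything except letters and digits.
        String s = text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{Nd}]+", "");
        Set<Integer> selected = new HashSet<>();
        int n = s.length() - k + 1; // number of k-grams
        if (n <= 0) {
            return selected;
        }
        int[] hashes = new int[n];
        for (int i = 0; i < n; i++) {
            hashes[i] = s.substring(i, i + k).hashCode(); // stand-in hash function
        }
        int window = Math.min(w, n); // short texts get a single, smaller window
        for (int start = 0; start + window <= n; start++) {
            int minPos = start;
            for (int j = start + 1; j < start + window; j++) {
                if (hashes[j] <= hashes[minPos]) {
                    minPos = j; // "<=" keeps the rightmost minimum
                }
            }
            selected.add(hashes[minPos]);
        }
        return selected;
    }

    // Fraction of the suspect's fingerprints that also occur in the source.
    static double containment(Set<Integer> suspect, Set<Integer> source) {
        if (suspect.isEmpty()) {
            return 0.0;
        }
        int shared = 0;
        for (Integer h : suspect) {
            if (source.contains(h)) {
                shared++;
            }
        }
        return (double) shared / suspect.size();
    }
}
```

Two texts that share a normalized run of at least w + k - 1 characters are guaranteed to share at least one fingerprint, which is what lets this catch partial copies that whole-sentence hashing would miss.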

tulskiy 2009-10-15 00:04:36

Yeah Piligrim, I'm very aware of that. It just so happens that I've proposed a Plagiarism Analyzer that compares against web content, so I can't change my scope. Thanks anyway for your suggestions and information. ;)

Mr CooL 2009-11-02 09:15:51
