Hi everyone all over the world,
Background
I am a final-year Computer Science student. For my Final Double Module Project I have proposed a Plagiarism Analyzer, built with Java and MySQL.
The Plagiarism Analyzer will:
- Scan every paragraph of an uploaded document and report what percentage of each paragraph was copied, and from which website.
- Highlight, in each paragraph, only the words that were copied verbatim, together with the website they came from.
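To make these two points concrete, here is a rough sketch of what I have in mind, not a finished design. It assumes that "copied" means a run of k consecutive words that also appears in the source text; the class name `CopiedWords`, its methods, and the choice of k are all my own placeholders.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class CopiedWords {
    // Lowercase the text and split it into words, dropping punctuation (a simplification).
    static String[] words(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+"))
                .filter(w -> !w.isEmpty())
                .toArray(String[]::new);
    }

    // Collect every run of k consecutive words in the given word array.
    static Set<String> runs(String[] w, int k) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i + k <= w.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(w, i, i + k)));
        }
        return out;
    }

    // Mark each paragraph word that lies inside a k-word run also found in the source.
    // The true entries are the words that would be highlighted.
    static boolean[] copiedMask(String paragraph, String source, int k) {
        String[] p = words(paragraph);
        Set<String> sourceRuns = runs(words(source), k);
        boolean[] copied = new boolean[p.length];
        for (int i = 0; i + k <= p.length; i++) {
            String run = String.join(" ", Arrays.copyOfRange(p, i, i + k));
            if (sourceRuns.contains(run)) {
                Arrays.fill(copied, i, i + k, true);
            }
        }
        return copied;
    }

    // Percentage of the paragraph's words that were marked as copied.
    static double percentCopied(String paragraph, String source, int k) {
        boolean[] mask = copiedMask(paragraph, source, k);
        if (mask.length == 0) {
            return 0.0;
        }
        int hits = 0;
        for (boolean b : mask) {
            if (b) {
                hits++;
            }
        }
        return 100.0 * hits / mask.length;
    }
}
```

For example, `CopiedWords.percentCopied(paragraph, pageText, 3)` would give the percentage for one paragraph against one page, and `copiedMask` gives the word positions to highlight. A small k catches short copied phrases but also flags common expressions, so the value would need tuning.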
My main objective is to develop something like Turnitin, and to improve on it if possible.
I have less than 6 months to develop the program. I have scoped the following:
- Web crawler implementation. I will probably either use the Lucene API or develop my own crawler (which is the better choice in terms of development time and usability?).
- Hashing and indexing, to speed up searching and analysis.
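By hashing and indexing I mean roughly the following. This is only a sketch under my own assumptions: each run of k consecutive words is hashed with Java's built-in `String.hashCode()`, and an in-memory map stands in for whatever MySQL table would hold the index. The class name `ShingleIndex` and the example URLs are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class ShingleIndex {
    private final int k;
    // Maps the hash of a k-word run to the URLs of the pages that contain it.
    private final Map<Integer, Set<String>> index = new HashMap<>();

    ShingleIndex(int k) {
        this.k = k;
    }

    // Lowercase the text and split it into words, dropping punctuation (a simplification).
    private static String[] words(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+"))
                .filter(w -> !w.isEmpty())
                .toArray(String[]::new);
    }

    // Hash the k-word run starting at position i. String.hashCode() is only 32 bits,
    // so two different runs can collide; any hit should be re-checked against the real text.
    private int hashAt(String[] w, int i) {
        return String.join(" ", Arrays.copyOfRange(w, i, i + k)).hashCode();
    }

    // Index one crawled page under its URL.
    void add(String url, String text) {
        String[] w = words(text);
        for (int i = 0; i + k <= w.length; i++) {
            index.computeIfAbsent(hashAt(w, i), h -> new HashSet<>()).add(url);
        }
    }

    // For one paragraph, count how many of its k-word runs each indexed page shares.
    Map<String, Integer> candidates(String paragraph) {
        String[] w = words(paragraph);
        Map<String, Integer> hits = new HashMap<>();
        for (int i = 0; i + k <= w.length; i++) {
            for (String url : index.getOrDefault(hashAt(w, i), Collections.emptySet())) {
                hits.merge(url, 1, Integer::sum);
            }
        }
        return hits;
    }
}
```

The idea is that `candidates` narrows millions of pages down to the few that share any run with the paragraph, and only those would then get the slower word-by-word comparison.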
Questions
Here are my questions:
- Can MySQL store that much information?
- Did I miss any important topics?
- What are your opinions concerning this project?
- Any suggestions or techniques for performing the similarity analysis?
- Can a whole paragraph be hashed, as well as individual words?
Thanks in advance for any help and advice. ^^