I'm experimenting a bit with textual comparison/basic plagiarism detection, and want to try this on a website-to-website basis. However, I'm a bit stuck on finding a proper way to process the text.
How would you process and compare the content of two websites for plagiarism?
I'm thinking something like this pseudo-code:
// extract text
foreach website in websites
    crawl website - store structure so pages are only scanned once
    extract text blocks from all pages - store them in a list

// compare
foreach text in website1.textlist
    compare with all text in website2.textlist
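The compare step above could be sketched in C# like this. Since the actual comparison algorithm isn't decided yet, this uses Jaccard similarity over word shingles purely as a placeholder measure; the class and method names are my own, not from any library:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class TextCompare
{
    // Split a text block into overlapping word n-grams ("shingles").
    public static HashSet<string> Shingles(string text, int n = 3)
    {
        var words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        var result = new HashSet<string>();
        for (int i = 0; i + n <= words.Length; i++)
            result.Add(string.Join(" ", words.Skip(i).Take(n)));
        return result;
    }

    // Jaccard similarity: |A ∩ B| / |A ∪ B|, a value in [0, 1].
    public static double Jaccard(HashSet<string> a, HashSet<string> b)
    {
        if (a.Count == 0 && b.Count == 0) return 0.0;
        int intersection = a.Count(b.Contains);
        return (double)intersection / (a.Count + b.Count - intersection);
    }
}
```

The outer loop would then be something like `foreach (var t1 in site1Blocks) foreach (var t2 in site2Blocks)`, flagging any pair where `Jaccard(Shingles(t1), Shingles(t2))` exceeds some threshold you tune; 0.5 is an arbitrary starting point.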
I realize that this solution could very quickly accumulate a lot of data (comparing every text block on one site against every block on the other is O(n*m)), so it might only be possible to make it work with very small websites.
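One common way to keep the stored data small is to hash each shingle to a fixed-size fingerprint and store only those integers per page, instead of the strings themselves. A minimal sketch, assuming MD5 truncated to 64 bits (my own choice of hash; any stable hash would do):

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

static class Fingerprint
{
    // Hash each shingle to a 64-bit fingerprint so only ulongs are kept per page.
    // Set semantics deduplicate repeated shingles automatically.
    public static HashSet<ulong> Hash(IEnumerable<string> shingles)
    {
        var result = new HashSet<ulong>();
        using (var md5 = MD5.Create())
        {
            foreach (var s in shingles)
            {
                byte[] digest = md5.ComputeHash(Encoding.UTF8.GetBytes(s));
                result.Add(BitConverter.ToUInt64(digest, 0)); // keep first 8 bytes
            }
        }
        return result;
    }
}
```

Jaccard similarity works the same on sets of `ulong` as on sets of strings, so the compare step doesn't change, only the storage footprint.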
I haven't decided on the actual text comparison algorithm yet, but right now I'm more interested in getting the overall processing pipeline working first.
I'm thinking it would be a good idea to extract all text as individual text pieces (from paragraphs, tables, headers and so on), as text can move around on pages.
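Extracting block-level elements individually could look like this in C#. I'm assuming the HtmlAgilityPack NuGet package here, which is one common choice for HTML parsing in .NET, not the only option:

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet: HtmlAgilityPack (assumed library choice)

static class BlockExtractor
{
    // Pull each paragraph, header, list item and table cell out as its own
    // text piece, so moved-around text can still be matched block-by-block.
    public static List<string> ExtractTextBlocks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var blocks = new List<string>();
        var nodes = doc.DocumentNode.SelectNodes("//p|//h1|//h2|//h3|//li|//td");
        if (nodes == null) return blocks; // page had none of these elements

        foreach (var node in nodes)
        {
            var text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (text.Length > 0) blocks.Add(text);
        }
        return blocks;
    }
}
```

You'd likely want to extend the XPath list (e.g. `//div` with no block children, `//blockquote`) depending on how the target sites are marked up.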
I'm implementing this in C# (maybe ASP.NET).
I'm very interested in any input or advice you might have, so please shoot! :)