views:

464

answers:

3

Hello, I am facing a problem while developing my web app; here is the description:

This web app (still in alpha) is based on user-generated content (usually short articles, although they can grow fairly long, to about a quarter of a screen). Every user submits at least 10 of these articles, so the number should grow pretty fast. By nature, about 10% of the articles will be duplicates, so I need an algorithm to detect them.

I have come up with the following steps:

  1. On submission, compute the length of the text and store it in a separate table (article_id, length). The problem is that the articles are encoded using PHP's htmlspecialchars() function, and users post content with slight modifications (someone will miss a comma or an accent, or even skip some words)
  2. Then retrieve all the entries from the database whose length is within new_post_length ± 5% (should I use a different threshold, keeping in mind the human factor in article submission?)
  3. Fetch the first 3 keywords and compare them against the articles retrieved in step 2
  4. With a final array of the most probable matches, compare the new entry using PHP's levenshtein() function (a rough sketch of the whole pipeline is below)
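
Here is a rough, untested sketch of what I have in mind (table names, thresholds and the extract_keywords() helper below are placeholders, and I decode the HTML entities before comparing):

    <?php
    // Rough sketch of the duplicate check run at submission time.
    // Table/column names (articles, article_lengths, ...) are placeholders.

    function find_probable_duplicates(PDO $db, string $newText): array
    {
        $newText = html_entity_decode($newText);   // normalize entities before comparing
        $len     = mb_strlen($newText);

        // Step 2: candidates whose stored length is within +/- 5% of the new post.
        $stmt = $db->prepare(
            'SELECT a.id, a.body FROM articles a
             JOIN article_lengths l ON l.article_id = a.id
             WHERE l.length BETWEEN :min AND :max'
        );
        $stmt->execute(['min' => (int)($len * 0.95), 'max' => (int)($len * 1.05)]);

        // Step 3: keep only candidates that contain the first 3 keywords.
        $keywords = array_slice(extract_keywords($newText), 0, 3); // placeholder helper
        $matches  = [];
        foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
            $body = html_entity_decode($row['body']);
            foreach ($keywords as $kw) {
                if (mb_stripos($body, $kw) === false) {
                    continue 2;                    // keyword missing -> skip this candidate
                }
            }
            // Step 4: Levenshtein on a short prefix only, since PHP's levenshtein()
            // is O(n*m) and intended for short strings.
            $distance = levenshtein(mb_substr($newText, 0, 200), mb_substr($body, 0, 200));
            if ($distance < 30) {                  // arbitrary threshold, needs tuning
                $matches[$row['id']] = $distance;
            }
        }
        asort($matches);
        return $matches;                           // article_id => distance, best first
    }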

This process must be executed on article submission, not via cron. However, I suspect it will put a heavy load on the server.

Could you provide any ideas, please?

Thank you! Mike

A: 

I'd like to point out that git, the version control system, has excellent algorithms for detecting duplicate or near-duplicate content. When you make a commit, it will show you the files modified (even across renames) and what percentage of each changed.

It's open source, and largely written in small, focused C programs. Perhaps there is something you could use.

gahooa
A: 

You could design your app to reduce the load by not having to check text strings and keywords against all other posts in the same category. What if you had users submit the third-party content they are referencing as URLs? See Tumblr's implementation: basically, there is a free-form text field so each user can comment and create the narrative portion of the post, but then there are also formatted fields depending on the type of reference the user is adding (video, image, link, quote, etc.). An improvement on Tumblr would be letting the user add as many or as few types of formatted content as they want in any given post.

Then you are only checking against known types, like a URL or an embedded video code. Combine that with rexem's suggestion to force users to classify posts by category or genre of some kind, and you'll have a much smaller scope to search for duplicates.
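
As a rough illustration (the table name and normalization rules below are made up, adjust them to your schema), checking a submitted reference URL against existing posts is far cheaper than comparing article text:

    <?php
    // Sketch: dedupe on the referenced URL instead of the article text.
    // Table name (post_references) and columns are only illustrative.

    function normalize_url(string $url): string
    {
        $parts = parse_url(trim($url)) ?: [];
        $host  = strtolower($parts['host'] ?? '');
        $host  = preg_replace('/^www\./', '', $host);  // treat www.foo.com == foo.com
        $path  = rtrim($parts['path'] ?? '', '/');
        $query = isset($parts['query']) ? '?' . $parts['query'] : '';
        return $host . $path . $query;                 // ignore scheme and fragment
    }

    function find_posts_with_same_reference(PDO $db, string $url): array
    {
        $stmt = $db->prepare(
            'SELECT post_id FROM post_references WHERE normalized_url = :u'
        );
        $stmt->execute(['u' => normalize_url($url)]);
        return $stmt->fetchAll(PDO::FETCH_COLUMN);     // ids of posts sharing the reference
    }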

Also, if you can give each user some way of posting to their own "stream", then it doesn't matter if many people duplicate the same content. Give people some way to vote items up from the individual streams to a main "front page" stream so the community can regulate when they see duplicate items. Instead of an up/down vote like Digg or Reddit, you could add a way for people to merge/append posts to related posts (letting them sort and manage the content as an activity in your app rather than making it an issue of behind-the-scenes processing).

KellyRued
+1  A: 

Text similarity / plagiarism / duplicate detection is a big topic. There are many algorithms and solutions.

Levenshtein will not work in your case. You can only use it on small texts: the algorithm is O(n·m) in the lengths of the two strings, so comparing two articles of a couple of thousand characters each already means millions of operations per pair, and doing that against every candidate would kill your CPU.

Some projects use "adaptive local alignment of keywords" (you will find information about it on Google).

You can also check this question (follow the 3 links in the answer, they are very instructive):

http://stackoverflow.com/questions/945724/cosine-similarity-vs-hamming-distance/1290286#1290286
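
For reference, a bare-bones word-frequency cosine similarity in PHP looks roughly like this (no stemming, stop words or weighting, so treat it only as a starting point):

    <?php
    // Minimal cosine similarity over word-frequency vectors -- illustration only.

    function term_freq(string $text): array
    {
        preg_match_all('/\pL+/u', mb_strtolower($text), $m);
        return array_count_values($m[0]);          // word => occurrence count
    }

    function cosine_similarity(string $a, string $b): float
    {
        $va = term_freq($a);
        $vb = term_freq($b);

        $dot = 0.0;
        foreach ($va as $word => $count) {
            $dot += $count * ($vb[$word] ?? 0);
        }

        $normA = 0.0;
        foreach ($va as $count) { $normA += $count * $count; }
        $normB = 0.0;
        foreach ($vb as $count) { $normB += $count * $count; }

        return ($normA && $normB) ? $dot / sqrt($normA * $normB) : 0.0;
    }

    // e.g. cosine_similarity($newArticle, $candidate) > 0.85  => probably a duplicate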

Hope this will help.

Toto