Hello, I am facing a problem while developing my web app. Here is the description:
This web app (still in alpha) is built on user-generated content, usually short articles, although they can grow fairly long (roughly a quarter of a screen). Every user submits at least 10 of these articles, so the total should grow quickly. By nature, about 10% of the articles will be duplicates, so I need an algorithm to detect them.
I have come up with the following steps (a rough PHP sketch of the whole pipeline follows the list):

- On submission, normalize the text and store its length in a separate table (`article_id`, `length`). The complication is that the articles are encoded with PHP's htmlentities() function, and users post content with slight modifications (someone will miss a comma or an accent, or even skip a few words).
- Retrieve all entries from the database whose length falls within `new_post_length` ± 5% (should I use a different threshold, keeping the human factor in article submission in mind?).
- Fetch the first 3 keywords and compare them against the articles fetched in step 2.
- With the final array of most probable matches, compare the new entry using PHP's levenshtein() function.
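To make the steps concrete, here is a minimal sketch of what I have in mind, assuming a PDO connection, an `articles` table with a `body` column, and a companion metadata table I call `articles_meta` (`article_id`, `length`). The table and column names, the keyword heuristic, and the distance cut-off are all placeholders:

```php
<?php
// Rough sketch only: the articles / articles_meta names, the keyword
// heuristic, and the distance cut-off are placeholders, not a final design.

// Step 1 helper: normalize before measuring, so the htmlentities() encoding
// and minor punctuation differences don't distort the length comparison.
function normalize_article(string $text): string
{
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8'); // undo htmlentities()
    $text = mb_strtolower($text, 'UTF-8');
    $text = preg_replace('/[^\p{L}\p{N}\s]/u', '', $text);  // strip punctuation
    return preg_replace('/\s+/u', ' ', trim($text));        // collapse whitespace
}

function find_probable_duplicates(PDO $pdo, string $newText, float $tolerance = 0.05): array
{
    $clean = normalize_article($newText);
    $len   = mb_strlen($clean, 'UTF-8');

    // Step 2: fetch candidates whose stored length is within +/- 5%.
    $stmt = $pdo->prepare(
        'SELECT article_id, body
           FROM articles
           JOIN articles_meta USING (article_id)
          WHERE length BETWEEN :min AND :max'
    );
    $stmt->execute([
        'min' => (int) floor($len * (1 - $tolerance)),
        'max' => (int) ceil($len * (1 + $tolerance)),
    ]);

    // Step 3: a cheap keyword pre-filter -- here the first three words longer
    // than four characters stand in for "keywords".
    $words = array_filter(explode(' ', $clean), function ($w) {
        return mb_strlen($w, 'UTF-8') > 4;
    });
    $keywords = array_slice(array_values($words), 0, 3);

    $matches = [];
    foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
        $candidate = normalize_article($row['body']);

        // Skip candidates that share none of the keywords.
        $hits = 0;
        foreach ($keywords as $kw) {
            if (mb_strpos($candidate, $kw) !== false) {
                $hits++;
            }
        }
        if ($keywords !== [] && $hits === 0) {
            continue;
        }

        // Step 4: levenshtein() has historically refused strings longer than
        // 255 bytes, so compare fixed-length prefixes to be safe.
        $distance = levenshtein(substr($clean, 0, 255), substr($candidate, 0, 255));
        if ($distance < 25) { // arbitrary cut-off, to be tuned on real data
            $matches[$row['article_id']] = $distance;
        }
    }
    asort($matches);  // closest match first
    return $matches;  // article_id => distance
}
```

The keyword filter in step 3 is just one guess at what "first 3 keywords" could look like in code; a proper keyword extractor would replace it.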
This process must be executed on article submission, not via cron, but I suspect it will put a heavy load on the server. One load-reducing idea I am toying with is sketched below.
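Since exact resubmissions are presumably the cheapest duplicates to catch, a hash lookup could short-circuit most of the work before any levenshtein() call runs. This assumes a hypothetical `fingerprint` column added to the same metadata table:

```php
<?php
// Sketch: catch byte-for-byte duplicates with a cheap hash lookup before
// running any levenshtein() comparisons. The `fingerprint` column on
// articles_meta is hypothetical -- it would need to be added and indexed.

function is_exact_duplicate(PDO $pdo, string $newText): bool
{
    // normalize_article() is the helper from the pipeline sketch above.
    $fingerprint = md5(normalize_article($newText));

    $stmt = $pdo->prepare(
        'SELECT 1 FROM articles_meta WHERE fingerprint = :fp LIMIT 1'
    );
    $stmt->execute(['fp' => $fingerprint]);

    return (bool) $stmt->fetchColumn();
}
```

That only catches identical normalized text, of course, so the fuzzy pipeline above would still have to run for near-duplicates.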
Could you suggest any ideas, please?
Thank you! Mike