If you have ever submitted a story to Digg, you will have seen that it checks whether the story has already been submitted, I assume by a fuzzy search.

I would like to implement something similar and want to know whether they are using an open source PHP class.

Soundex isn't doing it; the sentences/strings can be up to 250 characters long.

A: 

Depending on the size of your dataset, you could use MySQL's FULLTEXT search: look for items that have a high relevance score and were submitted within a certain timeframe, and suggest them to the user as possible duplicates.

More about score here: http://stackoverflow.com/questions/230129/mysql-fulltext-search-score-explained
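
A minimal sketch of that lookup in PHP, for illustration only. The table and column names (`stories`, `title`, `submitted_at`), the 7-day window and the function names are all assumptions, and the sketch presumes a FULLTEXT index already exists on `title`. The query is built separately from the PDO call so it can be inspected without a database.

```php
<?php
// Hypothetical schema: stories(id, title, submitted_at) with FULLTEXT(title).

/**
 * Build the SQL and bound parameters for a "possible duplicates" lookup.
 * Returns [sql, params] so the query can be checked without a database.
 */
function buildDuplicateQuery(string $title, int $days = 7, int $limit = 5): array
{
    $sql = 'SELECT id, title, '
         . 'MATCH(title) AGAINST (:q1 IN NATURAL LANGUAGE MODE) AS score '
         . 'FROM stories '
         . 'WHERE MATCH(title) AGAINST (:q2 IN NATURAL LANGUAGE MODE) '
         . 'AND submitted_at >= :cutoff '
         . 'ORDER BY score DESC '
         . 'LIMIT ' . max(1, $limit);

    // Compute the start of the time window in PHP to keep the SQL simple.
    $cutoff = date('Y-m-d H:i:s', time() - $days * 86400);

    return [$sql, [':q1' => $title, ':q2' => $title, ':cutoff' => $cutoff]];
}

/** Run the lookup on an existing PDO connection (the connection is not created here). */
function findPossibleDuplicates(PDO $pdo, string $title): array
{
    [$sql, $params] = buildDuplicateQuery($title);
    $stmt = $pdo->prepare($sql);
    $stmt->execute($params);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```

What counts as a "high" score depends on your data, so any threshold for showing a suggestion would have to be tuned by hand.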

Pete
Maths isn't my strong point
chris
Unfortunately, programming has a lot to do with mathematics.
Pete
A: 
pp19dd
A: 

I would suggest taking the user-submitted URLs and storing them in multiple parts: domain name, path and query string. Use the PHP parse_url() function to derive the parts of the submitted URL.

Index at least the domain name and path. Then, when a user submits a new URL, search your database for a record matching the domain and path. Since the columns are indexed, you first filter out all records that are not in the same domain and then search only the remaining records. Depending on your dataset, this should be faster than simply indexing the entire URL. If you use a composite index, put the domain column first; it is the column order in the index, rather than the order of the conditions in the WHERE clause, that lets MySQL narrow by domain before path.
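
As a rough illustration of the splitting step, here is a small sketch built on parse_url(). The function name `splitSubmittedUrl` and the normalisation rules (lowercasing the host, dropping a leading "www.", trimming a trailing slash) are my own assumptions; adjust them to whatever you consider "the same URL".

```php
<?php
/**
 * Split a submitted URL into the parts to store and index.
 * Returns null when the URL has no recognisable host.
 */
function splitSubmittedUrl(string $url): ?array
{
    $parts = parse_url(trim($url));
    if ($parts === false || empty($parts['host'])) {
        return null;
    }

    // Normalisation choices below are assumptions, not something parse_url() does.
    $host = strtolower($parts['host']);
    if (strpos($host, 'www.') === 0) {
        $host = substr($host, 4);
    }
    $path = rtrim($parts['path'] ?? '/', '/');

    return [
        'domain' => $host,
        'path'   => $path === '' ? '/' : $path,
        'query'  => $parts['query'] ?? '',
    ];
}

$parts = splitSubmittedUrl('http://WWW.Example.com/news/story-1/?ref=rss');
print_r($parts);
```

The resulting `domain` and `path` values would go into their own indexed columns (hypothetical names), and the duplicate check becomes an exact match on both.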

If that does not meet your needs, I would suggest trying Sphinx. Sphinx is an open source full-text search engine that works alongside SQL databases and is typically much faster than MySQL's built-in full-text search. It supports stemming and some other nice features.

http://sphinxsearch.com/

You could also take the title or text content of the user's submission, run it through a function to generate keywords, and search the database for existing records with those or similar keywords.
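
One possible shape for the keyword idea, as a sketch rather than a recommendation: reduce each title to a set of keywords and compare the sets with Jaccard similarity (shared keywords divided by all distinct keywords). The function names, the tiny stop-word list and the choice of Jaccard are all assumptions made for this example.

```php
<?php
/** Reduce a title to a set of lowercase keywords, dropping short and common words. */
function extractKeywords(string $text): array
{
    // Tiny illustrative stop-word list; a real one would be much longer.
    $stopWords = ['the', 'and', 'for', 'with', 'that', 'this', 'from', 'are', 'was'];

    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $words = array_filter($words, function ($w) use ($stopWords) {
        return strlen($w) > 2 && !in_array($w, $stopWords, true);
    });

    return array_values(array_unique($words));
}

/** Jaccard similarity of the two keyword sets, from 0.0 (nothing shared) to 1.0 (identical). */
function keywordSimilarity(string $a, string $b): float
{
    $ka = extractKeywords($a);
    $kb = extractKeywords($b);
    $union = count(array_unique(array_merge($ka, $kb)));
    if ($union === 0) {
        return 0.0;
    }
    return count(array_intersect($ka, $kb)) / $union;
}

$score = keywordSimilarity(
    'Google launches new search engine for code',
    'New code search engine launched by Google'
);
echo round($score, 2), "\n"; // prints 0.71 (5 shared keywords out of 7 distinct)
```

Note that "launches" and "launched" count as different keywords here; stemming, which Sphinx supports, would close that gap.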

John Kramlich