views:

131

answers:

3

I have a site that needs to search thru about 20-30k records, which are mostly movie and TV show names. The site runs php/mysql with memcache.

Im looking to replace the FULLTEXT with soundex() searching that I currently have, which works... kind of, but isn't very good in many situations.

Are there any decent search scripts out there that are simple to implement, and will provide a decent searching capability (of 3 columns in a table).

A: 

There is a function SOUNDEX in mysql. If you want to search for a movie title :

select * from movie where soundex(title) = soundex( 'the title' );

Of course it doesn't work to search in text, such as movie or plot summary.


Soundex is a relatively simple algo. You can also decide to handle all that at the applicative level, it may be easier:

  • when text is stored, tokenize it and apply soundex on all words
  • store the original text and the soundex version in two columns
  • when you search, compute the soundex at the app. level and then use a regular LIKE at the db level.
ewernli
I'm intrigued. If film.title is "Apocalypse Now" but user searches "Apocalypse Cow", then the soundex might well find a match doing '... where film.title="$user_title" OR soundex( film.title ) = soundex( $user_title)'; But this would not work in say, film.intro var(255) which contained "Martin Sheen" and user searched for "Martin Shean" have I understood?Sorry if this Q/comment is in wrong place, please put me right if so.
Cups
soundex('A movie with Martin Sheen') -> A513563525,select soundex('A'), soundex('movie'), soundex('with'), soundex('Martin'),soundex('Sheen'); --> A000, M100, W300, M635, S500 So if you store the soundex version of the text 'A000 M100 W300 M635 S500' and search with LIKE '%M635%S500%' that might be ok.This is however still not optimal. Searching for 'Mart Insheen' (possible misunderstanding of the name) yield LIKE "%M630%I525%' and would not work.
ewernli
+1  A: 

If you are looking for a simple existing solution instead of creating your own solution check out

Jeremiah Stover
+2  A: 

ewemli's answer is in the right direction but you should be combining FULLTEXT and soundex mapping, not replacing the fulltext, otherwise your LIKE queries are likely be very slow.

create table with_soundex (
  id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
  original TEXT,
  soundex TEXT,
  FULLTEXT (soundex)
);

insert into with_soundex (original, soundex) values 

('add some test cases', CONCAT_WS(' ', soundex('add'), soundex('some'), soundex('test'), soundex('cases'))),
('this is some text', CONCAT_WS(' ', soundex('this'), soundex('is'), soundex('some'), soundex('text'))),
('one more test case', CONCAT_WS(' ', soundex('one'), soundex('more'), soundex('test'), soundex('case'))),
('just filling the index', CONCAT_WS(' ', soundex('just'), soundex('filling'), soundex('the'), soundex('index'))),
('need one more example', CONCAT_WS(' ', soundex('need'), soundex('one'), soundex('more'), soundex('example'))),
('seems to need more', CONCAT_WS(' ', soundex('seems'), soundex('to'), soundex('need'), soundex('more')))
('some helpful cases to consider', CONCAT_WS(' ', soundex('some'), soundex('helpful'), soundex('cases'), soundex('to'), soundex('consider')))

select * from with_soundex where match(soundex) against (soundex('test'));
+----+---------------------+---------------------+
| id | original            | soundex             |
+----+---------------------+---------------------+
|  1 | add some test cases | A300 S500 T230 C000 | 
|  2 | this is some text   | T200 I200 S500 T230 | 
|  3 | one more test case  | O500 M600 T230 C000 | 
+----+---------------------+---------------------+

select * from with_soundex where match(soundex) against (CONCAT_WS(' ', soundex('test'), soundex('some')));
+----+--------------------------------+---------------------------+
| id | original                       | soundex                   |
+----+--------------------------------+---------------------------+
|  1 | add some test cases            | A300 S500 T230 C000       | 
|  2 | this is some text              | T200 I200 S500 T230       | 
|  3 | one more test case             | O500 M600 T230 C000       | 
|  7 | some helpful cases to consider | S500 H414 C000 T000 C5236 | 
+----+--------------------------------+---------------------------+

That gives pretty good results (within the limits of the soundex algo) while taking maximum advantage of an index (any query LIKE '%foo' has to scan every row in the table).

Note the importance of running soundex on each word, not on the entire phrase. You could also run your own version of soundex on each word rather than having SQL do it but in that case make sure you do it both when storing and retrieving in case there are differences between the algorithms (for instance, MySQL's algo doesn't limit itself to the standard 4 chars)

Rob Van Dam