tags:
views: 38
answers: 2

Using MySQL: suppose the movies table has 20,000 records, and each record has a description field up to 2 KB long. How can we search for movies with the word "nature" in the description? If possible, it should be fast, rather than going through all 20,000 records (in other situations, such as for books, n can be 200,000 or more).

+2  A: 

You want to use a full-text search index in this case. Be aware that there are some catches, though, such as minimum word length, stop-words, etc.

The syntax for FTS looks like this:

WHERE MATCH (field) AGAINST ('text');
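
For example, assuming your table is movies with id, title and description columns (a sketch; note that on MySQL versions before 5.6, FULLTEXT indexes require the MyISAM storage engine):

ALTER TABLE movies ADD FULLTEXT INDEX ft_description (description);

SELECT id, title
FROM movies
WHERE MATCH (description) AGAINST ('nature');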
Nathan Reed
+4  A: 

I wouldn't process the description column directly - per-row functions in selects rarely scale well. One of the guidelines I subscribe to is to never process things inside columns: descriptions in your case, parts of comma-separated-value columns, or even the components of names (first/last) and addresses (street/town/state). If you're doing that, there's usually a more efficient way.

What I would do is to have insert, update and delete triggers on the table. For the insert/update triggers, I would populate another table along the lines of DescLookup below:

Movies:
    Id primary key
    Title
    Description
DescLookup:
    Word
    MovieId foreign key Movies(Id)
    Count
    primary key (Word,MovieId)
    index (MovieId)
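
In MySQL terms, that lookup table might be declared like this (a sketch; the column types are illustrative):

CREATE TABLE DescLookup (
    Word    VARCHAR(64) NOT NULL,
    MovieId INT         NOT NULL,
    Count   INT         NOT NULL,
    PRIMARY KEY (Word, MovieId),
    INDEX (MovieId),
    FOREIGN KEY (MovieId) REFERENCES Movies (Id)
);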

Basically, for every non-noise word in the description (i.e., discounting things like "and", "or", "by", punctuation, single-letter words and so on), you get an entry in this table, with the word lower-cased.

Make sure that the trigger deletes all current rows for that MovieId before re-populating lest you be left with incorrect information.
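
As a rough sketch, the update trigger could look like the following (the trigger name and stop-word list are illustrative only; this splits on spaces and ignores punctuation, which a real version would strip; the insert trigger is analogous, AFTER INSERT and minus the DELETE):

DELIMITER //
CREATE TRIGGER movies_au AFTER UPDATE ON Movies
FOR EACH ROW
BEGIN
    DECLARE remaining TEXT;
    DECLARE word VARCHAR(64);
    -- remove the current rows for this movie first, as noted above
    DELETE FROM DescLookup WHERE MovieId = NEW.Id;
    SET remaining = TRIM(LOWER(NEW.Description));
    WHILE LENGTH(remaining) > 0 DO
        -- peel off the next space-delimited word
        SET word = SUBSTRING_INDEX(remaining, ' ', 1);
        SET remaining = TRIM(SUBSTRING(remaining, LENGTH(word) + 2));
        -- discard noise: single letters and stop-words
        IF LENGTH(word) > 1 AND word NOT IN ('and', 'or', 'by', 'the') THEN
            INSERT INTO DescLookup (Word, MovieId, Count)
            VALUES (word, NEW.Id, 1)
            ON DUPLICATE KEY UPDATE Count = Count + 1;
        END IF;
    END WHILE;
END//
DELIMITER ;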

Then you use that table to run your queries. This moves the "cost" of finding the words to the insert/update rather than every single select, amortising that cost considerably. This works well because the vast majority of databases are read far more often than written so moving the cost to the write part is a good idea.

Keep in mind there's extra storage required for this but, if you examine the large number of questions people ask about databases, "How can I do this fast?" far outweighs "How can I use less disk space?".

And the delete trigger will simply remove all entries in the DescLookup table with the relevant MovieId.
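
That one is short enough to sketch in full (the trigger name is again illustrative; BEFORE rather than AFTER so the foreign key check doesn't block the parent delete):

CREATE TRIGGER movies_bd BEFORE DELETE ON Movies
FOR EACH ROW
    DELETE FROM DescLookup WHERE MovieId = OLD.Id;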

Because the Word column is indexed (and, as you requested, you're no longer searching every single description field), searches on it will be blindingly fast. That's because:

select MovieId from DescLookup where Word = 'nature';

will blow:

select Id from Movies where lower(Description) like '%nature%';

out of the water.

paxdiablo
This is a really cool approach, but I'm guessing the OP already has his table populated. Can you suggest a way for him to generate the starting dataset from what he has now?
Andrew Heath
Yes, create the auxiliary table, set up the insert/update trigger, then just do: `update movies set desc = desc;`. That should fire the trigger for every row in the database. If your DBMS is so smart (or sneaky, IMNSHO) as to not recognise this as a real update, simply do: `update movies set desc = ' '||desc; update movies set desc = substr(desc,2);` or something similar.
paxdiablo