Say, using MySQL, if the movies
table has 20,000 records, and each record has a field that is the description of the movie, up to 2k byte long. How can we search for movies with the word "nature" in its description? If possible, it is to be fast, instead of going through all the 20,000 records. (if in some other situations, like for books, where n
can be 200,000 or more).
views:
38answers:
2You want to use a full-text search index in this case. Be aware that there are some catches though, such as minimum word, length, stop-words etc.
The syntax for FTS looks like this:
WHERE MATCH (field) AGAINST ('text');
I wouldn't process the description column directly - per-row functions on selects rarely scale well. One of the guidelines I subscribe to is to never have to process things inside columns (like descriptions in your case, or parts of comma-separated-variable columns or even names (first/last) and address (street/town/state) components). If you're doing that, there's usually a more efficient way.
What I would do is to have insert, update and delete triggers on the table. For the insert/update triggers, I would populate another table along the lines of DescLookup
below:
Movies:
Id primary key
Title
Description
DescLookup:
Word
MovieId foreign key Movies(Id)
Count
primary key (Word,MovieId)
index (MovieId)
Basically, for every non-noise word in the description (i.e., discount things like and
, or
, by
, punctuation, single-letter words and so on), you get an entry in this table (with the lower-cased word).
Make sure that the trigger deletes all current rows for that MovieId
before re-populating lest you be left with incorrect information.
Then you use that table to run your queries. This moves the "cost" of finding the words to the insert/update rather than every single select, amortising that cost considerably. This works well because the vast majority of databases are read far more often than written so moving the cost to the write part is a good idea.
Keep in mind there's extra storage required for this but, if you examine the large number of questions people ask about databases, "How can I do this fast?" far outweighs "How can I use less disk space?".
And the delete trigger will simply remove all entries in the DescLookup
table with the relevant MovieId
.
Because the Word
column is indexed (and also, as you requested, you will not be searching every single description field), searches on it will be blindingly fast. That's because:
select MovieId from DescLookup where Word = 'nature';
will blow:
select Id from Movies where lower(Description) like '%nature%';
out of the water.