I'm an advanced PHP developer and very familiar with small-scale MySQL data sets. However, I'm now building a large infrastructure for a startup I've recently joined, and their servers push around 1 million rows of new data every day on their existing architecture.

I need to know the best way to search through large data sets (the table currently holds 84.9 million rows, and the database is 394.4 gigabytes). It is hosted on Amazon RDS, so there is no downtime or general slowness; I just want to know the best way to query large data sets internally.

For example, searching through the 84 million rows takes me 6 minutes, yet a direct request for a specific id or title is served instantly. So how should I search through a large data set?

To reiterate: looking up a record in the database by a single value is fast, but searching performs VERY slowly.

MySQL query example:

    SELECT u.*, COUNT(*) AS user_count, f.*
    FROM users u
    LEFT JOIN friends f ON u.user_id = (f.friend_from || f.friend_to)
    WHERE u.user_name LIKE ('%james%smith%')
    GROUP BY u.signed_up
    LIMIT 0, 100

That query against 84m rows is significantly slow: 47.41 seconds to run standalone. Any ideas, guys?

All I need is that challenge sorted and I'll get the drift. Also, I know MySQL isn't ideal for large data sets and that something like Oracle or MSSQL might fit better, but I've been told to rebuild it on MySQL rather than another database solution at this moment.

+1  A: 

LIKE is VERY slow for a variety of reasons:

  • Unless your LIKE expression starts with a constant, no index will be used.

    E.g. LIKE ('james%smith%') is good, LIKE ('%james%smith%') is bad for indexing. Your example will NOT use any index on the "user_name" field (see the EXPLAIN sketch after this list).

  • String matching is algorithmically complex business compared to regular comparison operators.
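
You can see the difference with EXPLAIN (a hypothetical check; it assumes an index exists on users.user_name, which the question doesn't confirm):

    -- Leading constant: MySQL can do an index range scan.
    EXPLAIN SELECT user_id FROM users WHERE user_name LIKE 'james%smith%';

    -- Leading wildcard: MySQL falls back to a full table scan.
    EXPLAIN SELECT user_id FROM users WHERE user_name LIKE '%james%smith%';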

To resolve:

  • Make sure your LIKE expression starts with a constant, not a wildcard, so that any index you have on that field can actually be used.

  • Consider making an index table (in the literature/library sense of the word "index", not the database sense) if you search for whole words, or a substring lookup table if you search for frequently repeated substrings.

    E.g. if all user names are of the form "FN LN" or "LN, FN", split them up and store first names and/or last names in a dictionary table, joining to that table (and doing straight equality) in your query - see the sketch below.
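
A minimal sketch of that dictionary approach (the table and column names here are illustrative, not from the question's schema):

    -- Hypothetical dictionary table: one row per name part, per user.
    CREATE TABLE user_names (
        user_id   INT NOT NULL,
        name_part VARCHAR(40) NOT NULL,
        PRIMARY KEY (user_id, name_part),
        INDEX (name_part)
    );

    -- Straight equality against an indexed column instead of LIKE:
    SELECT u.*
    FROM users u
    JOIN user_names n ON n.user_id = u.user_id
    WHERE n.name_part = 'smith';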

DVK
A: 
LIKE ('%james%smith%')

Avoid these things like the plague. They are impossible for a general DBMS to optimise.

The right way is to calculate things like this (first and last names) at the time the data is inserted or updated, so that the cost is amortised across all reads. This can be done by adding two new columns (indexed) and using insert/update triggers.
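
One possible shape for this, assuming names are stored as "FN LN" (the column names and trigger here are a sketch, not the poster's actual schema):

    -- Assumed new columns; an identical BEFORE UPDATE trigger would
    -- keep them in sync when rows change.
    ALTER TABLE users
        ADD COLUMN first_name VARCHAR(40),
        ADD COLUMN last_name  VARCHAR(40),
        ADD INDEX idx_first_name (first_name),
        ADD INDEX idx_last_name (last_name);

    DELIMITER //
    CREATE TRIGGER users_split_name BEFORE INSERT ON users
    FOR EACH ROW
    BEGIN
        SET NEW.first_name = SUBSTRING_INDEX(NEW.user_name, ' ', 1);
        SET NEW.last_name  = SUBSTRING_INDEX(NEW.user_name, ' ', -1);
    END//
    DELIMITER ;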

Or, if you want all words in the column, have the trigger break the data into words, then use an application-level index table to find relevant records, something like:

    CREATE TABLE main_table (
        id   INTEGER PRIMARY KEY,
        -- ... other columns ...
        text VARCHAR(60)
    );

    CREATE TABLE appl_index (
        id   INTEGER NOT NULL,      -- references main_table.id
        word VARCHAR(20) NOT NULL,
        PRIMARY KEY (id, word),
        INDEX (word)
    );

Then you can query appl_index to find those ids that have both james and smith in them, far faster than the abominable LIKE '%...'. You could also break the actual words out into a separate table and use word IDs, but that's a matter of taste - its effect on performance would be questionable.
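
For instance, the two-word lookup might look like this (assuming appl_index is populated as above):

    -- ids whose text contains both words:
    SELECT id
    FROM appl_index
    WHERE word IN ('james', 'smith')
    GROUP BY id
    HAVING COUNT(DISTINCT word) = 2;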

You may well have a similar problem with f.friend_from||f.friend_to, but I've not seen that syntax before (if, as it seems to be, the intent is that u.user_id can be either one).

Basically, if you want your databases to scale, don't do anything that even looks like a per-row function in your selects. Take that from someone who works with mainframe databases where 84 million rows is about the size of our config tables :-)
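
As a generic illustration of that rule (table and column names invented for the example):

    -- Per-row function on the column: an index on created_at is useless.
    SELECT * FROM orders WHERE YEAR(created_at) = 2010;

    -- Equivalent range predicate: the same index becomes usable.
    SELECT * FROM orders
    WHERE created_at >= '2010-01-01' AND created_at < '2011-01-01';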

And, as with all optimisation questions, measure, don't guess!

paxdiablo
@pax - if your config tables are 84mm rows, I have two words for you: FEATURE CREEP! :)
DVK