I'm an advanced PHP developer and very familiar with small-scale MySQL data sets. However, I'm now building a large infrastructure for a startup I've recently joined, and their servers push around 1 million rows of new data every day on their existing architecture.

I need to know the best way to search through large data sets (the table currently holds 84.9 million rows, and the database is 394.4 gigabytes). It is hosted on Amazon RDS, so there is no downtime or general slowness; I just want to know the best way to query large data sets internally.

For example, searching through the 84 million rows takes me 6 minutes, yet a direct request for a specific id or title is served instantly. So how should I search through a large data set?

To reiterate: looking up a record in the database by a single value is fast, but searching performs VERY slowly.

MySQL query example:

    SELECT u.*, COUNT(*) AS user_count, f.*
    FROM users u
    LEFT JOIN friends f ON u.user_id = (f.friend_from || f.friend_to)
    WHERE u.user_name LIKE ('%james%smith%')
    GROUP BY u.signed_up
    LIMIT 0, 100

That query against 84m rows is significantly slow: 47.41 seconds to run standalone. Any ideas, guys?

All I need is that challenge sorted and I'll get the drift. Also, I know MySQL isn't ideal for large data sets and that something like Oracle or MSSQL might fit better, but I've been told to rebuild it on MySQL rather than another database solution at this moment.

+1  A: 

LIKE is VERY slow for a variety of reasons:

  • Unless your LIKE expression starts with a constant, no index will be used.

    E.g. LIKE ('james%smith%') is good, LIKE ('%james%smith%') is bad for indexing. Your example will NOT use any index on the "user_name" field (see the EXPLAIN sketch after this list).

  • String matching is algorithmically complex business compared to regular comparison operators.
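
You can see the difference with EXPLAIN (a hypothetical check; it assumes an index exists on users.user_name, which the question doesn't confirm):

    -- Leading constant: MySQL can do an index range scan.
    EXPLAIN SELECT user_id FROM users WHERE user_name LIKE 'james%smith%';

    -- Leading wildcard: MySQL falls back to a full table scan.
    EXPLAIN SELECT user_id FROM users WHERE user_name LIKE '%james%smith%';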

To resolve:

  • Make sure your LIKE expression starts with a constant, not a wildcard, so that any index you have on that field can actually be used.

  • Consider making an index table (in the literature/library sense of the word "index", not the database sense) if you search for whole words, or a substring lookup table if you search for frequently repeated substrings.

    E.g. if all user names are of the form "FN LN" or "LN, FN", split them up and store first names and/or last names in a dictionary table, joining to that table (and doing straight equality) in your query - see the sketch below.
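
A minimal sketch of that dictionary approach (the table and column names here are illustrative, not from the question's schema):

    -- Hypothetical dictionary table: one row per name part, per user.
    CREATE TABLE user_names (
        user_id   INT NOT NULL,
        name_part VARCHAR(40) NOT NULL,
        PRIMARY KEY (user_id, name_part),
        INDEX (name_part)
    );

    -- Straight equality against an indexed column instead of LIKE:
    SELECT u.*
    FROM users u
    JOIN user_names n ON n.user_id = u.user_id
    WHERE n.name_part = 'smith';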

DVK
A: 
LIKE ('%james%smith%')

Avoid these things like the plague. They are impossible for a general DBMS to optimise.

The right way is to calculate things like this (first and last names) at the time the data is inserted or updated, so that the cost is amortised across all reads. This can be done by adding two new columns (indexed) and using insert/update triggers.
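
One possible shape for this, assuming names are stored as "FN LN" (the column names and trigger here are a sketch, not the poster's actual schema):

    -- Assumed new columns; an identical BEFORE UPDATE trigger would
    -- keep them in sync when rows change.
    ALTER TABLE users
        ADD COLUMN first_name VARCHAR(40),
        ADD COLUMN last_name  VARCHAR(40),
        ADD INDEX idx_first_name (first_name),
        ADD INDEX idx_last_name (last_name);

    DELIMITER //
    CREATE TRIGGER users_split_name BEFORE INSERT ON users
    FOR EACH ROW
    BEGIN
        SET NEW.first_name = SUBSTRING_INDEX(NEW.user_name, ' ', 1);
        SET NEW.last_name  = SUBSTRING_INDEX(NEW.user_name, ' ', -1);
    END//
    DELIMITER ;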

Or, if you want all words in the column, have the trigger break the data into words, then use an application-level index table to find relevant records, something like:

    CREATE TABLE main_table (
        id   INTEGER PRIMARY KEY,
        -- ... other columns ...
        text VARCHAR(60)
    );

    CREATE TABLE appl_index (
        id   INTEGER NOT NULL,      -- references main_table.id
        word VARCHAR(20) NOT NULL,
        PRIMARY KEY (id, word),
        INDEX (word)
    );

Then you can query appl_index to find those ids that have both james and smith in them, far faster than the abominable LIKE '%...'. You could also break the actual words out into a separate table and use word IDs, but that's a matter of taste - its effect on performance would be questionable.
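
For instance, the two-word lookup might look like this (assuming appl_index is populated as above):

    -- ids whose text contains both words:
    SELECT id
    FROM appl_index
    WHERE word IN ('james', 'smith')
    GROUP BY id
    HAVING COUNT(DISTINCT word) = 2;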

You may well have a similar problem with f.friend_from||f.friend_to, but I've not seen that syntax before (if, as it seems to be, the intent is that u.user_id can be either one).

Basically, if you want your databases to scale, don't do anything that even looks like a per-row function in your selects. Take that from someone who works with mainframe databases where 84 million rows is about the size of our config tables :-)
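
As a generic illustration of that rule (table and column names invented for the example):

    -- Per-row function on the column: an index on created_at is useless.
    SELECT * FROM orders WHERE YEAR(created_at) = 2010;

    -- Equivalent range predicate: the same index becomes usable.
    SELECT * FROM orders
    WHERE created_at >= '2010-01-01' AND created_at < '2011-01-01';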

And, as with all optimisation questions, measure, don't guess!

paxdiablo
@pax - if your config tables are 84mm rows, I have two words for you: FEATURE CREEP! :)
DVK