views: 50 | answers: 2

Hi there,

This is the first time I'm dealing with an extremely high-volume situation. It's an ad server backed by MySQL, and the query it runs involves a lot of JOINs and is generally just slow. (This is Rails ActiveRecord, by the way.)

sel = Ads.find(:all,
  :select => '*',
  :joins => "JOIN campaigns ON ads.campaign_id = campaigns.id JOIN users ON campaigns.user_id = users.id LEFT JOIN countries ON countries.campaign_id = campaigns.id LEFT JOIN keywords ON keywords.campaign_id = campaigns.id",
  :conditions => [flashstr + "keywords.word = ? AND ads.format = ? AND campaigns.cenabled = 1 AND (countries.country IS NULL OR countries.country = ?) AND ads.enabled = 1 AND campaigns.dailyenabled = 1 AND users.uenabled = 1", kw, format, viewer['country'][0]],
  :order => order,
  :limit => limit)

My questions:

  1. Is there an alternative to MySQL that has JOIN support but is much faster? (I know there's Postgres/PostgreSQL; still evaluating it.)

  2. Otherwise, would firing up a MySQL instance, loading a local database into memory and re-loading that every 5 minutes help?

  3. Otherwise, is there any way I could switch this entire operation to Redis or Cassandra, and somehow change the JOIN behavior to match the (non-JOIN-able) nature of NoSQL?
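For question 2, one way to prototype the in-memory copy is to rebuild a MEMORY-engine table in the background and swap it in atomically with RENAME TABLE. A hedged sketch, not tested against this schema: the `_mem` naming and the refresh cadence are my assumptions, and note the MEMORY engine doesn't support BLOB/TEXT columns.

```ruby
# Sketch: generate the statements that refresh an in-memory copy of a table.
# These would be run via ActiveRecord::Base.connection.execute from a
# cron/background job every ~5 minutes.
def memory_refresh_sql(table)
  [
    "DROP TABLE IF EXISTS #{table}_mem_new",
    "CREATE TABLE #{table}_mem_new ENGINE=MEMORY AS SELECT * FROM #{table}",
    # RENAME TABLE swaps atomically, so readers never see a half-built copy.
    "RENAME TABLE #{table}_mem TO #{table}_mem_old, #{table}_mem_new TO #{table}_mem",
    "DROP TABLE #{table}_mem_old"
  ]
end

memory_refresh_sql('ads').each { |sql| puts sql }
```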

Thank you!


EDIT: here are more details:

Full executed SQL with flattened select (truncated above):

SELECT campaigns.id, campaigns.guid, campaigns.user_id, campaigns.dailylimit, campaigns.impressions, campaigns.cenabled, campaigns.dayspent, campaigns.dailyenabled, campaigns.fr,
       ads.id, ads.guid, ads.user_id, ads.campaign_id, ads.format, ads.enabled, ads.datafile, ads.data1, ads.data2, ads.originalfilename, ads.aid, ads.impressions,
       countries.id, countries.guid, countries.campaign_id, countries.country,
       keywords.id, keywords.campaign_id, keywords.word, keywords.bid
FROM ads
JOIN campaigns ON ads.campaign_id = campaigns.id
JOIN users ON campaigns.user_id = users.id
LEFT JOIN countries ON countries.campaign_id = campaigns.id
LEFT JOIN keywords ON keywords.campaign_id = campaigns.id
WHERE (keywords.word = 'design' AND ads.format = 10 AND campaigns.cenabled = 1
  AND (countries.country IS NULL OR countries.country = 82)
  AND ads.enabled = 1 AND campaigns.dailyenabled = 1 AND users.uenabled = 1
  AND ads.datafile != '')
ORDER BY keywords.bid DESC
LIMIT 1,1

EXPLAIN/execution plan:

+----+-------------+-----------+--------+------------------+-------------+---------+------------------------------------+------+----------------------------------------------+
| id | select_type | table     | type   | possible_keys    | key         | key_len | ref                                | rows | Extra                                        |
+----+-------------+-----------+--------+------------------+-------------+---------+------------------------------------+------+----------------------------------------------+
|  1 | SIMPLE      | keywords  | ref    | campaign_id,word | word        | 257     | const                              |    9 | Using where; Using temporary; Using filesort | 
|  1 | SIMPLE      | ads       | ref    | campaign_id      | campaign_id | 4       | e_development.keywords.campaign_id |    8 | Using where                                  | 
|  1 | SIMPLE      | campaigns | eq_ref | PRIMARY          | PRIMARY     | 4       | e_development.keywords.campaign_id |    1 | Using where                                  | 
|  1 | SIMPLE      | users     | eq_ref | PRIMARY          | PRIMARY     | 4       | e_development.campaigns.user_id    |    1 | Using where                                  | 
|  1 | SIMPLE      | countries | ALL    | campaign_id      | NULL        | NULL    | NULL                               |    4 | Using where                                  | 
+----+-------------+-----------+--------+------------------+-------------+---------+------------------------------------+------+----------------------------------------------+

(this is on a development database, which doesn't have nearly as many rows as the production version.)

DEFINED INDICES:

ads -> id (primary, autoinc) + aid (unique) + campaign_id (index) + user_id (index)
campaigns -> id (primary, autoinc) + user_id (index)
countries -> id (primary, autoinc) + campaign_id (index) + country (index) + user_id (index)
keywords -> id (primary, autoinc) + campaign_id (index) + word (index) + user_id (index)
user -> id (primary, autoinc)
+1  A: 

Have you analysed your execution plan? Have you analysed your indices?

My first guess would be that you need an index on campaigns for user_id, an index on countries for campaign_id, and one on keywords for campaign_id... maybe others. You need to get an execution plan to see what your query is doing.

The other option: how often does the data in this result set change? By the minute? Hour? Day? If it changes daily or hourly (well, every several hours), it might be better to have a secondary table that contains ALL the columns of this result set (or just the columns that aren't likely to change frequently), populated by this query every n hours. Your app would then query the secondary table (perhaps joining with the one table that has frequently changing data); it could be faster that way.
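That secondary-table idea, sketched against the schema in the question. The `eligible_ads` name, the column choice, and the hourly rebuild cadence are my assumptions, not the poster's code:

```ruby
# Sketch: one denormalized row per matching (ad, keyword, country) combo,
# with the static enabled-flags baked in at rebuild time. An hourly job runs
# FLATTEN_SQL (into a _new table, then swaps it in), and the serving path
# runs SERVE_SQL against a single table.
FLATTEN_SQL = <<-SQL
  CREATE TABLE eligible_ads_new AS
  SELECT keywords.word, keywords.bid, ads.format, countries.country,
         ads.id AS ad_id, ads.datafile, ads.data1, ads.data2
  FROM ads
  JOIN campaigns ON ads.campaign_id = campaigns.id
  JOIN users ON campaigns.user_id = users.id
  LEFT JOIN countries ON countries.campaign_id = campaigns.id
  LEFT JOIN keywords ON keywords.campaign_id = campaigns.id
  WHERE ads.enabled = 1 AND campaigns.cenabled = 1
    AND campaigns.dailyenabled = 1 AND users.uenabled = 1
SQL

SERVE_SQL = <<-SQL
  SELECT * FROM eligible_ads
  WHERE word = ? AND format = ? AND (country IS NULL OR country = ?)
  ORDER BY bid DESC LIMIT 1
SQL
```

With a composite index on (word, format, country, bid), the serving query becomes a single-table indexed lookup instead of a five-table join.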

FrustratedWithFormsDesigner
The execution plan is above, and I did already have the indices you mentioned (see above). The data changes roughly every hour. I like the idea of a secondary table, but I have a feeling it'd be a very, very large table since we're doing a lot of joins. Maybe sticking it in memory is a better idea? (We have a lot of memory to work with, btw.) Thank you for your answer!
jkaz
@jkaz: yeah, I think I posted before you posted the plan and index info. Looks good I guess, so time to optimize something else! The in-memory table/cache with periodic updates is still the best thing I've seen so far for this, but after some coffee, maybe I'll have some other ideas. ;)
FrustratedWithFormsDesigner
Thanks, Frustrated!
jkaz
+1  A: 

Database theory and the nominal practice exist to provide a framework for the majority of cases. Not every database usage pattern fits neatly into 3rd normal form; hence the emergence of NoSQL. These databases don't work well in the majority of cases, but they work great in specific ones. One reason they work well is that they DON'T work like a normal RDBMS. Cassandra does have some facility for 'joining', but I don't remember the exact details. If you want a quick understanding, I'd recommend the Digg developers' blog; there's a nice, simple description.

The problem is that I'll bet you a pickle that joining 4 tables in one of them would be slower than in MySQL. And the only way to know for sure would be learning a new DBMS, installing it, tuning the install as well as you can tune MySQL, setting up all your data, and... you'll likely find out MySQL does pretty damn well.

Trying to solve the EXACT SAME problem the EXACT SAME way with a different engine won't cut it... you have to THINK like a NoSQL developer, not an RDBMS developer using NoSQL.

But you can think about the problem as Frustrated suggests.

Why do we have Third Normal Form? Ease of update, mainly: I update one row instead of dozens. It also helps constrain data: if I carefully control the addition of countries in the country table, I'll never get a bad one in the campaign table. Beyond that, 3NF doesn't make querying faster, which is why we invented reporting databases, OLAP, cubes, and star schemas.

The key is that reporting needs a different structure than editing/capturing.

As Frustrated said, determine the speed of change in your underlying data. If you're really adding countries every 5 minutes, I'll be stunned. Campaigns? Probably occasionally. Ads? A couple of times a day. How long would it take to build a fully flattened table and index it? How many rows does that produce? If that cycle time is much shorter than your update frequency... build it and see. Test the query speed. That's a cheaper experiment than going for a whole new DB.
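That "build it and time it" experiment can be scripted with Ruby's stdlib Benchmark. In this sketch, rebuild_flat_table is a placeholder; in a real test it would execute the flattening SQL through ActiveRecord:

```ruby
require 'benchmark'

# Placeholder for the real rebuild, e.g. running a CREATE TABLE ... SELECT
# through ActiveRecord::Base.connection.execute; here it just sleeps briefly.
def rebuild_flat_table
  sleep 0.01
end

elapsed = Benchmark.realtime { rebuild_flat_table }
puts "rebuild took #{'%.3f' % elapsed}s"
# If this is much shorter than how often the data actually changes (roughly
# hourly, per the comments above), the periodic-rebuild approach is viable.
```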

Stephanie Page
I agree, that's an excellent idea and I'll definitely be trying it out. It would certainly be faster than doing all the joins (maybe we can even stick the resulting flat table in a NoSQL store!), so I'll try it and report back with results. Thank you!
jkaz
Mmm... no. You wouldn't put them in a NoSQL table unless you want basically one indexed value (I said basically). At its heart, NoSQL is simply keyword/value pairs, where the value can be a complex type (like an address). If you want to search all addresses by zip there are further complications, though I think recent enhancements can speed that up. But that's been in MySQL for 15 years: an index on zip, voila. If you always use the same filter columns, your index will be obvious; if you mix and match, research how MySQL can use more than one index. I've posted about that here.
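To make the "think like a NoSQL developer" point concrete: instead of joining at read time, precompute the answer per key. A toy sketch in plain Ruby (in production this might be, say, a Redis sorted set per key with bid as the score; all names here are illustrative, not from the poster's code):

```ruby
# Precomputed index: key => ads sorted by bid, descending. The key combines
# the serving query's filter columns (keyword, format, country).
ADS = Hash.new { |h, k| h[k] = [] }

def index_ad(word:, format:, country:, ad_id:, bid:)
  key = "#{word}:#{format}:#{country || 'any'}"
  ADS[key] << { :ad_id => ad_id, :bid => bid }
  ADS[key].sort_by! { |a| -a[:bid] }  # keep highest bid first
end

def top_ad(word, format, country)
  # Merge country-targeted candidates with untargeted ('any') ones and
  # take the highest bid, mirroring the (country IS NULL OR country = ?)
  # branch of the original WHERE clause.
  candidates = ADS["#{word}:#{format}:#{country}"] + ADS["#{word}:#{format}:any"]
  candidates.max_by { |a| a[:bid] }
end

index_ad(:word => 'design', :format => 10, :country => 82,  :ad_id => 1, :bid => 0.40)
index_ad(:word => 'design', :format => 10, :country => nil, :ad_id => 2, :bid => 0.55)
top_ad('design', 10, 82)  # => ad 2 (the untargeted ad with the higher bid)
```

The lookup is a couple of hash reads and a max, with all the join work paid once at index time.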
Stephanie Page