ansaurus

Question

Answer 1

A:

The problem is most likely the WHERE condition:

where (left(`l`.`name`,(locate(_utf8'_',`l`.`name`) - 1)) regexp _utf8'[[:digit:]]+')

This looks like complex string comparison, so no index can be used, which results in a full table scan, possibly for every row in the result set. I am not a MySQL expert, but if you can simplify this into more simple column comparisons it will probably run much faster.

FelixM 2009-12-22 04:59:30

That was my idea too, but in order to fake the user ID to Rails every contact list needs to have the user ID appended to it in the other app (e.g. User #3 in the Rails app's contact list MUST be name 3_Contact_List_Name), so there was no other way I was aware of to ensure that this constraint is enforced to make sure it pulls the right records.

Wayne M 2009-12-22 12:17:11

Answer 2

A:

That is, without a doubt, the second most hideous SQL query I have ever laid my eyes on :-)

My advice is to trade storage requirements for speed. This is a common trick used when you find your queries have a lot of per-row functions (ifnull, case and so forth). These per-row functions never scale very well as the table becomes larger.

Create new fields in the table which will hold the values you want to extract and then calculate those values on insert/update (with a trigger) rather than select. This doesn't technically break 3NF since the triggers guarantee data consistency between columns.

The vast majority of database tables are read far more often than they're written so this will amortise the cost of the calculation across many selects. In addition, just about every reported problem with databases is one of speed, not storage.

An example of what I mean. You can replace:

case when (`s`.`fieldid` in (2,35)) then `s`.`data` else NULL end

with:

`s`.`data_2_35`

in your query if your insert/update trigger simply sets the data_2_35 column to data or NULL depending on the value of fieldid. Then you index data_2_35 and, voila, instant speed improvement at the cost of a little storage.

This trick can be done to the five case clauses, the left/regexp bit and the "naked" ifnull function as well (the ifnull functions containing min and group_concat may be harder to do).

paxdiablo 2009-12-22 05:04:35

Show me the *first* most hideous SQL query you've seen. :)

netrox 2009-12-22 05:09:20

That would be one of mine from my early days of which I have removed all traces, including from my brain (other than the sheer horror of it, of course). :-)

paxdiablo 2009-12-22 05:21:01

LOL! Are you making a *Get Smart* reference with the "second-most-hideous" comment?

Bill Karwin 2009-12-22 05:35:22

Answer 3

A:

The first thing that jumps out at me as the source of all the trouble:

The PHP app's table is an E-A-V style table...

Trying to convert data in EAV format into conventional relational format on the fly using SQL is bound to be awkward and inefficient. So don't try to smash it into a conventional column-per-attribute format. The following query returns multiple rows per subscriber, one row per EAV attribute:

SELECT ls.subscriberid AS id,
  SUBSTRING_INDEX(l.name, _utf8'_', 1) AS user_id,
  COALESCE(ls.emailaddress, _utf8'') AS email_address,
  s.fieldid, s.data
FROM list_subscribers ls JOIN lists l ON (ls.listid = l.listid)
  LEFT JOIN subscribers_data s ON (ls.subscriberid = s.subscriberid
      AND s.fieldid IN (2,3,34,35,36,81,100,154)
WHERE SUBSTRING_INDEX(l.name, _utf8'_', 1) REGEXP _utf8'[[:digit:]]+'

This eliminates the GROUP BY which is not optimized well in MySQL -- it usually incurs a temporary table which kills performance.

id      user_id      email_address     fieldid        data
1       1            [email protected]  2              John
1       1            [email protected]  3              Doe
1       1            [email protected]  81             5551234567

But you'll have to sort out the EAV attributes in application code. That is, you can't seamlessly use ActiveRecord in this case. Sorry about that, but that's one of the disadvantages of using a non-relational design like EAV.

The next thing that I notice is the killer string manipulation (even after I've simplified it with SUBSTRING_INDEX()). When you're picking substrings out of a column, this says you me that you've overloaded one column with two distinct pieces of information. One is the name and the other is some kind of list-type attribute that you would use to filter the query. Store one piece of information in each column.

You should add a column for this attribute, and index it. Then the WHERE clause can utilize the index:

SELECT ls.subscriberid AS id,
  SUBSTRING_INDEX(l.name, _utf8'_', 1) AS user_id,
  COALESCE(ls.emailaddress, _utf8'') AS email_address,
  s.fieldid, s.data
FROM list_subscribers ls JOIN lists l ON (ls.listid = l.listid)
  LEFT JOIN subscribers_data s ON (ls.subscriberid = s.subscriberid
      AND s.fieldid IN (2,3,34,35,36,81,100,154)
WHERE l.list_name_contains_digits = 1;

Also, you should always analyze an SQL query with EXPLAIN if it's important for them to have good performance. There's an analogous feature in MS SQL Server, so you should be accustomed to the concept, but the MySQL terminology may be different.

You'll have to read the documentation to learn how to interpret the EXPLAIN report in MySQL, there's too much info to describe here.

Re your additional info: Yes, I understand you can't do away with the EAV table structure. Can you create an additional table? Then you can load the EAV data into it:

CREATE TABLE subscriber_mirror (
  subscriberid INT PRIMARY KEY,
  first_name      VARCHAR(100),
  last_name       VARCHAR(100),
  first_name2     VARCHAR(100),
  last_name2      VARCHAR(100),
  mobile_phone    VARCHAR(100),
  sms_only        VARCHAR(100),
  mobile_carrier  VARCHAR(100)
);

INSERT INTO subscriber_mirror (subscriberid) 
    SELECT DISTINCT subscriberid FROM list_subscribers;

UPDATE subscriber_data s JOIN subscriber_mirror m USING (subscriberid)
SET m.first_name    = IF(s.fieldid = 2,  s.data, m.first_name),
    m.last_name     = IF(s.fieldid = 3,  s.data, m.last_name),
    m.first_name2   = IF(s.fieldid = 35, s.data, m.first_name2),
    m.last_name2    = IF(s.fieldid = 36, s.data, m.last_name2),
    m.mobile_phone  = IF(s.fieldid = 81, s.data, m.mobile_phone),
    m.sms_only      = IF(s.fieldid = 100, s.data, m.sms_only),
    m.mobile_carrer = IF(s.fieldid = 34,  s.data, m.mobile_carrier);

This will take a while, but you only need to do it when you get a new data update from the vendor. Subsequently you can query subscriber_mirror in a much more conventional SQL query:

SELECT ls.subscriberid AS id, l.name+0 AS user_id,
  COALESCE(s.first_name, s.first_name2) AS first_name,
  COALESCE(s.last_name, s.last_name2) AS last_name,
  COALESCE(ls.email_address, '') AS email_address),
  COALESCE(s.mobile_phone, '') AS mobile_phone,
  COALESCE(s.sms_only, '') AS sms_only,
  COALESCE(s.mobile_carrier, '') AS mobile_carrier
FROM lists l JOIN list_subscribers USING (listid)
JOIN subscriber_mirror s USING (subscriberid)
WHERE l.name+0 > 0

As for the userid that's embedded in the l.name column, if the digits are the leading characters in the column value, MySQL allows you to convert to an integer value much more easily:

An expression like '123_bill'+0 yields an integer value of 123. An expression like 'bill_123'+0 has no digits at the beginning, so it yields an integer value of 0.

Bill Karwin 2009-12-22 05:33:13

Unfortunately we're stuck with this application since we use it for managing each user's contact lists (it's a commercial app) and the designers of said app decided to use EAV.

Wayne M 2009-12-22 12:13:53

Also since the app gets regular updates from it's vendor and we're required to upgrade to receive support, I can't really add extra columns to their tables in case it breaks anything. The regexp is there to ensure that the list names contain a digit, which is extracted out into user ID so Rails knows which user has which contact list.

Wayne M 2009-12-22 12:43:02

The only problem with your second solution is the subscribers_data table is updated almost constantly since the third party tool allows users to sign up for mailings. So there could be hundreds of new users on a daily basis.

Wayne M 2009-12-22 21:33:29

Okay, but updating the `subscribers_mirror` table incrementally with a few hundred rows a day as they come in is easy. And then querying this normalized table is a lot more efficient than trying to work with the EAV data.

Bill Karwin 2009-12-22 21:41:00

ansaurus

tags:

views:

answers:

Optimizing a strange MySQL Query

related questions