views:

113

answers:

3

I have a SUPER slow query, which I posted here: http://pastebin.com/E5sdRi7e. When I did an EXPLAIN, I got the following:

id  select_type         table       type    possible_keys  key           key_len  ref                                 rows  Extra
1   PRIMARY             <derived2>  ALL     NULL           NULL          NULL     NULL                                5     Using filesort
2   DERIVED             Workflow    ALL     PRIMARY        NULL          NULL     NULL                                9     Using temporary; Using filesort
2   DERIVED             <derived3>  ALL     NULL           NULL          NULL     NULL                                141   Using where; Using join buffer
2   DERIVED             DataSource  ALL     PRIMARY        NULL          NULL     NULL                                1310  Using where; Using join buffer
2   DERIVED             <derived4>  ALL     NULL           NULL          NULL     NULL                                1310  Using where; Using join buffer
2   DERIVED             User        eq_ref  PRIMARY        PRIMARY       4        LatestDataSourceActivityLog.UserId  1
4   DERIVED             t1          ALL     NULL           NULL          NULL     NULL                                5400  Using where; Using temporary; Using filesort
5   DEPENDENT SUBQUERY  t2          ref     DataSourceId   DataSourceId  4        companyname_db.t1.DataSourceId      4
3   DERIVED             DataSource  range   PRIMARY        PRIMARY       4        NULL                                142   Using where

What does the above table tell me? Does it help me identify which fields should be indexed?

Any help is greatly appreciated.

Query

SELECT WrappedData.*
FROM   (SELECT ParentLeafNodeDataSource.Id,
               LatestDataSourceActivityLog.UserId,
               DataSource.Status AS StatusCode,
               ( CASE
                   WHEN User.Name IS NULL THEN 'CompanyName'
                   ELSE User.Name
                 END )           AS `Username`,
               Workflow.Name     AS WorkflowName,
               LatestDataSourceActivityLog.Timestamp
        FROM   DataSource,
               Workflow,
               (SELECT *
                FROM   DataSource
                WHERE  DataSource.Id IN ( 0, 1, 2, 3,
                                          4, 5, 6, 7,
                                          8, 9, 10, 11,
                                          12, 13, 16, 21,
                                          22, 23, 24, 25,
                                          26, 27, 28, 29,
                                          30, 31, 32, 33,
                                          34, 35, 36, 37,
                                          38, 39, 40, 41,
                                          42, 43, 44, 45,
                                          46, 47, 48, 49,
                                          50, 51, 52, 53,
                                          54, 55, 56, 57,
                                          58, 59, 60, 61,
                                          62, 63, 64, 65,
                                          66, 67, 68, 69,
                                          70, 71, 72, 73,
                                          74, 75, 76, 77,
                                          78, 79, 80, 81,
                                          83, 84, 85, 86,
                                          87, 88, 89, 90,
                                          91, 92, 93, 94,
                                          95, 96, 97, 98,
                                          99, 100, 101, 102,
                                          103, 104, 105, 106,
                                          107, 108, 109, 110,
                                          111, 112, 113, 114,
                                          115, 116, 117, 118,
                                          119, 120, 142, 1293,
                                          1294, 1295, 1296, 1297,
                                          1298, 1299, 143, 1300,
                                          1301, 1302, 1303, 1304,
                                          1305, 1306, 144, 146,
                                          145, 1307, 1308, 1309,
                                          1310, 147, 149, 148,
                                          150, 151 )) AS ParentLeafNodeDataSource,
               (SELECT t1.*
                FROM   DataSourceActivityLog AS t1
                WHERE  Timestamp = (SELECT Max(t2.Timestamp)
                                    FROM   DataSourceActivityLog AS t2
                                    WHERE  t1.DataSourceId = t2.DataSourceId)
                GROUP  BY t1.DataSourceId) AS LatestDataSourceActivityLog
               LEFT JOIN User
                 ON User.Id = LatestDataSourceActivityLog.UserId
        WHERE  ParentLeafNodeDataSource.Status = '203'
                OR ParentLeafNodeDataSource.Status = '204'
                   AND Workflow.Id = ParentLeafNodeDataSource.WorkflowId
                   AND LatestDataSourceActivityLog.DataSourceId = ParentLeafNodeDataSource.Id
                   AND DataSource.Id = LatestDataSourceActivityLog.DataSourceId
                   AND LatestDataSourceActivityLog.UserId = 1
        GROUP  BY ParentLeafNodeDataSource.Id) AS WrappedData
ORDER  BY WrappedData.`Timestamp` DESC
A: 

Have you considered the MySql Query Profiler ?

This is how you will understand your performance problems.

Without that step, most people here will sadly prefer writing jokes on your query than trying to help you.

Pierre 303
@Pierre - is the query that bad that it's a joking matter? If so, would you be able to point out what my mistakes were?
StackOverflowNewbie
c'mon Pierre. Why so serious?!
Henrik P. Hessel
@Pierre, I ran profiler. The suspect entry was this: "Copying to tmp table 97.271238". Now, how do I know which part of my SQL is causing this copying to temp table?
StackOverflowNewbie
+2  A: 

It's very difficult to say conclusively, but here are a couple refactoring things.

On performance, the first thing to look at are GROUP functions.

           (SELECT t1.*
            FROM   DataSourceActivityLog AS t1
            WHERE  Timestamp = (SELECT Max(t2.Timestamp)
                                FROM   DataSourceActivityLog AS t2
                                WHERE  t1.DataSourceId = t2.DataSourceId)
            GROUP  BY t1.DataSourceId) AS LatestDataSourceActivityLog

Which can eliminate the use of MAX entirely

           (SELECT t1.*
            FROM   DataSourceActivityLog AS t1
            WHERE  Timestamp = (SELECT t2.Timestamp
                                FROM   DataSourceActivityLog AS t2
                                WHERE  t1.DataSourceId = t2.DataSourceId
                                ORDER BY t2.Timestamp DESC
                                LIMIT 1)
            GROUP  BY t1.DataSourceId) AS LatestDataSourceActivityLog

Probably not a big performance issue, but here you can use IFNULL or COALESCE instead of a CASE.

( CASE
    WHEN User.Name IS NULL THEN 'CompanyName'
    ELSE User.Name
END )

Instead

( IFNULL(User.Name,'CompanyName' )

In terms of indexes, they increase SELECT performance by making lookups easier, but they slow down write operations as the indexes have to be updated as well. If your application isn't write-heavy, you should be indexing commonly searched columns, particularly in large tables.

In this query, it looks like you'd benefit by adding an index to DataSourceId, but I can't test if there's any gain. The primary keys will already be indexed.

Zurahn
+1  A: 

I would try the following:

  • the outer wrapper is completely useless, putting the ORDER BY in the inner query should work the same
  • try to rewrite the subqueries to be used as JOIN's
  • then move the WHERE-clauses to the relevant JOINS's so the intermediate resultset becomes smaller
  • look at the WHERE and JOIN's which indexes should be made.

A quick try (I'm not sure the result will be the same)

SELECT
    dsa.Status AS StatusCode,
    dsb.Id,
    dsl.UserId,
    dsl.Timestamp
    wf.Name AS WorkflowName,
    COALESCE(u.Name, 'CompanyName') AS `Username`
FROM 
    DataSource dsa
    INNER JOIN DataSource dsb
        ON  dsb.Id IN ( 0, 1, 2, 3, 4, 5, 6, 7, etc ))
        AND dsb.Status = '203' OR dsb.Status = '204'
    INNER JOIN DataSourceActivityLog dsl
        ON  dsl.DataSourceId=dsa.Id
        AND dsl.DataSourceId=dsb.Id
        AND dsl.UserId = 1
        AND dsl.Timestamp=(
            SELECT MAX(t2.Timestamp)
            FROM   DataSourceActivityLog AS dslt
            WHERE  dslt.DataSourceId = dsl.DataSourceId
        )
    INNER JOIN Workflow wf
                   ON  wf.Id = dsb.WorkflowId
    LEFT JOIN User u
        ON u.Id = dsl.UserId
GROUP  BY
    dsl.Id
ORDER  BY
    dsl.Timestamp DESC

Perhaps using the refactor of Zurahn to get rid of the GROUP BY in the subquery

With indexes on:

  • DataSource.WorkFlowId, DataSource.Status
  • DataSourceActivityLog.Timestamp, DataSourceActivityLog.UserId, DataSourceActivityLog.DataSourceId

Ok actually, I came to the conclusion that dsb (originally ParentLeafNodeDataSource) is actually the source of the data, and this could fill the WHERE clause. Personally I try to start with the source of the data, and then JOIN the rest against that. This will usually result in a query where it is easy to understand what is actually selected. Instead of the last JOIN suddenly cutting down on the resultset. So reordering the JOIN's could do that, and it would be something like:

SELECT
    dsa.Status AS StatusCode,
    dsb.Id,
    dsl.UserId,
    dsl.Timestamp
    wf.Name AS WorkflowName,
    COALESCE(u.Name, 'CompanyName') AS `Username`
FROM 
    DataSource dsb
    INNER JOIN Workflow wf
        ON  dsb.WorkflowId=wf.Id
    INNER JOIN DataSourceActivityLog dsl
        ON  dsl.DataSourceId=dsb.Id
        AND dsl.UserId=1
        AND dsl.Timestamp=(
            SELECT MAX(t2.Timestamp)
            FROM   DataSourceActivityLog AS dslt
            WHERE  dslt.DataSourceId = dsl.DataSourceId
        )
    INNER JOIN DataSource dsa
        ON  dsl.DataSourceId=dsa.Id
    LEFT JOIN User u
        ON dsl.UserId=u.Id
WHERE
    dsb.Id IN ( 0, 1, 2, 3, 4, 5, 6, 7, etc ))
    AND dsb.Status = '203' OR dsb.Status = '204'
GROUP  BY
    dsl.Id
ORDER  BY
    dsl.Timestamp DESC
ontrack
@ontrack - thanks for the insights. Definitely something to look into.
StackOverflowNewbie
@ontrack - tried cleaning up your SQL and ended up with this: http://pastebin.com/v57289YZ. phpMyAdmin complains: #1054 - Unknown column 'dsa.Id' in 'on clause'. Any idea what's wrong? Seems like it's saying that there is no dsa.Id, but I am sure it exist in the DB.
StackOverflowNewbie
It might be because of the order of JOIN's in my initial version, where the alias dsa does not exist yet at the point where you want to join against it. I made a few edits, please try my latest version(s)
ontrack
I'm actually wondering why you need to subselect / join DataSource again just to get the StatusCode. Just selecting the StatusCode from the first resultset probably always gives the same code because it is joined on Id. So the join with DataSource could be removed in my last example.
ontrack