How might I optimize this query? The schema:
mysql> show columns from transactionlog;
+---------------+-------------------------------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------------+-------------------------------------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| transactionid | varchar(10) | NO | MUL | NULL | |
| queryid | tinyint(4) | NO | | NULL | |
| tableid | varchar(30) | NO | MUL | NULL | |
| tupleid | int(11) | NO | | NULL | |
| querytype | enum('select','insert','delete','update') | NO | | NULL | |
| schemaname | varchar(20) | YES | | NULL | |
| partition | tinyint(3) unsigned | YES | | NULL | |
+---------------+-------------------------------------------+------+-----+---------+----------------+
8 rows in set (0.04 sec)
The query:
select concat(weight, ' ', ids, '\n')
from (
select
tableid,
tupleid,
group_concat(id separator ' ') as ids,
(
select count(distinct transactionid)
from transactionlog
where transactionid in (
select transactionid
from transactionlog
where (tableid, tupleid, querytype) =
(t.tableid, t.tupleid, 'update')
group by transactionid
having count(*) > 0
)
) weight
from transactionlog t
group by tableid, tupleid
having weight > 0 and count(*) > 1
) u;
This is the output of EXPLAIN and mk-visual-explain:
+----+--------------------+----------------+-------+---------------+---------------+---------+-----------+------+------------------------------
----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+----------------+-------+---------------+---------------+---------+-----------+------+----------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 13 | |
| 2 | DERIVED | t | ALL | NULL | NULL | NULL | NULL | 68 | Using filesort |
| 3 | DEPENDENT SUBQUERY | transactionlog | index | NULL | transactionid | 12 | NULL | 68 | Using where; Using index |
| 4 | DEPENDENT SUBQUERY | transactionlog | ref | tableid | tableid | 36 | func,func | 2 | Using where; Using temporary; Using filesort |
+----+--------------------+----------------+-------+---------------+---------------+---------+-----------+------+----------------------------------------------+
Table scan
rows 13
+- DERIVED
table derived(t,transactionlog,temporary(transactionlog))
+- DEPENDENT SUBQUERY
+- DEPENDENT SUBQUERY
| +- Filesort
| | +- TEMPORARY
| | table temporary(transactionlog)
| | +- Filter with WHERE
| | +- Bookmark lookup
| | +- Table
| | | table transactionlog
| | | possible_keys tableid
| | +- Index lookup
| | key transactionlog->tableid
| | possible_keys tableid
| | key_len 36
| | ref func,func
| | rows 2
| +- Filter with WHERE
| +- Index scan
| key transactionlog->transactionid
| key_len 12
| rows 68
+- Filesort
+- Table scan
rows 68
+- Table
table t
That is a lot of work. I can write the equivalent logic in Python while making a single pass:
results = query("""
select tableid, tupleid, transactionid, id, querytype
from transactionlog_2warehouse
""")
_tab, _tup = None
ids = []
weight = 0
saw_upd = False
for tab, tup, txn, id, qt in results:
if (_tab, _tup) != (tab, tup):
if len(ids) > 1 and weight > 0:
print weight, ids
weight = 0
ids = []
_txn = None
if _txn != txn:
saw_upd = False
if qt == 'update' and not saw_upd:
weight += 1
saw_upd = True
ids += [id]
Is it possible to achieve the Python single-pass performance using pure SQL? Thanks in advance!