ansaurus

Question

Finding duplicate records in mysql based on a bitmask.

Answer 1

A:

Break the mask bits out in real columns. RDMBs don't like bit fields.

msw 2010-08-02 19:01:54

Ideally these would be a table-to-table relationship, but redesigning the structure of the DB at this juncture is not possible.

JYelton 2010-08-02 20:37:33

Answer 2

A:

Your join results in a cartesian product of the table with itself. Add `t1.name=t2.name' to the join, giving a bunch of (much) smaller cartesian products, one per unique name, which will speed things up considerably.

Jim Garrison 2010-08-02 19:12:59

Answer 3

+2 A:

Since the database structure was not designed with the realities of RDBMs in mind (probably not your doing, I just have to make the point anyway…), the performance will always be poor, though it is possible to write a query that will finish.

Jim is correct in that the query results in a cartesian product. If that query were to be returned ungrouped and unfiltered, you could expect (SELECT POW(COUNT(*), 2) FROM data) results. Also, any form of outer join is unnecessary, so a standard inner join is what you want here (not that it ought to make a difference in terms of performance, it's just more appropriate).

Also another condition of the join, t1.id != t2.id is necessary, lest each record match itself.

SELECT t1.id, t1.name
FROM data t1
JOIN data t2
     ON t1.name = t2.name
     AND t1.id != t2.id     //
WHERE CONV(t1.mask, 2, 10) & CONV(t2.mask, 2, 10) > 0
GROUP BY t1.name
HAVING COUNT(*) > 1;

Your incomplete query:

SELECT t1.id, t1.name, t1.mask
FROM data t1
WHERE t1.name IN (SELECT t2.name FROM data t2 GROUP BY t2.name HAVING COUNT(*) > 1);

SELECT t1.id, t1.name, t1.mask
FROM data t1
WHERE EXISTS (SELECT 1 FROM data t2 WHERE t2.name = t1.name GROUP BY t2.name HAVING COUNT(*) > 1);

Off the top of my head I can't tell you which of those would perform best. If data.name is indexed (and I would hope it is), the cost for either query ought to be rather low. The former will cache a copy of the subselect, whereas the latter will perform multiple queries against the index.

One very basic optimization (while leaving the table structure as a whole untouched) would be to convert the mask field to an unsigned integer data type, thereby saving many calls to CONV().

WHERE CONV(t1.mask, 2, 10) & CONV(t2.mask, 2, 10) > 0

becomes

WHERE t1.mask & t2.mask > 0

Of course, breaking the data down further does make more sense. Instead of storing a bitmask in one record, break out all the ones bits into separate records

id    name     mask
==    ====     ====
11    alpha    1101

would become

id    name    value
==    ====    =====
11    alpha       1
12    alpha       4
13    alpha       8

Now, a strategically placed index on name and value makes the query a piece of cake

SELECT name, value
FROM data
GROUP BY name, value
HAVING COUNT(*) > 1;

I hope that this helps.

Ryan Tenney 2010-08-04 07:40:44

ansaurus

tags:

views:

answers:

Finding duplicate records in mysql based on a bitmask.

related questions