I have a table of the form

CREATE TABLE data
(
   pk INT PRIMARY KEY AUTO_INCREMENT,
   dt BLOB
);

It has about 160,000 rows and about 2 GB of data in the blob column (avg. 14 KB per blob). Another table has foreign keys into this table.

Something like 3000 of the blobs are identical. So what I want is a query that will give me a remap table that will allow me to remove the duplicates.

The naive approach took about an hour on 30-40k rows:

SELECT a.pk, MIN(b.pk)
  FROM data AS a
  JOIN data AS b
    ON a.dt = b.dt
 WHERE b.pk < a.pk
 GROUP BY a.pk;

I happen to have, for other reasons, a table that has the sizes of the blobs:

CREATE TABLE sizes
(
   fk INT,  -- note: non-unique
   sz INT
   -- other cols
);

By building one index on fk and another on sz, the direct query against that table takes about 24 sec with 50k rows:

SELECT da.pk, MIN(db.pk)
  FROM data AS da
  JOIN data AS db
  JOIN sizes AS sa
  JOIN sizes AS sb
    ON sa.sz = sb.sz
   AND da.pk = sa.fk
   AND db.pk = sb.fk
 WHERE sb.fk < sa.fk
   AND da.dt = db.dt
 GROUP BY da.pk;

However, that is doing a full table scan on da (the data table). Given that the hit rate should be fairly low, I'd think that an index scan would be better. With that in mind I added a 3rd copy of data as a 5th join to get that, and lost about 3 sec.
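A sketch of that variant (dc is the hypothetical third alias of data; the exact query may have differed):

SELECT dc.pk, MIN(db.pk)
  FROM sizes AS sa
  JOIN sizes AS sb
    ON sa.sz = sb.sz
   AND sb.fk < sa.fk
  JOIN data AS da ON da.pk = sa.fk
  JOIN data AS db ON db.pk = sb.fk
  JOIN data AS dc ON dc.pk = da.pk   -- 3rd copy of data, intended to be reached by PK lookup
 WHERE da.dt = db.dt
 GROUP BY dc.pk;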

OK so for the question: Am I going to get much better than the second select? If so, how?

A bit of a corollary is: if I have a table where the key columns get very heavy use but the rest are only rarely used, will I ever be better off adding another join of that table to encourage an index scan vs. a full table scan?


Xgc on #[email protected] points out that adding a utility table like sizes, but with a unique constraint on fk, might help a lot. Some fun with triggers and whatnot might even make it not too bad to keep up to date.
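A minimal sketch of that idea might look like the following (blob_info is a made-up name, and a real version would also need UPDATE and DELETE triggers):

-- One row per data.pk; names here are illustrative only.
CREATE TABLE blob_info
(
   fk INT PRIMARY KEY,   -- unique: exactly one row per blob
   sz INT,
   INDEX (sz)
);

-- Keep it in sync on insert; similar triggers would cover updates and deletes.
CREATE TRIGGER data_ai AFTER INSERT ON data
FOR EACH ROW
  INSERT INTO blob_info (fk, sz) VALUES (NEW.pk, LENGTH(NEW.dt));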

+6  A: 

You can always use a hashing function (MD5, SHA1) on your data and then compare the hashes. The question is whether you can save the hash in your database?

Baget
+1: I agree with this. If you have to do a byte-for-byte comparison every time, the query is going to be bogged down. Make sure the code that adds records also hashes them, and generate hashes for all existing rows. Now you need only compare the blob sizes and hashes.
Bork Blatt
+1: seems the most efficient way (at the very least, it will narrow down the number of BLOBs you need to actually compare).
scraimer
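For completeness, a minimal sketch of the hashing idea (the dt_hash column and the use of MySQL's MD5() here are assumptions; the hash could just as well be computed by the application when rows are written):

-- Add and backfill a hash column (illustrative names).
ALTER TABLE data ADD COLUMN dt_hash CHAR(32), ADD INDEX (dt_hash);
UPDATE data SET dt_hash = MD5(dt);

-- Build the remap table by comparing hashes first, then the blobs themselves.
SELECT a.pk, MIN(b.pk)
  FROM data AS a
  JOIN data AS b
    ON a.dt_hash = b.dt_hash
   AND b.pk < a.pk
   AND a.dt = b.dt      -- guard against hash collisions
 GROUP BY a.pk;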