tags:

views:

96

answers:

5

I have a table with 2.4M+ rows, and no indexes. I am 100% sure all the rows have one column (we'll call this id) that is unique, it is of type VARCHAR(255).

I now have a file of approximately 10,000 id's and need to pull the entire row for each.

Is using IN(...) my best option? Should I add an index?

I was thinking for some thinking of something like this:

SELECT * FROM archive_table WHERE id IN('id1', 'id2', ... 'idn');

This is effectively archived data and only accessed by me every few weeks.


System: MySQL 5.0.45 Table: MyISAM

A: 

Hell yes you should add an index. But if the id is a "primary key", then it already is an index.

GoatRider
+2  A: 

Add an index on the ID column and (optionally) define it as UNIQUE. This will help MySQL to quickly locate the rows you want, because the index contains the ID in sorted order. Even if your table were sorted, too, e. g. because you insert in increasing ID order, MySQL does not know that and will always do a full table scan to find the matching records for your queries.

With the index on the other hand, the search becomes very easy for the server. Only if you ask for really, really many rows at once (very long IN() clause), the optimizer might decide that you want more than about 30% of the data - in which case it will fall back to a linear scan again to prevent excessive disk seeking.

However with several million rows this would be hell of a long condition :)

I'd also recommend reconsidering if the column really has to be 255 characters long - even though VARCHAR will not use that much space when you don't need to, it sound like a questionable design. Whether it should be a numeric field or not may depend on your needs, however it is usually recommended.

Daniel Schneller
The 'id' column is a alpha-numeric string of variable length, however it gets nowhere near 255 characters long.
rjstelling
+3  A: 

Since you have a file of the ids you want, I recommend importing it into a work table and then joining the table to your production table to get the results you want. Of course , before you do anything, you need to implement an index strategy.

HLGEM
A: 

From what I've been led to understand by our DBA in the past, the "IN" clause has a limit on how many explicit IDs can be specified within the brackets. I was informed this doesn't apply if you can use a SELECT to feed the IN list.

HLGEM's suggestion of importing and using a join to the table is probably the easiest approach. And, as mentioned before, indexing will improve the performance.

Jay S
A: 

Yes, add an index over both tables (2.4mil and 10,000).

Assuming transaction_table is 10,000 rows, archive_table is 2.4mil rows and you've built an index overr archive_table you could code:

SELECT id
  FROM transaction_table a
 WHERE EXISTS( SELECT *
                 FROM archive_table b
                WHERE a.id = b.id )

Using an EXISTS clause over a JOIN is more readable and has the same performance as the join.

Paul Morgan