views:

55

answers:

2

I'm building a MySQL database which contains entries about special substrings of DNA in species of yeast. My table looks like this:

+--------------+---------+------+-----+---------+-------+
| Field        | Type    | Null | Key | Default | Extra |
+--------------+---------+------+-----+---------+-------+
| species      | text    | YES  | MUL | NULL    |       |
| region       | text    | YES  | MUL | NULL    |       |
| gene         | text    | YES  | MUL | NULL    |       |
| startPos     | int(11) | YES  |     | NULL    |       |
| repeatLength | int(11) | YES  |     | NULL    |       |
| coreLength   | int(11) | YES  |     | NULL    |       |
| sequence     | text    | YES  | MUL | NULL    |       |
+--------------+---------+------+-----+---------+-------+

There are approximately 1.8 million records. In one type of query I want to see how many DNA substrings are associated with each type of species and region, so I issue this query:

select species, region, count(*) group by species, region;

The species and region columns have only two possible entries (conserved/scer for species, and promoter/coding for region) yet this query takes about 30 seconds.

Is this a normal amount of time to expect for this type of query given the size of the table? Is it slow because I'm using text fields instead of simple integer or boolean values (I prefer text fields as several non-CS researchers will be using the DB). Any other ideas and suggestions would be welcome.

Please excuse if this is a boneheaded question, I am an SQL neophyte.

P.S. I've also seen this question but the proposed solution doesn't seem relevant for what I'm doing.

EDIT: Converting those fields to VARCHARs reduced the runtime to ~2.5 seconds. Note I also timed it against ENUMs which had a similar timing.

+3  A: 

If your fields are only ever going to have 2 values, you're much better off making them booleans. You should also make everything NOT NULL unless there's a real reason you'll need it to be NULL.

Also take a look at the ENUM type for a better way to use a finite number of human-readable values for a column.

As for slowness, the first thing to try is to create indices on your columns. For the particular query you're showing here, an index on species, region should make a huge difference:

create index on mytablename (species, region);

should do it.

Vineet
Are you sure the index will make a huge difference with such low-cardinality data?
Daniel Vassallo
No, I'm not sure of it, but I think it's a good guess. I started writing some about using `EXPLAIN`, but it started to turn into a can of worms. And I guessed the end result would probably be that we should try creating an index anyway.
Vineet
I tried the index, but it made no difference. I also tried VARCHAR as OMG Ponies suggested which was much faster. After that I tried it against enums with no noticeable speedup from VARCHARs.
Rich
Also, I just checked out EXPLAIN; very cool!
Rich
+1 for make it `NOT NULL`, -1 for saying `make it boolean`. Something I read in a SQL newsgroup years ago: "I usually use a 1 character flag. In my experience, when I initially think I have a binary status, it is really a multi-varied flag. In other words, as soon as I code to tell if the door is open or closed, someone else wants to know if it's locked."
onedaywhen
+5  A: 

Why're all your string based columns defined as TEXT? If you read the performance comparison, you'll see that TEXT was ~3x slower than a VARCHAR column using identical indexing: http://forums.mysql.com/read.php?24,105964,105964

OMG Ponies
Good catch. Didn't notice they were `text`.
Daniel Vassallo
I did TEXT because a colleague of mine said there wouldn't be any difference between that and VARCHAR. :) Using a VARCHAR took my runtime from 33 seconds to 2.5.
Rich
@Rich: Wow - wasn't expecting such a dramatic difference. You might get lower if you changed the species and region columns to be foreign keys to tables holding their respective values. An INT is always 4 bytes, while a VARCHAR(4) is 5 so you can imagine how many bytes VARCHAR(100) is.
OMG Ponies