The table contains about 40,000,000 records and is defined as:

CREATE TABLE `event` (
  `id` bigint(20) unsigned NOT NULL auto_increment,
  `some_other_id_not_fk` int(10) unsigned NOT NULL,
  `event_time` datetime NOT NULL,
  `radius` float default NULL,
  `how_heavy` smallint(6) default NULL,
  PRIMARY KEY  (`id`),
  KEY `event_some_other_id_not_fk` (`some_other_id_not_fk`),
  KEY `event_event_time` (`event_time`)
) ENGINE=MyISAM AUTO_INCREMENT=6506226 DEFAULT CHARSET=utf8 

You should know that the some_other_id_not_fk column has very low cardinality: it contains only 7 distinct values. The real pain is the event_time datetime column, as it contains an extremely large number of distinct datetimes, and basically everything is allowed: duplicates, as well as unpredictably large time intervals with no records to 'cover' them. You should also know that the (some_other_id_not_fk, event_time) pair must be allowed to have duplicates too :( I know this causes even more problems :(

I've had some experience optimizing MySQL tables, but a pain this big had never appeared on my horizon before :/

The current state of things is:

  • The selects by event_time between date1 and date2 (which I need to do) are satisfactorily fast. :)
  • My inserts are slow, I mean really SLOW! More than 30 seconds each. Even worse: LOAD DATA procedures that temporarily DISABLE and ENABLE KEYS are EXTREMELY slow (several hours), mostly in the ENABLE KEYS step (see the sketch after this list).
  • The size of the index on disk is 7 times the size of the data.
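
For reference, a minimal sketch of the two workloads above (the date literals, CSV path, and column order are placeholders, not my real values):

-- the range select that is still fast
SELECT * FROM event
WHERE event_time BETWEEN '2009-04-01 00:00:00' AND '2009-05-01 00:00:00';

-- the bulk load that takes hours, almost entirely in ENABLE KEYS
ALTER TABLE event DISABLE KEYS;
LOAD DATA INFILE '/tmp/new_events.csv'
  INTO TABLE event
  FIELDS TERMINATED BY ','
  (some_other_id_not_fk, event_time, radius, how_heavy);
ALTER TABLE event ENABLE KEYS;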

I would have tried several different combinations of re-indexing by now, but the sheer size of the data prevents me from experimenting with dropping and creating indexes and columns at will.

Please help, has anyone managed something like this? Would using TIMESTAMP instead of DATETIME solve my problem? Or maybe I should add additional columns for day, year, etc. and index on them?

A: 

I had a similar situation earlier. I created a table with the same structure, let's call it the archive table. I copied the data from the active table into it every day at 3:00 and then deleted ALL the original records.

The graphs and other statistics were selected from the archive table; the current events were logged to the active one.

Maybe it is not best practice, but it worked well enough for me.
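
A rough sketch of the nightly job (the archive table name and the cutoff logic are illustrative, not my exact script):

-- event_archive has the same structure as event
SET @cutoff = NOW();
INSERT INTO event_archive
  SELECT * FROM event WHERE event_time < @cutoff;
DELETE FROM event WHERE event_time < @cutoff;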

Partition the table by time: Partitioning with Dates in MySQL 5.1 (Robin Schumacher)

http://dev.mysql.com/tech-resources/articles/mysql_5.1_partitioning_with_dates.html
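
For example, a sketch of range partitioning by date in 5.1 (note that 5.1 requires the partitioning column to appear in every unique key, so the primary key must become (id, event_time); the partition boundaries here are illustrative):

CREATE TABLE event_partitioned (
  `id` bigint(20) unsigned NOT NULL auto_increment,
  `some_other_id_not_fk` int(10) unsigned NOT NULL,
  `event_time` datetime NOT NULL,
  `radius` float default NULL,
  `how_heavy` smallint(6) default NULL,
  PRIMARY KEY (`id`, `event_time`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
PARTITION BY RANGE (TO_DAYS(event_time)) (
  PARTITION p2008 VALUES LESS THAN (TO_DAYS('2009-01-01')),
  PARTITION p2009 VALUES LESS THAN (TO_DAYS('2010-01-01')),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);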

Notinlist
You can also try opening a new table every year or month. Oracle has a similar feature called time-based partitioning.
Notinlist
Yes Notinlist, thank you! I will turn to this solution as a last resort, because it would cost me a large application redesign effort.
PatlaDJ
+1  A: 

I think you don't need an index on some_other_id_not_fk (as you said, there are just 7 distinct values, so each value matches roughly 40,000,000/7 rows; the index has almost no selectivity). All you need is one index on (event_time) or perhaps (event_time, some_other_id_not_fk).
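
A sketch of that change (the new index name idx_time_id is just an example):

ALTER TABLE event
  DROP INDEX event_some_other_id_not_fk,
  DROP INDEX event_event_time,
  ADD INDEX idx_time_id (event_time, some_other_id_not_fk);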

a1ex07
Thank you for this, I will try. Do you think that an index on `event_time` alone will greatly reduce the size of the index? It is now 7 GB, against a data size of only 1 GB. I just can't believe such a huge index is needed. Is something wrong in my approach? Logic tells me I don't need two long pages of index writing to note that my red sweaters are in drawer N1, blue sweaters in N2, and green ones in N3; yet as it stands, to find my red sweater #11 I'd have to read several pages of indexing data just to learn that I only need to search drawer N1!
PatlaDJ
It will definitely reduce the size of the indexes; I'd estimate to about 30% of your 7 GB. When you create an index on field1, the DB engine stores the index value plus a unique record identifier for each record in the table.
a1ex07
+2  A: 

I think your intuition about what's heavy and what's not is backwards: an index with many repetitions of a few distinct values is much worse than an index with many distinct values and few repetitions of each.

My suggestion: drop the index on some_other_id_not_fk and use a compound (some_other_id_not_fk, event_time) instead. This compound index should be 'almost unique', making the insert overhead much lower. If possible, drop the event_time key too, unless you have queries that use that field without some_other_id_not_fk.

Edit: you say that you have to select by time interval, so keep (event_time, some_other_id_not_fk) and drop both single-column indexes, event_time and some_other_id_not_fk. If you have queries that use some_other_id_not_fk but not event_time, then keep both (event_time, some_other_id_not_fk) and (some_other_id_not_fk, event_time). The point is not to have any index with only a few distinct values; having an index with unused fields on the right is OK.
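
You can verify which index a given query actually uses with EXPLAIN (the literal values here are placeholders):

EXPLAIN SELECT * FROM event
WHERE some_other_id_not_fk IN (1, 2, 6)
  AND event_time >  '2009-04-01 00:00:00'
  AND event_time <= '2009-04-01 01:00:00';

The key column of the output names the chosen index; NULL means a full table scan.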

Javier
Your answer was extremely good and helpful. My next few hours will be spent waiting to see how re-indexing performs after dropping all indexes and adding only (`event_time`, `some_other_id_not_fk`). I hope my only problem is the solo index on `some_other_id_not_fk`. And yes, my queries filter on both `event_time` and `some_other_id_not_fk`; sometimes `some_other_id_not_fk` gets filtered with IN(1,2,6,x,...). I just didn't think that a solo index on `some_other_id_not_fk` could have such a huge impact on index size and overhead given that it has so few distinct values. Mamma mia!
PatlaDJ
+3  A: 
`id` bigint(20) unsigned NOT NULL auto_increment,

Do you really need a BIGINT? You can probably get away with an INT. If you were to insert 1,000 rows per second 24 hours a day, it would take 136 years for you to exhaust all values in an unsigned 32-bit integer.

This change will decrease your table size by 152.5 MB for 40 million rows, and will decrease the size of your primary key index by 158.8 MB for 40 million rows.

`some_other_id_not_fk` int(10) unsigned NOT NULL,

You state this has only 7 distinct values. Does it need to be an INT type then? Could you use TINYINT instead? This will drastically reduce index size.

This will decrease the size of your table by 114.4 MB for 40 million rows, and will decrease the size of the some_other_id_not_fk index by approximately the same.

`event_time` datetime NOT NULL,

Do you need a DATETIME? A DATETIME takes 8 bytes, a TIMESTAMP takes 4 bytes. If you can use a TIMESTAMP then this will drastically reduce data and index size. Be aware of the limitations of TIMESTAMP fields though, such as Y2K38 and how they behave with respect to time zones and replication.

This change will decrease your table size by 152.5 MB for 40 million rows, and will decrease the size of your primary key index by 158.8 MB for 40 million rows.

These three changes will significantly reduce the size of your data as well as the indices.
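
All three changes can be made in a single rebuild; a sketch (on MyISAM this copies the whole table, so expect it to take a while on 40 million rows, and mind the automatic-initialization quirks of the first TIMESTAMP column in 5.x):

ALTER TABLE event
  MODIFY `id` int(10) unsigned NOT NULL auto_increment,
  MODIFY `some_other_id_not_fk` tinyint(3) unsigned NOT NULL,
  MODIFY `event_time` timestamp NOT NULL;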

Total Space Savings

  • Table: 152.5 + 152.5 + 114.4 = 419.4 MB
  • Index: 158.8 + 158.8 + ~115 = 432.6 MB

Total: 852 MB

As others have suggested, you may not even need all the indices that you have defined. With such low selectivity on some_other_id_not_fk there's a good chance the query optimizer won't even use that index and will instead opt for a full table scan. Dropping this index completely would result in significant space savings for your indices.

If you could provide some sample queries, I can help you further.

Also, are you inserting into this table under a heavy read load? Keep in mind that SELECTs in MyISAM will block an INSERT.

Update

Most people are suggesting moving your some_other_id_not_fk field into the event_time index so the new index would be on (event_time, some_other_id_not_fk). I will recommend the same, but with an important caveat.

This index will be good for queries where you are filtering only on event_time, or if you filter on both event_time and some_other_id_not_fk. It will not be used for queries filtering only on some_other_id_not_fk - a full table scan will occur.

Moreover, if your queries are always filtering on both event_time and some_other_id_not_fk then do not use the index order of (event_time, some_other_id_not_fk). Rather, you should use the index (some_other_id_not_fk, event_time) instead.

Having the least selective (most duplicates) field first will allow for much greater compression for your index and thus a significantly reduced footprint on disk.
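
A sketch of that ordering (the index name is illustrative):

ALTER TABLE event
  ADD INDEX idx_id_time (some_other_id_not_fk, event_time);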

hobodave
Thank you very much! "Pure class" :)
PatlaDJ
I will need some time before closing this thread by choosing the right answer, maybe several hours. I need to wait for the table to re-index so I can see the result and post my own final comments.
PatlaDJ
I concur on usually "having the least selective field first", but it's not that simple. PatlaDJ mentions he uses IN (x,y,...) within a datetime range; this kind of query is better served by having the range first. But probably in this extreme case of non-selectivity (7 values for 40 million rows!) it could be the other way around. Still, getting rid of such a useless solo index (7 values for 40 million rows!) will definitely improve INSERT times.
Javier
A: 

I've dropped all indexes and created a single index on (event_time, some_other_id_not_fk). I get the following performance indicators:

  • 1 GB data size on disk, 1.2 GB index size on disk.

  • delete from event where event.event_time>STR_TO_DATE('20091201000000','%Y%m%d%H%i%s') and event.some_other_id_not_fk=4 | Affected rows: 353543 Time: 65.173 seconds

  • select * from event where event.event_time>STR_TO_DATE('20090401000000','%Y%m%d%H%i%s') and event.event_time<=STR_TO_DATE('20090401010000','%Y%m%d%H%i%s') and event.some_other_id_not_fk in (22,4,1,3) | 916 rows in set, Query time: 0.030 seconds

  • insertion of 350,000 new records with the index enabled, using the following format: insert into event VALUES(...),(...),... | performed in about 30 seconds, Yeahaaaaaa :))

  • index disable - insertion - index enable of 350,000 new records using the same format | performed in about 40 minutes. :) It looks like the MySQL default dump approach, disabling indexes before the inserts and re-enabling them afterwards, is not always good for performance, especially when large indexes are present. :)

For now I am satisfied with this performance.

Last evening I managed to create an index on (event_time) alone. The size of the index was slightly smaller than in the first setup, about 1.1 GB. The performance of the same queries listed above:

  • the delete | slightly faster, about 30 seconds
  • the select | slightly slower, about 0.1 seconds.
  • I've only tested the index disable - insertion - enable path, again with 350,000 records. It was very slow again | about 35 minutes.

    I rejected this state of the database because I wasn't satisfied with the select speed, which is priority N1 for me.

hobodave, I am just curious: do you think that making the index (some_other_id_not_fk, event_time) instead of (event_time, some_other_id_not_fk) will really change something dramatically for the better? My queries will ALWAYS filter on both fields. I shall NEVER have a query that does not filter by some_other_id_not_fk. But I may have queries whose IN(x,y,...) covers most of the distinct some_other_id_not_fk values. As I said, there are not many of them.

My priorities are:

  1. select speed
  2. insert speed
  3. index size on disk (as the table will grow several times more)
    ... everything else

And I also wonder why such a huge index is required: 1.2 GB of index for 1 GB of data? The index is still bigger than the data. My logic suggests that this kind of date indexing could be done with a much smaller index. Am I correct? Is there something related to the index type, which is presumably BTREE?
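
My own rough back-of-the-envelope, assuming MyISAM uses its default 6-byte row pointers, suggests it may simply be inherent to a BTREE of this shape:

  8 bytes (datetime) + 4 bytes (int) + 6 bytes (row pointer) = ~18 bytes per entry
  18 bytes * 40,000,000 rows = ~720 MB, before B-tree page overhead and fill factor

which would make 1.2 GB plausible.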

Thank you. You're all great. I am closing the thread.

PatlaDJ