The table contains about 40,000,000 records and is defined as:

CREATE TABLE `event` (
  `id` bigint(20) unsigned NOT NULL auto_increment,
  `some_other_id_not_fk` int(10) unsigned NOT NULL,
  `event_time` datetime NOT NULL,
  `radius` float default NULL,
  `how_heavy` smallint(6) default NULL,
  PRIMARY KEY  (`id`),
  KEY `event_some_other_id_not_fk` (`some_other_id_not_fk`),
  KEY `event_event_time` (`event_time`)
) ENGINE=MyISAM AUTO_INCREMENT=6506226 DEFAULT CHARSET=utf8 

You should know that the some_other_id_not_fk column has very low cardinality: it contains only 7 distinct values. The real pain is the event_time datetime column, as it contains an extremely large number of distinct datetimes, and basically everything is allowed: duplicates, as well as unpredictably large time intervals with no records to 'cover' them. You should also know that the (some_other_id_not_fk, event_time) pair must be allowed to have duplicates too :( I know this causes even more problems :(

I've had some experience optimizing MySQL tables, but a pain this big had never appeared on my horizon before :/

The current state of things is:

  • The selects by event_time between date1 and date2 (which I need to do) are satisfactorily fast. :)
  • My inserts are slow, I mean really SLOW! More than 30 seconds each. Even worse: LOAD DATA procedures that temporarily DISABLE and ENABLE KEYS are EXTREMELY slow (several hours), mostly in the ENABLE KEYS step (see the sketch after this list).
  • The size of the index on disk is 7 times the size of the data.
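
For reference, a minimal sketch of the two workloads above (the date literals, CSV path, and column order are placeholders, not my real values):

-- the range select that is still fast
SELECT * FROM event
WHERE event_time BETWEEN '2009-04-01 00:00:00' AND '2009-05-01 00:00:00';

-- the bulk load that takes hours, almost entirely in ENABLE KEYS
ALTER TABLE event DISABLE KEYS;
LOAD DATA INFILE '/tmp/new_events.csv'
  INTO TABLE event
  FIELDS TERMINATED BY ','
  (some_other_id_not_fk, event_time, radius, how_heavy);
ALTER TABLE event ENABLE KEYS;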

I would have tried several different combinations of re-indexing by now, but the sheer size of the data prevents me from experimenting with dropping and creating indexes and columns at will.

Please help, has anyone managed something like this? Would using TIMESTAMP instead of DATETIME solve my problem? Or maybe I should add additional columns for day, year, etc. and index on them?

A: 

I had a similar situation earlier. I created a table with the same structure, let's call it the archive table. I copied the data from the active table into it every day at 3:00 and then deleted ALL the original records.

The graphs and other statistics were selected from the archive table; the current events were logged to the active one.

Maybe it is not best practice, but it worked well enough for me.
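
A rough sketch of the nightly job (the archive table name and the cutoff logic are illustrative, not my exact script):

-- event_archive has the same structure as event
SET @cutoff = NOW();
INSERT INTO event_archive
  SELECT * FROM event WHERE event_time < @cutoff;
DELETE FROM event WHERE event_time < @cutoff;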

Partition the table by time: Partitioning with Dates in MySQL 5.1 (Robin Schumacher)

http://dev.mysql.com/tech-resources/articles/mysql_5.1_partitioning_with_dates.html
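
For example, a sketch of range partitioning by date in 5.1 (note that 5.1 requires the partitioning column to appear in every unique key, so the primary key must become (id, event_time); the partition boundaries here are illustrative):

CREATE TABLE event_partitioned (
  `id` bigint(20) unsigned NOT NULL auto_increment,
  `some_other_id_not_fk` int(10) unsigned NOT NULL,
  `event_time` datetime NOT NULL,
  `radius` float default NULL,
  `how_heavy` smallint(6) default NULL,
  PRIMARY KEY (`id`, `event_time`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
PARTITION BY RANGE (TO_DAYS(event_time)) (
  PARTITION p2008 VALUES LESS THAN (TO_DAYS('2009-01-01')),
  PARTITION p2009 VALUES LESS THAN (TO_DAYS('2010-01-01')),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);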

Notinlist
You can also try opening a new table every year or month. Oracle has a similar feature called time-based partitioning.
Notinlist
Yes Notinlist, thank you! I will turn to this solution as a last resort, because it would cost me a large application redesign effort.
PatlaDJ
+1  A: 

I think you don't need an index on some_other_id_not_fk (as you said, there are just 7 distinct values, so each value matches roughly 40,000,000/7 rows; the index has almost no selectivity). All you need is one index on (event_time) or perhaps (event_time, some_other_id_not_fk).
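
A sketch of that change (the new index name idx_time_id is just an example):

ALTER TABLE event
  DROP INDEX event_some_other_id_not_fk,
  DROP INDEX event_event_time,
  ADD INDEX idx_time_id (event_time, some_other_id_not_fk);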

a1ex07
Thank you for this, I will try. Do you think that an index on `event_time` alone will greatly reduce the size of the index? It is now 7 GB, against a data size of only 1 GB. I just can't believe such a huge index is needed. Is something wrong in my approach? Logic tells me I don't need two long pages of index writing to note that my red sweaters are in drawer N1, blue sweaters in N2, and green ones in N3; yet as it stands, to find my red sweater #11 I'd have to read several pages of indexing data just to learn that I only need to search drawer N1!
PatlaDJ
It will definitely reduce the size of the indexes; I'd estimate to about 30% of your 7 GB. When you create an index on field1, the DB engine stores the index value plus a unique record identifier for each record in the table.
a1ex07
+2  A: 

I think your intuition about what's heavy and what's not is backwards: an index with many repetitions of a few distinct values is much worse than an index with many distinct values and few repetitions of each.

My suggestion: drop the index on some_other_id_not_fk and use a compound (some_other_id_not_fk, event_time) instead. This compound index should be 'almost unique', making the insert overhead much lower. If possible, drop the event_time key too, unless you have queries that use that field without some_other_id_not_fk.

Edit: you say that you have to select by time interval, so keep (event_time, some_other_id_not_fk) and drop both single-column indexes, event_time and some_other_id_not_fk. If you have queries that use some_other_id_not_fk but not event_time, then keep both (event_time, some_other_id_not_fk) and (some_other_id_not_fk, event_time). The point is not to have any index with only a few distinct values; having an index with unused fields on the right is OK.
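
You can verify which index a given query actually uses with EXPLAIN (the literal values here are placeholders):

EXPLAIN SELECT * FROM event
WHERE some_other_id_not_fk IN (1, 2, 6)
  AND event_time >  '2009-04-01 00:00:00'
  AND event_time <= '2009-04-01 01:00:00';

The key column of the output names the chosen index; NULL means a full table scan.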

Javier
Your answer was extremely good and helpful. My next few hours will be spent waiting to see how re-indexing performs after dropping all indexes and adding only (`event_time`, `some_other_id_not_fk`). I hope my only problem is the solo index on `some_other_id_not_fk`. And yes, my queries filter on both `event_time` and `some_other_id_not_fk`; sometimes `some_other_id_not_fk` gets filtered with IN(1,2,6,x,...). I just didn't think that a solo index on `some_other_id_not_fk` could have such a huge impact on index size and overhead given that it has so few distinct values. Mamma mia!
PatlaDJ
+3  A: 
`id` bigint(20) unsigned NOT NULL auto_increment,

Do you really need a BIGINT? You can probably get away with an INT. If you were to insert 1,000 rows per second 24 hours a day, it would take 136 years for you to exhaust all values in an unsigned 32-bit integer.

This change will decrease your table size by 152.5 MB for 40 million rows, and will decrease the size of your primary key index by 158.8 MB for 40 million rows.

`some_other_id_not_fk` int(10) unsigned NOT NULL,

You state this has only 7 distinct values. Does it need to be an INT type then? Could you use TINYINT instead? This will drastically reduce index size.

This will decrease the size of your table by 114.4 MB for 40 million rows, and will decrease the size of the some_other_id_not_fk index by approximately the same.

`event_time` datetime NOT NULL,

Do you need a DATETIME? A DATETIME takes 8 bytes, a TIMESTAMP takes 4 bytes. If you can use a TIMESTAMP then this will drastically reduce data and index size. Be aware of the limitations of TIMESTAMP fields though, such as Y2K38 and how they behave with respect to time zones and replication.

This change will decrease your table size by 152.5 MB for 40 million rows, and will decrease the size of your primary key index by 158.8 MB for 40 million rows.

These three changes will significantly reduce the size of your data as well as the indices.
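
All three changes can be made in a single rebuild; a sketch (on MyISAM this copies the whole table, so expect it to take a while on 40 million rows, and mind the automatic-initialization quirks of the first TIMESTAMP column in 5.x):

ALTER TABLE event
  MODIFY `id` int(10) unsigned NOT NULL auto_increment,
  MODIFY `some_other_id_not_fk` tinyint(3) unsigned NOT NULL,
  MODIFY `event_time` timestamp NOT NULL;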

Total Space Savings

  • Table: 152.5 + 152.5 + 114.4 = 419.4 MB
  • Index: 158.8 + 158.8 + ~115 = 432.6 MB

Total: 852 MB

As others have suggested, you may not even need all the indices that you have defined. With such low selectivity on some_other_id_not_fk there's a good chance the query optimizer won't even use that index and will instead opt for a full table scan. Dropping this index completely would result in significant space savings for your indices.

If you could provide some sample queries, I can help you further.

Also, are you inserting into this table under a heavy read load? Keep in mind that SELECTs in MyISAM will block an INSERT.

Update

Most people are suggesting moving your some_other_id_not_fk field into the event_time index so the new index would be on (event_time, some_other_id_not_fk). I will recommend the same, but with an important caveat.

This index will be good for queries where you are filtering only on event_time, or if you filter on both event_time and some_other_id_not_fk. It will not be used for queries filtering only on some_other_id_not_fk - a full table scan will occur.

Moreover, if your queries are always filtering on both event_time and some_other_id_not_fk then do not use the index order of (event_time, some_other_id_not_fk). Rather, you should use the index (some_other_id_not_fk, event_time) instead.

Having the least selective (most duplicates) field first will allow for much greater compression for your index and thus a significantly reduced footprint on disk.
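
A sketch of that ordering (the index name is illustrative):

ALTER TABLE event
  ADD INDEX idx_id_time (some_other_id_not_fk, event_time);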

hobodave
Thank you very much! "Pure class" :)
PatlaDJ
I will need some time before closing this thread by choosing the right answer, maybe several hours. I need to wait for the table to re-index so I can see the result and post my own final comments.
PatlaDJ
I concur on usually "having the least selective field first", but it's not that simple. PatlaDJ mentions he uses IN (x,y,...) within a datetime range; this kind of query is better served by having the range first. But probably in this extreme case of non-selectivity (7 values for 40 million rows!) it could be the other way around. Still, getting rid of such a useless solo index (7 values for 40 million rows!) will definitely improve INSERT times.
Javier
A: 

I've dropped all indexes and created a single index on (event_time, some_other_id_not_fk). I get the following performance indicators:

  • 1 GB data size on disk, 1.2 GB index size on disk.

  • delete from event where event.event_time>STR_TO_DATE('20091201000000','%Y%m%d%H%i%s') and event.some_other_id_not_fk=4 | Affected rows: 353543 Time: 65.173 seconds

  • select * from event where event.event_time>STR_TO_DATE('20090401000000','%Y%m%d%H%i%s') and event.event_time<=STR_TO_DATE('20090401010000','%Y%m%d%H%i%s') and event.some_other_id_not_fk in (22,4,1,3) | 916 rows in set, Query time: 0.030 seconds

  • insertion of 350,000 new records with the index enabled, using the following format: insert into event VALUES(...),(...),... | performed in about 30 seconds, Yeahaaaaaa :))

  • index disable - insertion - index enable of 350,000 new records using the same format | performed in about 40 minutes. :) It looks like the MySQL default dump approach, disabling indexes before the inserts and re-enabling them afterwards, is not always good for performance, especially when large indexes are present. :)

For now I am satisfied with this performance.

Last evening I managed to create an index on (event_time) alone. The size of the index was slightly smaller than in the first setup, about 1.1 GB. The performance of the same queries listed above:

  • the delete | slightly faster, about 30 seconds
  • the select | slightly slower, about 0.1 seconds.
  • I've only tested the index disable - insertion - enable path, again with 350,000 records. It was very slow again | about 35 minutes.

    I rejected this state of the database because I wasn't satisfied with the select speed, which is priority N1 for me.

hobodave, I am just curious: do you think that making the index (some_other_id_not_fk, event_time) instead of (event_time, some_other_id_not_fk) will really change something dramatically for the better? My queries will ALWAYS filter on both fields. I shall NEVER have a query that does not filter by some_other_id_not_fk. But I may have queries whose IN(x,y,...) covers most of the distinct some_other_id_not_fk values. As I said, there are not many of them.

My priorities are:

  1. select speed
  2. insert speed
  3. index size on disk (as the table will grow several times more)
    ... everything else

And I also wonder why such a huge index is required: 1.2 GB of index for 1 GB of data? The index is still bigger than the data. My logic suggests that this kind of date indexing could be done with a much smaller index. Am I correct? Is there something related to the index type, which is presumably BTREE?
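
My own rough back-of-the-envelope, assuming MyISAM uses its default 6-byte row pointers, suggests it may simply be inherent to a BTREE of this shape:

  8 bytes (datetime) + 4 bytes (int) + 6 bytes (row pointer) = ~18 bytes per entry
  18 bytes * 40,000,000 rows = ~720 MB, before B-tree page overhead and fill factor

which would make 1.2 GB plausible.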

Thank you. You're all great. I am closing the thread.

PatlaDJ