views: 883 · answers: 3

I have to look into solutions for providing a MySQL database that can handle data volumes in the terabyte range and be highly available (five nines). Each database row is likely to have a timestamp and up to 30 float values. The expected workload is up to 2500 inserts/sec. Queries are likely to be less frequent but could be large (maybe involving 100 GB of data), though they will probably only involve single tables.
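
For illustration, a minimal sketch of the row shape described above, using made-up table and column names and showing only a few of the 30 float columns:

    CREATE TABLE samples (
        ts    TIMESTAMP NOT NULL,   -- sample timestamp
        val1  FLOAT,
        val2  FLOAT,
        -- ... val3 through val29 follow the same pattern ...
        val30 FLOAT
    ) ENGINE = InnoDB;              -- or ENGINE = NDBCLUSTER under MySQL Cluster

At roughly 150 bytes per row (before index and storage overhead), 2500 inserts/sec works out to around 30 GB per day, which is consistent with the terabyte-range volumes mentioned.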

I have been looking at MySQL Cluster, given that it is their HA offering. Due to the volume of data I would need to make use of disk-based storage. Realistically, I think only the timestamps could be held in memory and all other data would need to be stored on disk.
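
As a point of reference, MySQL Cluster's disk data tables keep every indexed column (including the primary key) in memory and store only non-indexed columns on disk, which lines up with the timestamps-in-memory assumption above. A rough sketch of that setup, with all names and sizes purely illustrative:

    -- undo log and tablespace for NDB disk data
    CREATE LOGFILE GROUP lg1
        ADD UNDOFILE 'undo1.log'
        INITIAL_SIZE 512M
        ENGINE NDBCLUSTER;

    CREATE TABLESPACE ts1
        ADD DATAFILE 'data1.dat'
        USE LOGFILE GROUP lg1
        INITIAL_SIZE 4G
        ENGINE NDBCLUSTER;

    -- the non-indexed float columns go to disk; the indexed timestamp stays in memory
    CREATE TABLE samples_ndb (
        ts    TIMESTAMP NOT NULL,
        val1  FLOAT,
        val30 FLOAT,
        PRIMARY KEY (ts)
    )
    TABLESPACE ts1 STORAGE DISK
    ENGINE NDBCLUSTER;

Even with disk storage, the timestamp index itself still has to fit in the data nodes' memory, which is roughly what the ndb_size.pl estimate further down the thread measures.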

Does anyone have experience of using MySQL Cluster on a database of this scale? Is it even viable? How does disk-based storage affect performance?

I am also open to other suggestions for how to achieve the desired availability for this volume of data. For example, would it be better to use a third-party library like Sequoia to handle the clustering of standard MySQL instances? Or a more straightforward solution based on MySQL replication?

The only condition is that it must be a MySQL-based solution. I don't think MySQL is the best way to go for the data we are dealing with, but it is a hard requirement.

+1  A: 

Speed-wise, it can be handled. Size-wise, the question is not the size of your data but rather the size of your index, as the indexes must fit fully within memory.

I'd be happy to offer a better answer, but high-end database work is very task-dependent. I'd need to know a lot more about what's going on with the data to be of further help.

Autocracy
The database will be storing a stream of timestamped data that we receive at 50 Hz for a number of locations, hence the 2500 inserts/sec. The configuration of the stream can change at any time, hence there could be a variable number of float values. The timestamp will be the primary key and will have an index. We are assuming that the timestamp column will be in memory with the rest of the data on disk.
Mark
I'd batch insert, then: one insert per client per second for multiple rows. Simple master-master replication will give you failover and easily meet a 50 inserts/second load. The only real question is how important it is to avoid ever losing a sample, and I'm guessing you can deal with 2 or 3 seconds of lost data for a server crash. As an added hint, partitioning your table may be useful if you have an index other than the primary key (see the sketch below). There may also be data warehousing tricks to speed up those large queries.
Autocracy
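
A sketch of the batching and partitioning suggestions above, reusing the illustrative samples table from earlier and assuming a MySQL version (5.1 or later) where UNIX_TIMESTAMP() is allowed as a partitioning function for TIMESTAMP columns; the values and partition boundaries are made up:

    -- one multi-row INSERT per client per second instead of 50 single-row statements
    INSERT INTO samples (ts, val1, val2, val30) VALUES
        ('2009-06-22 10:15:01', 0.10, 0.20, 0.30),
        ('2009-06-22 10:15:01', 0.11, 0.21, 0.31),
        -- ... the rest of the one-second batch ...
        ('2009-06-22 10:15:01', 0.12, 0.22, 0.32);

    -- range partitioning on the timestamp keeps large time-bounded scans
    -- and old-data removal confined to a few partitions
    ALTER TABLE samples
        PARTITION BY RANGE (UNIX_TIMESTAMP(ts)) (
            PARTITION p2009_06 VALUES LESS THAN (UNIX_TIMESTAMP('2009-07-01 00:00:00')),
            PARTITION p2009_07 VALUES LESS THAN (UNIX_TIMESTAMP('2009-08-01 00:00:00')),
            PARTITION pmax     VALUES LESS THAN MAXVALUE
        );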
Thanks for the comments. We did think batch inserts would be the way to go. I did some calculations using the ndb_size.pl script and you were right about the size of the index: the memory required makes using Cluster infeasible. However, we also learned today that some data loss is OK, so, as you said, we are now looking into using simple replication (a setup sketch follows below).
Mark
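
A minimal sketch of the master-master replication setup now being considered, assuming server_id, log_bin and the auto_increment_increment / auto_increment_offset settings are already configured in each server's my.cnf; the host name, account, password and log coordinates below are placeholders:

    -- on each server: create an account the peer can replicate from
    CREATE USER 'repl'@'%' IDENTIFIED BY 'replpass';
    GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';

    -- on each server: point it at the other master and start replicating
    CHANGE MASTER TO
        MASTER_HOST     = 'peer.example.com',
        MASTER_USER     = 'repl',
        MASTER_PASSWORD = 'replpass',
        MASTER_LOG_FILE = 'mysql-bin.000001',
        MASTER_LOG_POS  = 4;
    START SLAVE;

Because the replication is asynchronous, a crash can lose the last second or two of batches that had not yet been replicated, which fits the "some data loss is OK" constraint above.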
+1  A: 
Charlie Martin
Thanks for the comments and for teaching me a new word (pessimal)! To handle the variable number of samples we are thinking of creating a new table each time this changes, as it should not happen too often. We would then have a lookup table that would allow you to find the appropriate data table for a time period (see the sketch below).
Mark
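
A hypothetical shape for the lookup table described in this comment; the names and column choices are illustrative only:

    -- maps a time period (and stream configuration) to the data table holding it
    CREATE TABLE sample_tables (
        valid_from  DATETIME    NOT NULL,
        valid_to    DATETIME    NULL,        -- NULL = current configuration
        table_name  VARCHAR(64) NOT NULL,    -- e.g. 'samples_cfg2'
        num_values  TINYINT     NOT NULL,    -- number of float columns in that table
        PRIMARY KEY (valid_from)
    );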
A: 

This article is really helpful in identifying what can slow down a large MySQL database.

Ólafur Waage