I'm looking at building a Rails application which will have some pretty large tables with upwards of 500 million rows. To keep things snappy I'm currently looking into how a large table can be split into more manageable chunks. I see that as of MySQL 5.1 there is a partitioning option, and that's a possibility, but I don't like that the column which determines the partitioning has to be part of the table's primary key.

What I'd really like to do is split the table that an AR model writes to based upon the values written, but as far as I am aware there is no way to do this. Does anyone have any suggestions as to how I might implement this, or any alternative strategies?

Thanks

Arfon

+4  A: 

Partition columns in MySQL are not limited to primary keys. In fact, a partition column does not have to be a key at all (though one will be created for it transparently). You can partition by RANGE, HASH, KEY and LIST (which is similar to RANGE, except that it uses a set of discrete values). Read the MySQL manual for an overview of the partitioning types.
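For illustration, here is a minimal sketch of RANGE and LIST partitioning as MySQL 5.1 accepts them; the table and column names are made up:

    -- RANGE partitioning on a table with no primary key at all
    CREATE TABLE events (
        logged_at DATETIME,
        payload   VARCHAR(255)
    ) ENGINE=InnoDB
    PARTITION BY RANGE (YEAR(logged_at)) (
        PARTITION p2008 VALUES LESS THAN (2009),
        PARTITION p2009 VALUES LESS THAN (2010),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );

    -- LIST partitioning over a set of discrete values
    CREATE TABLE regions (
        region_id INT,
        name      VARCHAR(255)
    ) ENGINE=InnoDB
    PARTITION BY LIST (region_id) (
        PARTITION europe   VALUES IN (1, 2, 3),
        PARTITION americas VALUES IN (4, 5, 6)
    );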

There are alternative solutions such as HScale - a middleware plug-in that transparently partitions tables based on certain criteria. HiveDB is an open-source framework for horizontal partitioning of MySQL.

In addition to sharding and partitioning you should employ some sort of clustering. The simplest setup is a replication-based one that helps you spread the load over several physical servers. You should also consider more advanced clustering solutions such as MySQL Cluster (probably not an option due to the size of your database) and clustering middleware such as Sequoia.
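A basic master/replica setup, for example, only needs each read replica pointed at the master's binary log, roughly like this (host, credentials and log coordinates are placeholders for your own setup):

    -- Run on each read replica (MySQL 5.1)
    CHANGE MASTER TO
        MASTER_HOST='db-master.example.com',
        MASTER_USER='repl',
        MASTER_PASSWORD='secret',
        MASTER_LOG_FILE='mysql-bin.000001',
        MASTER_LOG_POS=4;
    START SLAVE;

The application then sends writes to the master and spreads reads across the replicas.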

I actually asked a relevant question regarding scaling with MySQL here on Stack Overflow some time ago, which I ended up answering myself several days later after collecting a lot of information on the subject. It might be relevant for you as well.

Eran Galperin
A: 

Thanks for your answer and the link to the scaling with MySQL thread a while back. I'd already been planning some kind of clustering/replication to spread the load.

To clarify my point about primary keys: my understanding is that if a table has a primary key, then every column used in the partitioning expression must be part of that key:

Partitioning Keys, Primary Keys, and Unique Keys

I don't really want to have a primary key based upon any of the partitioning columns. This is the syntax that I was trying to use to create the table:

    CREATE TABLE `annotations` (
        `id` int(11) DEFAULT NULL auto_increment PRIMARY KEY,
        `value` varchar(255),
        `task_id` int(11),
        `project_id` int(11),
        `user_id` int(11),
        `created_at` datetime,
        `updated_at` datetime
    ) ENGINE=InnoDB
    PARTITION BY RANGE (task_id) (
        PARTITION low VALUES LESS THAN (10),
        PARTITION high VALUES LESS THAN (20)
    );
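
As far as I can tell, one way to make MySQL 5.1 accept this is to fold the partition column into a composite primary key, along the lines of the sketch below (same columns as above; whether a composite key is acceptable for the application is a separate question):

    CREATE TABLE `annotations` (
        `id` int(11) NOT NULL auto_increment,
        `value` varchar(255),
        `task_id` int(11) NOT NULL,
        `project_id` int(11),
        `user_id` int(11),
        `created_at` datetime,
        `updated_at` datetime,
        PRIMARY KEY (`id`, `task_id`)
    ) ENGINE=InnoDB
    PARTITION BY RANGE (task_id) (
        PARTITION low VALUES LESS THAN (10),
        PARTITION high VALUES LESS THAN (20)
    );

With PRIMARY KEY (`id`, `task_id`) the auto_increment column is still the first column of a key, so InnoDB accepts it, but `id` on its own is no longer enforced as unique by the database.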

Thanks

Arfon

arfon
Which version of MySQL are you running?
Eran Galperin
5.1.29-rc at the moment
arfon
As far as I can tell, the queries are good. Notice, though, that they are two separate queries, in case you were trying to run them as one.
Eran Galperin
+1  A: 

If you want to split your data by time, the following solution may fit your needs: you can probably use MERGE tables.

Let's assume your table is called MyTable and that you need one table per week:

  1. Your app always logs to the same table.
  2. A weekly job atomically renames your table and recreates an empty one: MyTable is renamed to MyTable-Year-WeekNumber, and a fresh empty MyTable is created.
  3. The MERGE tables are dropped and recreated.

If you want all the data from the past three months, you create a MERGE table which includes only the tables from the last three months. Create as many MERGE tables as you have distinct periods to query. If you can avoid including the table that is currently being written to (MyTable in our example), you'll be even better off, as you won't have any read/write concurrency to worry about. A rough sketch of the rotation is below.
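
A minimal sketch of the weekly job, assuming the underlying tables are MyISAM (the MERGE engine only works over identically defined MyISAM tables) and that the tables from earlier weeks already exist from previous runs; the column definitions and names such as MyTable_2009_12 are placeholders:

    -- Hypothetical base table (MyISAM so it can take part in a MERGE table)
    CREATE TABLE MyTable (
        logged_at DATETIME,
        message   VARCHAR(255)
    ) ENGINE=MyISAM;

    -- Step 2: atomically swap in a fresh, empty table
    CREATE TABLE MyTable_new LIKE MyTable;
    RENAME TABLE MyTable TO MyTable_2009_12,
                 MyTable_new TO MyTable;

    -- Step 3: rebuild a MERGE table covering the last three months
    DROP TABLE IF EXISTS MyTable_last_3_months;
    CREATE TABLE MyTable_last_3_months (
        logged_at DATETIME,
        message   VARCHAR(255)
    ) ENGINE=MERGE
      UNION=(MyTable_2009_10, MyTable_2009_11, MyTable_2009_12)
      INSERT_METHOD=NO;

Queries against MyTable_last_3_months then read from the three underlying tables as if they were one.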

+1  A: 

You can handle this entirely in Active Record using DataFabric.

It's not that complicated to implement similar behavior yourself if that's not suitable. Google "sharding" for a lot of discussion on the architectural pattern of handling table partitioning within the app tier. It has the advantage of avoiding middleware and database-vendor-specific features; on the other hand, it is more code in your app that you're responsible for.
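If you roll it yourself, the idea is simply that the application computes the shard from the partitioning value before issuing the query; on the SQL side that is nothing more than a set of identically structured tables. A purely hypothetical sketch, reusing the annotations table from the question:

    -- Two hypothetical shard tables; the app might compute task_id % 2
    -- and read from / write to annotations_0 or annotations_1 accordingly.
    CREATE TABLE annotations_0 LIKE annotations;
    CREATE TABLE annotations_1 LIKE annotations;

The choice of shard key and shard count is the part you have to get right up front, since rebalancing later means moving rows between tables.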

Jason Watkins