views:

292

answers:

4

Can anyone tell me how adding a key scales in MySQL? I have 500,000,000 rows in a table, trans, with columns i (INT UNSIGNED), j (INT UNSIGNED), nu (DOUBLE), A (DOUBLE). I try to index a column, e.g.

ALTER TABLE trans ADD KEY idx_A (A);

and I wait. For a table of 14,000,000 rows it took about 2 minutes to execute on my MacBook Pro, but for the whole half a billion rows it has been running for 15 hours and counting. Am I doing something wrong, or am I just being naive about how indexing a database scales with the number of rows?

+1  A: 

From my experience: if the hardware can cope with it, indexing large tables with MySQL usually scales pretty linearly. I have tried it with tables of about 100,000,000 rows so far, but not on a notebook - mainly on strong servers.

I guess it depends mainly on hardware factors, the storage engine you're using (MyISAM, InnoDB or whatever), and a bit on whether the table is otherwise in use in the meantime. When I was doing it, disk usage usually jumped sky-high, while CPU usage stayed low. Not sure about the hard disk in the MacBook, but I guess it isn't the fastest around.
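
If you're not sure what you're working with, a standard status query shows the engine and the current data/index sizes (nothing here is specific to your setup):

SHOW TABLE STATUS LIKE 'trans';
-- The Engine, Data_length and Index_length columns show the storage engine
-- and the current on-disk size of the data and of the indexes.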

If you're using MyISAM tables, maybe have a closer look at the index files in the table directory and see how they change over time.
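
If you'd rather not dig around the filesystem, roughly the same information is visible from inside the server - datadir tells you where the per-database table directories live, and the process list shows the current state of the ALTER:

SHOW VARIABLES LIKE 'datadir';   -- where the per-database table directories live
SHOW PROCESSLIST;                -- State shows e.g. 'copy to tmp table' or 'Repair by sorting'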

Björn
Thanks for the quick reply, Björn. I followed your suggestion. I take it the files #sql-a8_6.MYD (currently 7,455,506,432 bytes) and #sql-a8_6.MYI (currently 2,148,865,024 bytes) are the new version of the database being built and the index I requested, respectively? Then if the original table is trans.MYD (12,645,156,375 bytes), I'm about 60% done? It's beginning to look like I'd be better off splitting the huge table into 20-or-so smaller tables. Thanks, Christian
Overall that should be it. Well, it all depends on what you want to do with that amount of data. 500,000,000 rows are a lot, so if you want to do some fancy reporting afterwards, try minimizing the data. Either try to split it, or take a look at MySQL's partitioning features (available from version 5.1).
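For example, a rough sketch of a range-partitioned copy of the table (untested; the partitioning column and boundaries are just placeholders - partition on whatever you actually query by):

-- Hypothetical sketch: partitioning column and boundaries are placeholders
CREATE TABLE trans_part (
  i  INT UNSIGNED NOT NULL,
  j  INT UNSIGNED NOT NULL,
  nu DOUBLE NOT NULL,
  A  DOUBLE NOT NULL,
  KEY idx_A (A)
) ENGINE=MyISAM
PARTITION BY RANGE (i) (
  PARTITION p0 VALUES LESS THAN (125000000),
  PARTITION p1 VALUES LESS THAN (250000000),
  PARTITION p2 VALUES LESS THAN (375000000),
  PARTITION p3 VALUES LESS THAN MAXVALUE
);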
Björn
A: 

You might want to look into splitting your table (database sharding).
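
A very rough sketch of a manual split, if you go that route (the table name and range boundary are made up - adjust them to your data):

-- Hypothetical example of splitting by ranges of i; name and boundary are placeholders
CREATE TABLE trans_part_1 LIKE trans;
ALTER TABLE trans_part_1 ADD KEY idx_A (A);
INSERT INTO trans_part_1 SELECT * FROM trans WHERE i < 25000000;
-- repeat with the remaining ranges for trans_part_2, trans_part_3, ...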

Mihai Secasiu
+1  A: 

Firstly, your table definition could make a big difference here. If you don't need NULL values in your columns, define them NOT NULL. This will save space in the index, and presumably time while creating it.

CREATE TABLE x ( 
  i INTEGER UNSIGNED NOT NULL, 
  j INTEGER UNSIGNED NOT NULL, 
  nu DOUBLE NOT NULL, 
  A DOUBLE NOT NULL 
);

As for the time taken to create the index: this requires a full table scan and will show up as REPAIR BY SORTING. In your case (a massive data set) it should be quicker to create a new table with the required indexes and insert the data into it, since the indexes are then built incrementally on insert and the REPAIR BY SORTING operation is avoided. There is a similar concept explained in this article.

CREATE DATABASE trans_clone;
CREATE TABLE trans_clone.trans LIKE originalDB.trans;
ALTER TABLE trans_clone.trans ADD KEY idx_A (A);

Then script the insert in chunks (as per the article), or dump the data using mysqldump:

mysqldump originalDB trans --extended-insert --skip-add-drop-table --no-create-db --no-create-info > originalDB.trans.sql
mysql trans_clone < originalDB.trans.sql

This will insert the data, but will not require an index rebuild (the index is built as each row is inserted) and should complete much faster.

Andy
A: 

There are a couple of factors to consider:

  • Sorting is an N log(N) operation.
  • The sort for 14M rows might well fit in main memory; the sort for 500M rows probably doesn't, so the sort spills to disk, which slows things down enormously.

Since the factor is about 30 in size, the nominal sort time for the big data set would be of the order of 50 times as long - under two hours. However, you need 8 bytes per data value and about another 8 bytes of overhead (that's a guess - tune it to MySQL if you know more about what it stores in an index). So, 14M × 16 ≈ 220 MB of main memory, but 500M × 16 ≈ 8 GB of main memory. Unless your machine has that much memory to spare (and MySQL is configured to use it), the big sort spills to disk, and that accounts for a lot of the rest of the time.
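
On the MySQL side, the main knob for MyISAM index builds is the sort buffer. A quick check, plus a session-level bump (the 1 GB value below is only an illustration - size it to the RAM you can actually spare):

SHOW VARIABLES LIKE 'myisam_sort_buffer_size';
-- Illustrative value only (1 GB); size this to the memory you can spare
SET SESSION myisam_sort_buffer_size = 1024 * 1024 * 1024;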

Jonathan Leffler
Thanks a lot, that makes sense to me - I've only got 4 GB. Looks like splitting (partitioning?) the data up as suggested above makes a lot of sense.