views: 209
answers: 3

I have a C program that mines a huge data source (20GB of raw text) and generates loads of INSERTs to execute on a simple blank table (4 integer columns with 1 primary key). Set up as a MEMORY table, the entire task completes in 8 hours. After finishing, about 150 million rows exist in the table. Eight hours is a perfectly decent number for me. This is a one-time deal.
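
For reference, the table looks roughly like this (the column names here are made up; all I really have is 4 integer columns with a hash as the primary key):

    CREATE TABLE memtable (
        hash  INT UNSIGNED NOT NULL,   -- hash of the mined record (the primary key)
        val1  INT NOT NULL,
        val2  INT NOT NULL,
        val3  INT NOT NULL,
        PRIMARY KEY (hash)
    ) ENGINE = MEMORY;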

The problem comes when trying to convert the MEMORY table into MyISAM so that (A) I'll have the memory freed up for other processes and (B) the data won't be lost when I restart the computer.

ALTER TABLE memtable ENGINE = MyISAM

I've let this ALTER TABLE query run for over two days now, and it's not done. I've now killed it.

If I create the table as MyISAM from the start, the write speed seems terribly poor (especially because the query requires the ON DUPLICATE KEY UPDATE technique). I can't temporarily turn off the keys, either. If I did, the table would become over 1000 times larger, and then I'd have to reprocess the keys and essentially run a GROUP BY on 150,000,000,000 rows. Umm, no.

One of the key constraints to realize: The INSERT query UPDATEs records if the primary key (a hash) exists in the table already.

At the very beginning of an attempt at strictly using MyISAM, I'm getting a rough speed of 1,250 rows per second. Once the index grows, I imagine this rate will tank even more.


I have 16GB of memory installed in the machine. What's the best way to generate a massive table that ultimately ends up as an on-disk, indexed MyISAM table?


Clarification: There are many, many UPDATEs going on from the query (INSERT ... ON DUPLICATE KEY UPDATE val=val+whatever). This isn't, by any means, a raw dump problem. My reasoning for trying a MEMORY table in the first place was to speed up all the index lookups and table changes that occur for every INSERT.
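
Each mined row boils down to a statement shaped roughly like this (a sketch only; the real column names and increment expressions are placeholders):

    -- Insert a new row, or accumulate into the existing one if the hash is already there.
    INSERT INTO memtable (hash, val1, val2, val3)
    VALUES (1234567, 1, 0, 5)
    ON DUPLICATE KEY UPDATE
        val1 = val1 + VALUES(val1),
        val2 = val2 + VALUES(val2),
        val3 = val3 + VALUES(val3);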

+3  A: 

If you intend to make it a MyISAM table, why are you creating it in memory in the first place? If it's only for speed, I think the conversion to a MyISAM table is going to negate any speed improvement you get by creating it in memory to start with.

You say inserting directly into an "on disk" table is too slow (though I'm not sure how you're deciding that when your current method is taking days), but you may be able to turn off or remove the uniqueness constraints, use a DELETE query later to re-establish uniqueness, and then re-enable/add the constraints. I have used this technique when importing into an InnoDB table in the past, and found that even with the later delete it was much faster overall.
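
A rough sketch of that pattern, with hypothetical names, assuming duplicate rows can simply be discarded and that a surrogate auto-increment id exists purely to break ties during the dedup:

    -- 1. Load into a table with no uniqueness constraint on the hash.
    CREATE TABLE rawtable (
        id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        hash INT UNSIGNED NOT NULL,
        val1 INT NOT NULL,
        val2 INT NOT NULL,
        val3 INT NOT NULL,
        KEY (id)
    ) ENGINE = MyISAM;

    -- ... bulk-load everything, duplicates and all ...

    -- 2. Re-establish uniqueness with a DELETE: keep one row per hash.
    DELETE a FROM rawtable a
    JOIN rawtable b ON a.hash = b.hash AND a.id > b.id;

    -- 3. Add the constraint back once the data is unique.
    ALTER TABLE rawtable ADD PRIMARY KEY (hash), DROP COLUMN id;

(As the comments below point out, this assumes duplicates are exact copies to throw away, which isn't the case for this workload.)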

Another option might be to create a CSV file instead of the INSERT statements, and either load it into the table using LOAD DATA INFILE (I believe that is faster than the inserts, but I can't find a reference at present) or use it directly via the CSV storage engine, depending on your needs.
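
For reference, a minimal LOAD DATA INFILE sketch (the file path and column names are assumptions):

    -- Bulk-load a comma-separated dump in one pass.
    LOAD DATA INFILE '/tmp/mined_rows.csv'
    INTO TABLE memtable
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    (hash, val1, val2, val3);

One caveat: for duplicate keys, LOAD DATA only offers REPLACE or IGNORE, not the val = val + ... accumulation the question relies on, so that rollup would still have to happen somewhere.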

Brenton Alker
I updated the question to answer a few of your questions. I've found sources claiming a CSV insert runs faster, but it seems fairly asinine to dump data to a multi-gigabyte CSV **then** load it into the database. That's adding a huge amount of sluggish hard drive I/O to the problem.
brianreavis
But you're dumping into an SQL file with a set of INSERTs anyway, aren't you? I don't see how a CSV file is any different I/O-wise. I've added a paragraph to explain another method that may fix the "importing into MyISAM is too slow" issue.
Brenton Alker
I cannot remove the uniqueness constraints. I am using `ON DUPLICATE KEY UPDATE` to update a record if the primary key (a hash) already exists in the table. If I were to remove the constraint, the table would likely be over 1000 times the size (and that's not an exaggeration). There are a huge number of UPDATEs going on; it's not just a raw dump into the database.
brianreavis
I'm opening a connection to the database from C and piping the queries straight to it.
brianreavis
Right, I would have made the C program just output the INSERTs and piped them into the MySQL CLI tool on the command line, but either way. In that case, maybe the "remove index, import, delete, create index" method might work for you. The advantage is that the constraints are only checked when the index is created (at the end), not on every insert.
Brenton Alker
The current problem isn't the time it takes for MySQL to read/interpret a query, so running via the CLI won't help matters. As stated earlier, I cannot use the "remove index, import, delete, create index" method. If I were to, the table would approach and possibly exceed 150 *billion* entries... that size simply isn't manageable and is certainly not the solution. Trying to ensure uniqueness in a dataset that large would be an even bigger nightmare than what I have now.
brianreavis
Sorry, I missed the comment about the majority of the queries being updates and "remove index, import, delete, create index" not being an option. Given what you have said, I don't think there is a much better solution than those proposed, you need to process the dataset at some point, which means either before, during or after the import. If you can't do this in the C before you pass it to MySQL (which is likely difficult, but if the file is sorted for example it might be easy), then you can let MySQL do it in the way you are... (cont.)
Brenton Alker
If you can't do that (you say it's too slow), you need to do it after the import (which you say gives you a table that is too big). The only other option is adding a step, such as dumping from C to a file and pre-processing that, or the memory table option you are already using. If this is a one-off import as you say, then I think your only real option is to pick one (I'm leaning towards importing straight to MyISAM for simplicity; considered InnoDB?) and letting it go. Even if it takes 3 days, it's going to be quicker than spending a week figuring out a "faster" way.
Brenton Alker
+1  A: 

Sorry to keep throwing comments at you (last one, probably).

I just found this article, which provides an example of converting a large table from MyISAM to InnoDB. While this isn't what you are doing, he uses an intermediate MEMORY table and describes going from memory to InnoDB in an efficient way: ordering the table in memory the way InnoDB expects it to be ordered in the end. If you aren't tied to MyISAM, it might be worth a look since you already have a "correct" memory table built.
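
The gist of the approach, roughly sketched (the target table name is assumed; the point is to insert in primary-key order so InnoDB builds its clustered index sequentially rather than from random inserts):

    -- Copy the finished MEMORY table into InnoDB in primary-key order.
    CREATE TABLE disktable LIKE memtable;
    ALTER TABLE disktable ENGINE = InnoDB;

    INSERT INTO disktable
    SELECT * FROM memtable
    ORDER BY hash;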

Brenton Alker
InnoDB would be fine. This is very clever... I like it. Thanks for throwing comments at me. I appreciate it. :)
brianreavis
A: 

I don't use MySQL, but I use SQL Server, and this is the process I use to handle a file of similar size. First I dump the file into a staging table that has no constraints. Then I identify and delete the dups from the staging table. Then I search for existing records that might match and put the id field into a column in the staging table. Then I update where the id field column is not null and insert where it is null.

One of the reasons I do all the work of getting rid of the dups in the staging table is that it means less impact on the prod table when I run it, and thus it is faster in the end. My whole process runs in less than an hour (and actually does much more than I describe, as I also have to denormalize and clean the data) and affects production tables for less than 15 minutes of that time. I don't have to worry about adjusting any constraints or dropping indexes or any of that, since I do most of my processing before I hit the prod table.
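
Translated very roughly into MySQL terms (all names are hypothetical, and this uses joins instead of the marker column described above), the staging approach might look something like this:

    -- 1. Bulk-load raw rows into an unconstrained staging table.
    CREATE TABLE staging (
        hash INT UNSIGNED,
        val1 INT,
        val2 INT,
        val3 INT
    ) ENGINE = MyISAM;

    LOAD DATA INFILE '/tmp/mined_rows.csv'
    INTO TABLE staging
    FIELDS TERMINATED BY ','
    (hash, val1, val2, val3);

    -- 2. Collapse duplicates inside staging before touching the prod table.
    CREATE TABLE staging_rollup ENGINE = MyISAM AS
    SELECT hash, SUM(val1) AS val1, SUM(val2) AS val2, SUM(val3) AS val3
    FROM staging
    GROUP BY hash;

    -- 3. Update rows that already exist in prod, then insert the new ones.
    UPDATE prod p
    JOIN staging_rollup s ON p.hash = s.hash
    SET p.val1 = p.val1 + s.val1,
        p.val2 = p.val2 + s.val2,
        p.val3 = p.val3 + s.val3;

    INSERT INTO prod (hash, val1, val2, val3)
    SELECT s.hash, s.val1, s.val2, s.val3
    FROM staging_rollup s
    LEFT JOIN prod p ON p.hash = s.hash
    WHERE p.hash IS NULL;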

Consider whether a similar process might work better for you. Also, could you use some sort of bulk import to get the raw data into the staging table (I pull the 22-gig file I have into staging in around 16 minutes) instead of working row by row?

HLGEM