views: 209
answers: 3

I have a C program that mines a huge data source (20GB of raw text) and generates loads of INSERTs to execute on a simple blank table (4 integer columns with 1 primary key). Set up as a MEMORY table, the entire task completes in 8 hours. After finishing, about 150 million rows exist in the table. Eight hours is a perfectly decent number for me. This is a one-time deal.
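
For reference, the table looks roughly like this (the column names here are made up; all I really have is 4 integer columns with a hash as the primary key):

    CREATE TABLE memtable (
        hash  INT UNSIGNED NOT NULL,   -- hash of the mined record (the primary key)
        val1  INT NOT NULL,
        val2  INT NOT NULL,
        val3  INT NOT NULL,
        PRIMARY KEY (hash)
    ) ENGINE = MEMORY;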

The problem comes when trying to convert the MEMORY table into MyISAM so that (A) I'll have the memory freed up for other processes and (B) the data won't be lost when I restart the computer.

ALTER TABLE memtable ENGINE = MyISAM

I've let this ALTER TABLE query run for over two days now, and it's not done. I've now killed it.

If I create the table as MyISAM from the start, the write speed seems terribly poor (especially because the query requires the ON DUPLICATE KEY UPDATE technique). I can't temporarily turn off the keys, either. If I did, the table would become over 1000 times larger, and then I'd have to reprocess the keys and essentially run a GROUP BY on 150,000,000,000 rows. Umm, no.

One of the key constraints to realize: The INSERT query UPDATEs records if the primary key (a hash) exists in the table already.

At the very beginning of an attempt at strictly using MyISAM, I'm getting a rough speed of 1,250 rows per second. Once the index grows, I imagine this rate will tank even more.


I have 16GB of memory installed in the machine. What's the best way to generate a massive table that ultimately ends up as an on-disk, indexed MyISAM table?


Clarification: There are many, many UPDATEs going on from the query (INSERT ... ON DUPLICATE KEY UPDATE val=val+whatever). This isn't, by any means, a raw dump problem. My reasoning for trying a MEMORY table in the first place was to speed up all the index lookups and table changes that occur for every INSERT.
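
Each mined row boils down to a statement shaped roughly like this (a sketch only; the real column names and increment expressions are placeholders):

    -- Insert a new row, or accumulate into the existing one if the hash is already there.
    INSERT INTO memtable (hash, val1, val2, val3)
    VALUES (1234567, 1, 0, 5)
    ON DUPLICATE KEY UPDATE
        val1 = val1 + VALUES(val1),
        val2 = val2 + VALUES(val2),
        val3 = val3 + VALUES(val3);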

+3  A: 

If you intend to make it a MyISAM table, why are you creating it in memory in the first place? If it's only for speed, I think the conversion to a MyISAM table is going to negate any speed improvement you get by creating it in memory to start with.

You say inserting directly into an "on disk" table is too slow (though I'm not sure how you're deciding that when your current method is taking days), but you may be able to turn off or remove the uniqueness constraints, use a DELETE query later to re-establish uniqueness, and then re-enable/add the constraints. I have used this technique when importing into an InnoDB table in the past, and found that even with the later delete it was much faster overall.
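
A rough sketch of that pattern, with hypothetical names, assuming duplicate rows can simply be discarded and that a surrogate auto-increment id exists purely to break ties during the dedup:

    -- 1. Load into a table with no uniqueness constraint on the hash.
    CREATE TABLE rawtable (
        id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        hash INT UNSIGNED NOT NULL,
        val1 INT NOT NULL,
        val2 INT NOT NULL,
        val3 INT NOT NULL,
        KEY (id)
    ) ENGINE = MyISAM;

    -- ... bulk-load everything, duplicates and all ...

    -- 2. Re-establish uniqueness with a DELETE: keep one row per hash.
    DELETE a FROM rawtable a
    JOIN rawtable b ON a.hash = b.hash AND a.id > b.id;

    -- 3. Add the constraint back once the data is unique.
    ALTER TABLE rawtable ADD PRIMARY KEY (hash), DROP COLUMN id;

(As the comments below point out, this assumes duplicates are exact copies to throw away, which isn't the case for this workload.)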

Another option might be to create a CSV file instead of the INSERT statements, and either load it into the table using LOAD DATA INFILE (I believe that is faster than the inserts, but I can't find a reference at present) or use it directly via the CSV storage engine, depending on your needs.
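
For reference, a minimal LOAD DATA INFILE sketch (the file path and column names are assumptions):

    -- Bulk-load a comma-separated dump in one pass.
    LOAD DATA INFILE '/tmp/mined_rows.csv'
    INTO TABLE memtable
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    (hash, val1, val2, val3);

One caveat: for duplicate keys, LOAD DATA only offers REPLACE or IGNORE, not the val = val + ... accumulation the question relies on, so that rollup would still have to happen somewhere.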

Brenton Alker
I updated the question to answer a few of your questions. I've found sources claiming a CSV insert runs faster, but it seems fairly asinine to dump data to a multi-gigabyte CSV **then** load it into the database. That's adding a huge amount of sluggish hard drive I/O to the problem.
brianreavis
But you're dumping into an SQL file with a set of INSERTs anyway, aren't you? I don't see how a CSV file is any different I/O-wise. I've added a paragraph to explain another method that may fix the "importing into MyISAM is too slow" issue.
Brenton Alker
I cannot remove the uniqueness constraints. I am using `ON DUPLICATE KEY UPDATE` to update a record if the primary key (a hash) already exists in the table. If I were to remove the constraint, the table would likely be over 1000 times the size (and that's not an exaggeration). There are a huge number of UPDATEs going on; it's not just a raw dump into the database.
brianreavis
I'm opening a connection to the database from C and piping the queries straight to it.
brianreavis
Right, I would have made the C program just output the INSERTs and piped them into the MySQL CLI tool on the command line, but either way. In that case, maybe the "remove index, import, delete, create index" method might work for you. The advantage is that the constraints are only checked when the index is created (at the end), not on every insert.
Brenton Alker
The current problem isn't the time it takes for MySQL to read/interpret a query, so running via the CLI won't help matters. As stated earlier, I cannot use the "remove index, import, delete, create index" method. If I were to, the table would approach and possibly exceed 150 *billion* entries... that size simply isn't manageable and is certainly not the solution. Trying to ensure uniqueness in a dataset that large would be an even bigger nightmare than what I have now.
brianreavis
Sorry, I missed the comment about the majority of the queries being updates and "remove index, import, delete, create index" not being an option. Given what you have said, I don't think there is a much better solution than those proposed, you need to process the dataset at some point, which means either before, during or after the import. If you can't do this in the C before you pass it to MySQL (which is likely difficult, but if the file is sorted for example it might be easy), then you can let MySQL do it in the way you are... (cont.)
Brenton Alker
If you can't do that (you say it's too slow), you need to do it after the import (which you say gives you a table that is too big). The only other option is adding a step, such as dumping from C to a file and pre-processing that, or the memory table option you are already using. If this is a one-off import as you say, then I think your only real option is to pick one (I'm leaning towards importing straight to MyISAM for simplicity; considered InnoDB?) and letting it go. Even if it takes 3 days, it's going to be quicker than spending a week figuring out a "faster" way.
Brenton Alker
+1  A: 

Sorry to keep throwing comments at you (last one, probably).

I just found this article, which provides an example of converting a large table from MyISAM to InnoDB. While this isn't what you are doing, he uses an intermediate MEMORY table and describes going from memory to InnoDB in an efficient way: ordering the table in memory the way InnoDB expects it to be ordered in the end. If you aren't tied to MyISAM, it might be worth a look since you already have a "correct" memory table built.
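
The gist of the approach, roughly sketched (the target table name is assumed; the point is to insert in primary-key order so InnoDB builds its clustered index sequentially rather than from random inserts):

    -- Copy the finished MEMORY table into InnoDB in primary-key order.
    CREATE TABLE disktable LIKE memtable;
    ALTER TABLE disktable ENGINE = InnoDB;

    INSERT INTO disktable
    SELECT * FROM memtable
    ORDER BY hash;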

Brenton Alker
InnoDB would be fine. This is very clever... I like it. Thanks for throwing comments at me. I appreciate it. :)
brianreavis
A: 

I don't use MySQL, but I use SQL Server, and this is the process I use to handle a file of similar size. First I dump the file into a staging table that has no constraints. Then I identify and delete the dups from the staging table. Then I search for existing records that might match and put the id field into a column in the staging table. Then I update where the id field column is not null and insert where it is null.

One of the reasons I do all the work of getting rid of the dups in the staging table is that it means less impact on the prod table when I run it, and thus it is faster in the end. My whole process runs in less than an hour (and actually does much more than I describe, as I also have to denormalize and clean the data) and affects production tables for less than 15 minutes of that time. I don't have to worry about adjusting any constraints or dropping indexes or any of that, since I do most of my processing before I hit the prod table.
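
Translated very roughly into MySQL terms (all names are hypothetical, and this uses joins instead of the marker column described above), the staging approach might look something like this:

    -- 1. Bulk-load raw rows into an unconstrained staging table.
    CREATE TABLE staging (
        hash INT UNSIGNED,
        val1 INT,
        val2 INT,
        val3 INT
    ) ENGINE = MyISAM;

    LOAD DATA INFILE '/tmp/mined_rows.csv'
    INTO TABLE staging
    FIELDS TERMINATED BY ','
    (hash, val1, val2, val3);

    -- 2. Collapse duplicates inside staging before touching the prod table.
    CREATE TABLE staging_rollup ENGINE = MyISAM AS
    SELECT hash, SUM(val1) AS val1, SUM(val2) AS val2, SUM(val3) AS val3
    FROM staging
    GROUP BY hash;

    -- 3. Update rows that already exist in prod, then insert the new ones.
    UPDATE prod p
    JOIN staging_rollup s ON p.hash = s.hash
    SET p.val1 = p.val1 + s.val1,
        p.val2 = p.val2 + s.val2,
        p.val3 = p.val3 + s.val3;

    INSERT INTO prod (hash, val1, val2, val3)
    SELECT s.hash, s.val1, s.val2, s.val3
    FROM staging_rollup s
    LEFT JOIN prod p ON p.hash = s.hash
    WHERE p.hash IS NULL;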

Consider whether a similar process might work better for you. Also, could you use some sort of bulk import to get the raw data into the staging table (I pull the 22-gig file I have into staging in around 16 minutes) instead of working row by row?

HLGEM