Hello,

I'm trying to insert about 500 million rows of garbage data into a database for testing. Right now I have a PHP script looping through a few SELECT/INSERT statements each inside a TRANSACTION -- clearly this isn't the best solution. The tables are InnoDB (row-level locking).

I'm wondering: if I (properly) fork the process, will that speed up the INSERTs? At the current rate it will take 140 hours to complete. I'm concerned about a few things:

  1. If INSERT statements must acquire a write lock, will that render forking useless, since multiple processes wouldn't be able to write to the same table at the same time?

  2. I'm using SELECT...LAST_INSERT_ID() (inside a TRANSACTION). Will this logic break when multiple processes are INSERTing into the database? I could create a new database connection for each fork, which I hope would avoid the problem.

  3. How many processes should I use? The queries themselves are simple, and I have an ordinary dual-core dev box with 2GB RAM. I've set InnoDB to use 8 threads (innodb_thread_concurrency=8), but I'm not sure whether that means I should use 8 processes, or whether that's even the right way to think about matching the two.

Thanks for your help!

+1  A: 

1) Yes, there will be lock contention, but InnoDB is designed to handle multiple threads trying to insert. Sure, they won't insert simultaneously, but it will handle serializing the inserts for you. Just make sure you explicitly commit your transactions, and do it ASAP; this will ensure you get the best possible insert performance.

2) No, this logic will not break, provided you have one connection per process, since LAST_INSERT_ID() is connection-specific (see the sketch below).

3) This is one of those things you just need to benchmark to figure out. Actually, I would make the program self-adjusting: run 100 inserts with 8 processes and record the execution time, then try again with half as many and twice as many. Whichever is faster, benchmark more process counts around that number.

In general, you should just go ahead and benchmark this kind of thing to see what's faster. In the time it takes to think it over and write it up, you could already have preliminary numbers.
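
Here's a rough sketch of the forking setup, in case it helps. This is a minimal example, assuming the pcntl and mysqli extensions; the `garbage` table, its columns, and the connection parameters are made up for illustration:

    <?php
    // Each child opens its OWN connection after forking, so
    // LAST_INSERT_ID() (mysqli->insert_id) stays scoped to that child.

    $workers   = 4;      // benchmark around this number, as described above
    $perWorker = 1000;   // rows per child for a quick timing run
    $batchSize = 100;    // commit every N rows to keep transactions short

    for ($w = 0; $w < $workers; $w++) {
        $pid = pcntl_fork();
        if ($pid === -1) {
            die("fork failed\n");
        }
        if ($pid === 0) {
            // Child: connect here -- never reuse a handle opened by the parent.
            $db = new mysqli('localhost', 'user', 'pass', 'test');
            $db->autocommit(false);   // run in transactional mode
            for ($i = 1; $i <= $perWorker; $i++) {
                $db->query("INSERT INTO garbage (payload) VALUES ('junk')");
                $parentId = $db->insert_id;   // per-connection, fork-safe
                $db->query("INSERT INTO garbage (parent_id, payload)
                            VALUES ($parentId, 'more junk')");
                if ($i % $batchSize === 0) {
                    $db->commit();    // close the transaction ASAP, as above
                }
            }
            $db->commit();
            $db->close();
            exit(0);
        }
    }

    // Parent: wait for all children to finish.
    while (pcntl_wait($status) > 0);

The batch size is a tradeoff: committing every row wastes time on disk syncs, while one giant transaction bloats the undo log, so commit in modest batches.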

longneck
Thanks for the detailed response! Glad to know there's not much cause for concern.
Jasie
+2  A: 

The MySQL documentation has a discussion of efficiently inserting a large number of records. The clear winner seems to be the LOAD DATA INFILE command, followed by INSERT statements that supply multiple VALUES lists per statement.
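
For illustration, a rough sketch of both approaches in PHP, assuming mysqli. The file path, `garbage` table, and column names are hypothetical, and LOAD DATA LOCAL INFILE requires local_infile to be enabled on the client and server:

    <?php
    // Winner: dump the rows to a CSV file, then bulk-load it in one statement.
    $rows = 500000;                 // rows per file; repeat for more
    $csv  = '/tmp/garbage.csv';

    $fh = fopen($csv, 'w');
    for ($i = 0; $i < $rows; $i++) {
        fputcsv($fh, array($i % 1000, 'junk payload ' . $i));
    }
    fclose($fh);

    $db = new mysqli('localhost', 'user', 'pass', 'test');
    $db->query("LOAD DATA LOCAL INFILE '/tmp/garbage.csv'
                INTO TABLE garbage
                FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
                (parent_id, payload)");

    // Runner-up: multi-row INSERTs, several VALUES lists per statement.
    $db->query("INSERT INTO garbage (parent_id, payload)
                VALUES (1, 'a'), (2, 'b'), (3, 'c')");
    $db->close();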

Martin v. Löwis
Thanks for that tip! 20x faster, excellent.
Jasie