I need to insert about 1.8 million rows from a CSV file into a MySQL database. (only one table)

Currently using Java to parse through the file and insert each line.

As you can imagine, this takes quite a few hours to run (roughly 10).

The reason I'm not piping it straight in from the file into the db is that the data has to be manipulated before it's added to the database.

This process needs to be run by an IT manager on their end, so I've set it up as a nice batch file for them to run after they drop the new CSV file into the right location. So, I need to make this work nicely by dropping the file into a certain location and running a batch file. (Windows environment)

My question is, what would be the fastest way to insert this much data: large multi-row inserts from a temp parsed file, or one insert at a time? Some other idea, possibly?

The second question is, how can I optimize my MySQL installation to allow very quick inserts? (There will be a point where a large select of all the data is required as well.)

Note: the table will eventually be dropped and the whole process run again at a later date.

Some clarification: currently using ...opencsv.CSVReader to parse the file, then doing an insert on each line. I'm concatenating some columns though and ignoring others.

More clarification: Local DB MyISAM table

+1  A: 

You should really use LOAD DATA on the MySQL console itself for this and not work through the code...

LOAD DATA INFILE 'data.txt' INTO TABLE db2.my_table;

If you need to manipulate the data, I would still recommend manipulating it in memory, rewriting it to a flat file, and pushing it into the database using LOAD DATA; I think it should be more efficient.
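
For illustration, a rough sketch of that transform-then-LOAD DATA flow from Java could look like the following (the file names, the transform() helper, the table name and the connection settings are all placeholders, not anything from the question):

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class TransformThenLoad {
        public static void main(String[] args) throws Exception {
            // 1. Rewrite the raw CSV into a flat file laid out like the target table.
            try (BufferedReader in = new BufferedReader(new FileReader("raw.csv"));
                 BufferedWriter out = new BufferedWriter(new FileWriter("clean.csv"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    out.write(transform(line)); // concatenate/drop columns here
                    out.newLine();
                }
            }

            // 2. Let MySQL bulk-load the cleaned file in one statement.
            try (Connection con = DriverManager.getConnection(
                     "jdbc:mysql://localhost/mydb?allowLoadLocalInfile=true", "user", "pass");
                 Statement st = con.createStatement()) {
                st.execute("LOAD DATA LOCAL INFILE 'clean.csv' INTO TABLE my_table "
                         + "FIELDS TERMINATED BY ','");
            }
        }

        // Placeholder for whatever column mangling the real job needs.
        private static String transform(String csvLine) {
            return csvLine;
        }
    }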

Roee Adler
-1 he said he needed to manipulate the data before putting it into the DB
Hardwareguy
@Hardwareguy: please see the change I added (before I saw your comment :)
Roee Adler
I'll remove my minus one but I still don't think that's the best way.
Hardwareguy
A: 

Wouldn't it be faster if you used LOAD DATA INFILE instead of inserting each row?

Pierre
-1. Quote from question: "The reason I'm not piping it straight in from the file into the db, is the data has to be manipulated before it adds it to the database"
PatrikAkerstrand
The way I see it: manipulate your data, save it to a temporary file, call "load data infile", then delete the temporary file.
Pierre
+6  A: 

Tips for fast insertion:

  • Use the LOAD DATA INFILE syntax to let MySQL parse the file and insert it, even if you have to mangle the data and feed it to LOAD DATA after the manipulation.
  • Use this insert syntax:

    insert into table (col1, col2) values (val1, val2), (val3, val4), ...

  • Remove all keys/indexes prior to insertion (a small JDBC sketch of this, and of the multi-row syntax, follows this list).

  • Do it on the fastest machine you've got (IO-wise mainly, but RAM and CPU also matter). This goes for both the DB server and the inserting client; remember you'll be paying the IO price twice (once reading, once inserting).
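
As a rough sketch of the multi-row syntax and the key-disabling idea (table and column names are made up, and the Connection is assumed to already exist):

    import java.sql.Connection;
    import java.sql.Statement;

    // Illustrative only: my_table/col1/col2 are placeholder names.
    public class BulkInsertSketch {
        static void bulkInsert(Connection con) throws Exception {
            try (Statement st = con.createStatement()) {
                // On MyISAM, DISABLE KEYS defers non-unique index maintenance
                // until ENABLE KEYS runs, which speeds up mass insertion.
                st.execute("ALTER TABLE my_table DISABLE KEYS");

                // Multi-row VALUES syntax: one statement carries many rows.
                st.executeUpdate("INSERT INTO my_table (col1, col2) VALUES "
                               + "('a', 'b'), ('c', 'd'), ('e', 'f')");

                st.execute("ALTER TABLE my_table ENABLE KEYS");
            }
        }
    }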
Vinko Vrsalovic
Having the file on the server is by far the fastest, but if you don't have that kind of access, you can still use LOAD DATA LOCAL INFILE. Just make sure to use a compressed connection if it's a big file.
Wouter van Nifterick
Thanks. In the end I actually loaded the data in directly as it was, using LOAD DATA LOCAL INFILE. I then wrote a series of somewhat complex SQL queries to create another temp table in the format I wanted. Total time is now down to 30 seconds for 1.8 million records. Not bad compared to the original developer's 10-hour process. This was all done in MySQL, no Java required at all.
Derek Organ
+2  A: 

I'd probably pick a large number, like 10k rows, and load that many rows from the CSV, massage the data, and do a batch update, then repeat until you've gone through the entire csv. Depending on the massaging and the amount of data, 1.8 mil rows shouldn't take 10 hours; more like 1-2 hours on reasonable hardware.

Edit: whoops, I left out a fairly important part: your connection has to have autocommit set to false. The code I copied this from was doing that as part of the GetConnection() method.

    Connection con = GetConnection();
    con.setAutoCommit(false);
    try {
        PreparedStatement ps = con.prepareStatement(
                "INSERT INTO table(col1, col2) VALUES(?, ?)");
        try {
            for (Data d : massagedData) {
                ps.setString(1, d.whatever());
                ps.setString(2, d.whatever2());
                ps.addBatch();
            }
            ps.executeBatch();
            con.commit(); // commit the batch; with autocommit off nothing is saved until this
        } finally {
            ps.close();
        }
    } finally {
        con.close();
    }
Hardwareguy
1-2 hours is still slow as hell. LOAD DATA INFILE finishes in a matter of seconds if it's in the same format as the table, especially if the .csv file already resides on the server. Try it, it's blazingly fast. You usually don't want an incomplete dataset in your database, so you're going to have to use a transaction and lock tables. I personally don't know any production servers where locking tables for 1-2 hours is acceptable.
Wouter van Nifterick
He said 1.8 mil rows. Also this is a temp table so it's not going to lock any other tables.
Hardwareguy
+1  A: 

Another idea: do you use a PreparedStatement for inserting your data with JDBC?

Pierre
PreparedStatements with addBatch are the way I always do this.
Hardwareguy
Interesting, does this offer any performance improvement? Currently using ...opencsv.CSVReader to parse the file, then doing an insert on each line. I'm concatenating some columns though and ignoring others.
Derek Organ
There is a lot of overhead in simply making a connection to the database. You'll see a huge speedup by batching inserts.
Hardwareguy
A: 

I would run three threads...

1) Reads the input file and pushes each row into a transformation queue
2) Pops from the queue, transforms the data, and pushes into a db queue
3) Pops from the db queue and inserts the data

In this manner, you can be reading data from disk while the DB thread is waiting for its IO to complete, and vice-versa.
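
A rough outline of that three-thread pipeline, using BlockingQueues as the hand-off points (the CSV path, the transform() logic and the insert() body are placeholders for the real massaging and JDBC work):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class ThreeStagePipeline {
        private static final String EOF = "\u0000EOF"; // poison pill marking end of input

        public static void main(String[] args) throws Exception {
            BlockingQueue<String> rawRows = new ArrayBlockingQueue<>(10_000);
            BlockingQueue<String> dbRows  = new ArrayBlockingQueue<>(10_000);

            // 1) read the file and push raw rows
            Thread reader = new Thread(() -> {
                try (BufferedReader in = new BufferedReader(new FileReader("data.csv"))) {
                    String line;
                    while ((line = in.readLine()) != null) rawRows.put(line);
                    rawRows.put(EOF);
                } catch (Exception e) { throw new RuntimeException(e); }
            });

            // 2) transform rows and push them to the db queue
            Thread transformer = new Thread(() -> {
                try {
                    String line;
                    while (!(line = rawRows.take()).equals(EOF)) dbRows.put(transform(line));
                    dbRows.put(EOF);
                } catch (Exception e) { throw new RuntimeException(e); }
            });

            // 3) insert transformed rows
            Thread writer = new Thread(() -> {
                try {
                    String row;
                    while (!(row = dbRows.take()).equals(EOF)) insert(row);
                } catch (Exception e) { throw new RuntimeException(e); }
            });

            reader.start(); transformer.start(); writer.start();
            reader.join(); transformer.join(); writer.join();
        }

        private static String transform(String csvLine) { return csvLine; } // massage columns here
        private static void insert(String row) { /* batched JDBC insert here */ }
    }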

Reed
That sounds good in theory, but thread 3 is where 95% of the work is going to happen so really you're not going to gain much by parallelizing the thread 1 and 2 tasks.
Hardwareguy
Of course that depends on the transformations. In my experience that can involve many database lookups to validate the fields. If the source file is on a different disk than the database files there should still be some performance increase. If they have to be on the same disk, I would definitely bulk things into 1000 rows or more to reduce head seeks.
Reed
A: 

If you're not already, try using the MyISAM table type; just be sure to read up on its shortcomings before you do. It is generally faster than the other types of tables.

If your table has indexes, it is usually faster to drop them and then add them back after the import.

If your data is all strings but would be better suited to a relational structure, you'll be better off inserting integers that reference the other values rather than storing long strings.

But in general, yes adding data to a database takes time.

Greg Miller
+1  A: 

Depending on what exactly you need to do with the data prior to inserting it, your best options in terms of speed are:

  • Parse the file in java / do what you need with the data / write the "massaged" data out to a new CSV file / use "load data infile" on that.
  • If your data manipulation is conditional (e.g. you need to check for record existence and do different things based on whether it's an insert or an update, etc...) then (1) may be impossible. In which case you're best off doing batch inserts / updates.
    Experiment to find the batch size that works best for you (starting with about 500-1000 should be ok). Depending on the storage engine you're using for your table, you may need to split this into multiple transactions as well - having a single one span 1.8M rows ain't going to do wonders for performance (a sketch of chunked commits follows below).
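
A sketch of the chunked-commit idea (the batch size, SQL, table/column names and the row source are illustrative only):

    import java.sql.Connection;
    import java.sql.PreparedStatement;

    // Illustrative only: commits every batchSize rows instead of one huge transaction.
    public class ChunkedInsertSketch {
        static void insertInChunks(Connection con, Iterable<String[]> rows) throws Exception {
            int batchSize = 1000;
            con.setAutoCommit(false);
            try (PreparedStatement ps =
                     con.prepareStatement("INSERT INTO my_table (col1, col2) VALUES (?, ?)")) {
                int count = 0;
                for (String[] row : rows) {
                    ps.setString(1, row[0]);
                    ps.setString(2, row[1]);
                    ps.addBatch();
                    if (++count % batchSize == 0) { // flush and commit every batchSize rows
                        ps.executeBatch();
                        con.commit();
                    }
                }
                ps.executeBatch(); // whatever is left over
                con.commit();
            }
        }
    }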

ChssPly76
A: 

Your biggest performance problem is most likely not java but mysql, in particular any indexes, constraints, and foreign keys you have on the table you are inserting into. Before you begin your inserts, make sure you disable them. Re-enabling them at the end will take a considerable amount of time, but it is far more efficient than having the database evaluate them after each statement.

You may also be seeing mysql performance problems due to the size of your transaction. Your transaction log will grow very large with that many inserts, so performing a commit after X number of inserts (say 10,000-100,000) will help insert speed as well.

From the jdbc layer, make sure you are using the addBatch() and executeBatch() commands on your PreparedStatement rather than the normal executeUpdate().

Nathan Voxland
+2  A: 

Are you absolutely CERTAIN you have disabled auto commits in the JDBC driver?

This is the typical performance killer for JDBC clients.

Thorbjørn Ravn Andersen
I'm not... I'll check this out.. thanks.
Derek Organ
A: 

This is an interesting read: http://dev.mysql.com/doc/refman/5.1/en/insert-speed.html

Wouter van Nifterick
A: 

You can improve bulk INSERT performance from MySQL / Java by using the batching capability in its Connector/J JDBC driver.

MySQL doesn't "properly" handle batches (see my article link, bottom), but the driver can rewrite INSERTs to make use of quirky MySQL syntax, e.g. you can tell the driver to rewrite these two INSERTs:

    INSERT INTO my_table (val1, val2) VALUES ('val1', 'val2');
    INSERT INTO my_table (val1, val2) VALUES ('val3', 'val4');

as a single statement:

    INSERT INTO my_table (val1, val2) VALUES ('val1', 'val2'), ('val3', 'val4');
    

We did this for a bulk insert investigation of our own: it made an order of magnitude difference. Use it with explicit transactions, as mentioned by others, and you'll see a big improvement overall.

The relevant driver property setting is:

    jdbc:mysql:///<dbname>?rewriteBatchedStatements=true
    

See: A 10x Performance Increase for Batch INSERTs With MySQL Connector/J Is On The Way
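
A small sketch of that property in use (the connection URL, credentials, table and column names are placeholders; the property name is the one given above):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    // Illustrative only: with rewriteBatchedStatements=true, Connector/J can fold
    // the batched INSERTs below into multi-row INSERTs on the wire.
    public class RewriteBatchSketch {
        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection(
                     "jdbc:mysql://localhost/mydb?rewriteBatchedStatements=true", "user", "pass");
                 PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO my_table (col1, col2) VALUES (?, ?)")) {
                con.setAutoCommit(false);
                for (int i = 0; i < 1000; i++) {
                    ps.setString(1, "a" + i);
                    ps.setString(2, "b" + i);
                    ps.addBatch();
                }
                ps.executeBatch();
                con.commit();
            }
        }
    }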

Brian