views: 812
answers: 6

Scenario:

  • I load some data into a local MySQL database each day, about 2 million rows;
  • I have to (have to - it's an audit/regulatory thing) move to a "properly" administered server, which currently looks to be Oracle 10g;
  • The server is in a different country: a network round-trip currently takes 60-70 ms;
  • Input is a CSV file in a denormalised form: I normalise the data before loading; each line typically results in 3-8 INSERTs across up to 4 tables;
  • The load script is currently implemented in Ruby, using ActiveRecord and fastercsv. I've tried the ar-extensions gem, but it assumes that the MySQL-style multi-row VALUES clause will work on Oracle. It doesn't.

EDIT: Extremely useful answers already - thank you! More about that pesky input file: the number of fields is variable and the positions have changed a few times - my current script determines content by analysing the header row (well, fastercsv and a cunning converter do it). So a straight upload followed by post-processing in SQL wouldn't work without several versions of the load file, which is horrible. Also, it's a German CSV file: semicolon-delimited (no big deal) and with decimals indicated by commas (a rather bigger deal, unless we load as VARCHAR and text-process afterwards - ugh).
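
If we did end up loading the decimal fields as VARCHAR, the conversion could at least be pushed into SQL afterwards rather than done by text-processing in the script. A sketch, with made-up table and column names:

    -- raw_amount holds the German-formatted text, e.g. '123,45'
    SELECT TO_NUMBER(raw_amount,
                     '999999999D99',
                     'NLS_NUMERIC_CHARACTERS='',.''') AS amount
    FROM   some_load_table;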

The problem:

Loading 2 million rows at about 7/sec works out at roughly 80 hours - rather more than the 24 available to a daily process! Not to mention that the users would rather like to be able to access the data about 5 hours after it becomes available in CSV form!

I looked at applying multiple inserts per network trip: the rather ungainly INSERT ALL... syntax would be fine, except that at present I'm assigning a unique id to each row from a sequence. It transpires that

INSERT ALL
    INTO tablea (id,b,c) VALUES (tablea_seq.nextval,1,2)
    INTO tablea (id,b,c) VALUES (tablea_seq.nextval,3,4)
    INTO tablea (id,b,c) VALUES (tablea_seq.nextval,5,6)
SELECT 1 FROM dual;

(did I say it was ungainly?) tries to use the same id for all three rows. The Oracle docs appear to confirm this.

Latest attempt is to send multiple INSERTs in one execution, e.g.:

    INSERT INTO tablea (id,b,c) VALUES (tablea_seq.nextval,1,2);
    INSERT INTO tablea (id,b,c) VALUES (tablea_seq.nextval,3,4);
    INSERT INTO tablea (id,b,c) VALUES (tablea_seq.nextval,5,6);

I haven't found a way to persuade Oracle to accept that.

The Question(s)

  • Have I missed something obvious? (I'd be so pleased if that turned out to be the case!)
  • If I can't send multiple inserts, what else could I try?

Why Accept That One?

For whatever reason, I prefer to keep my code as free from platform-specific constructs as possible: one reason this problem arose is that I'm migrating from MySQL to Oracle; it's possible another move could occur one day for geographical reasons, and I can't be certain about the platform. So getting my database library to the point where it can use a text SQL command to achieve reasonable scaling was attractive, and the PL/SQL block accomplishes that. Now if another platform does appear, the change will be limited to changing the adapter in code: a one-liner, in all probability.

+10  A: 

How about shipping the CSV file to the Oracle database server, using SQL*Loader to load it into a staging table, and then running a stored procedure to transform the data and INSERT it into the final tables?
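
For illustration, the staging table in that scheme could be as simple as one VARCHAR2 column per CSV field, with all conversion left to the database side (names here are invented):

    -- Hypothetical staging table: everything loaded as text,
    -- conversion and normalisation happen later, inside the database.
    CREATE TABLE stg_daily_load (
      load_date    VARCHAR2(20),
      customer_ref VARCHAR2(50),
      amount       VARCHAR2(30)
      -- ... one column per field in the widest version of the file
    );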

Nuno G
This is probably the fastest way to do it. You might not need to copy the file to the Oracle server either; I think you can run SQL*Loader locally.
Mr. Shiny and New
Yes, you would probably choose to run it locally on the client. Very fast indeed.
David Aldridge
It's something I'm considering: do the extract and transform stages to make something that sqlldr can digest (because the input file format is, well, tricky).
Mike Woodhouse
SQL*Loader has some pretty good data manipulation/transformation tools built-in, as well as the ability to handle variable record lengths and number of fields.
DCookie
How about handling 123,456 (German for 123.456)?
Mike Woodhouse
You can specify transformation functions in the SQL*Loader control file, like "REPLACE(:field1,',','.')".
Nuno G
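
A minimal control-file sketch along those lines, assuming the hypothetical staging table above and an invented file name - the semicolon delimiter and the REPLACE are the relevant parts:

    -- SQL*Loader control file (sketch)
    LOAD DATA
    INFILE 'daily_load.csv'
    APPEND
    INTO TABLE stg_daily_load
    FIELDS TERMINATED BY ';' OPTIONALLY ENCLOSED BY '"'
    TRAILING NULLCOLS
    (
      load_date,
      customer_ref,
      amount  "REPLACE(:amount, ',', '.')"
    )
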
+4  A: 

You could use:

insert into tablea (id,b,c)
 ( select tablea_seq.nextval,1,2 from dual union all
   select tablea_seq.nextval,3,4 from dual union all
   select tablea_seq.nextval,5,6 from dual union all
   select tablea_seq.nextval,7,8 from dual union all
   ...
 )

This works up to about 1024 lines, if I remember correctly.

You could also send it as a PL/SQL batch instruction:

BEGIN
 INSERT INTO tablea (id,b,c) VALUES (tablea_seq.nextval,1,2);
 INSERT INTO tablea (id,b,c) VALUES (tablea_seq.nextval,3,4);
 INSERT INTO tablea (id,b,c) VALUES (tablea_seq.nextval,5,6);
 ...
 COMMIT;
END;
Kosi2801
I usually use the SQL Batch insert mentioned here, adding a call to 'ALTER SESSION ENABLE PARALLEL DML' first.
Andrew Barnett
Hate the first one, but the second... Oh-ho. Please hold, while I go hack my code... That works! I wonder if it scales. Hold again please... Hmm. 100 works, 200 doesn't. Limit on string size?
Mike Woodhouse
I haven't hit a size limit on PL/SQL so far, but to be honest I think my scripts weren't anywhere near as big as yours would grow. Nevertheless, if you cut down round trips by roughly a factor of 100 (100 inserts at once instead of just 1) you should save lots of time.
Kosi2801
There's a limit of 64,000 "DIANA nodes", which relates to the parsed representation of the PL/SQL (DIANA comes from Ada tooling) - a bit like an AST, I think. For a small (5-column) table I can go to about 7,500 inserts, or about 850KB of SQL. (The 200-row thing was my bug!)
Mike Woodhouse
+2  A: 

I would load the raw CSV file into a dedicated table in the database using SQL*Loader, without normalising it, then run code against the data in that table to normalise it out into the various required tables. This minimises the round trips and is a pretty conventional approach in the Oracle community.

SQL*Loader can be a little challenging in terms of the initial learning curve, but you can always post again if you get stuck.

http://download.oracle.com/docs/cd/B19306_01/server.102/b14215/part_ldr.htm#i436326
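
A sketch of that normalisation step, re-using the tablea and tablea_seq names from the question and a hypothetical staging table - one INSERT ... SELECT per target table, so the whole load is a handful of round trips:

    -- seq.nextval is allowed in the SELECT list of a plain INSERT ... SELECT,
    -- so each staged row gets its own id (column mappings are invented).
    INSERT INTO tablea (id, b, c)
    SELECT tablea_seq.nextval, s.customer_ref, s.amount
    FROM   stg_daily_load s;

    -- ...same pattern for the other target tables, then:
    COMMIT;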

David Aldridge
Working without normalisation is going to be hard: the file has variable record sizes, a variable number of fields, and field positions determined by the column headers. And it's German in origin, so decimal points are commas. But a pre-load transform is definitely an option, although it's not without pain.
Mike Woodhouse
SQL*Loader has some powerful built-in data manipulation and transformation tools. It can handle variable record sizes and numbers of columns as long as you can key off a value in the record.
DCookie
If you have already written the code to parse the file, then spitting it out into 4 CSV files and invoking SQL*Loader would probably be the most efficient approach.
David Aldridge
A: 

I have a quick suggestion. I'm from the MySQL world but was trained on Oracle; I think this will work.

In MySQL you can insert multiple records with a single insert statement. It looks like this:

INSERT INTO table_name (column_one, column_two, column_three, column_four)
VALUES
    ('a', 'one', 'alpha', 'uno'),        -- Row 1
    ('b', 'two', 'beta', 'dos'),         -- Row 2
    ('c', 'three', 'gamma', 'tres'),     -- etc.
    ...
    ('z', 'twenty-six', 'omega', 'veintiséis');

Now obviously you can only insert into one table at a time, and you wouldn't want to do all 2 million records in one statement, but you could easily do 10 or 20 or 100 at a time (if you are allowed packets that big). You may have to generate this by hand; I don't know if the frameworks you are using (or any, for that matter) support making this kind of code for you.

In the MySQL world this DRAMATICALLY speeds up inserts. I assume it does all the index updates and such at the same time, but it also prevents it from having to re-parse the SQL on each insert.

If you combine this with prepared statements (so the SQL is cached and doesn't have to be parsed each time) and transactions (to make sure things are always in a sane state when you have to do inserts across multiple tables), I think you will be doing pretty well.


NunoG is right that you can load the CSVs directly into Oracle. You may be best off reading in the input file, generating a normalized set of CSV files (one for each table), and then loading in each of those one at a time.

MBCook
Ah, if only! I'm actually migrating from MySQL (long story) where this works fine - pity Oracle didn't think of it. The PL/SQL block thing kinda works, though. Thanks for the input.
Mike Woodhouse
A: 

Instead of executing the SQL over the network, you could write the INSERTs to a text file, move that over the network, and run it locally there.

Equistatic
Nice lateral thinking! I'm also looking into getting access to a workstation on site and doing the whole thing over there.
Mike Woodhouse
A: 

SQL*Loader is an Oracle-supplied utility that lets you load data from a flat file into one or more database tables. It will be 10-100 times faster than doing the inserts with individual SQL statements.

http://www.orafaq.com/wiki/SQL*Loader_FAQ

If SQL*Loader doesn't cut it, try a small preprocessor program that reformats the file into something SQL*Loader can read.

Andomar