views: 1501
answers: 4
Hi!

Since I kicked off the process of inserting 7M rows from one table into two others, I've been wondering whether there's a faster way to do this. The process is expected to finish in about an hour, which will make roughly 24 hours of processing in total.

Here's how it goes:

The data from this table

RAW (word VARCHAR2(4000), doc VARCHAR2(4000), count NUMBER);

should find a new home in two other cluster tables T1 and T2

CREATE CLUSTER C1 (word VARCHAR2(4000)) SIZE 200 HASHKEYS 10000000;
CREATE CLUSTER C2 (doc VARCHAR2(4000)) SIZE 200 HASHKEYS 10000000;

T1 (word VARCHAR2(4000), doc VARCHAR2(4000), count NUMBER) CLUSTER C1(word);
T2 (doc VARCHAR2(4000), word VARCHAR2(4000), count NUMBER) CLUSTER C2(doc);

through Java inserts with manual commit like this

stmtT1 = conn.prepareStatement("insert into T1 values(?,?,?)");
stmtT2 = conn.prepareStatement("insert into T2 values(?,?,?)");

rs = stmt.executeQuery("select word, doc, count from RAW");

conn.setAutoCommit(false);

while (rs.next()) {
    word = rs.getString(1);
    doc = rs.getString(2);
    count = rs.getInt(3);

    if (commitCount++==10000) { conn.commit(); commitCount=0; }

    stmtT1.setString(1, word);
    stmtT1.setString(2, doc);
    stmtT1.setInt(3, count);

    stmtT2.setString(1, doc);
    stmtT2.setString(2, word);
    stmtT2.setInt(3,count);

    stmtT1.execute();
    stmtT2.execute();
}

conn.commit();

Any ideas?

+3  A: 

The first thing I'd recommend is to do a simple insert-select statement, and let the database handle all the data movement. Not so useful if you're moving data between two machines, or if you don't have rollback segments large enough to handle the entire query.
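For the tables in the question, a sketch of what that could look like (two statements, one per target table; note the select list is reordered to match T2's column order):

```sql
INSERT INTO t1 (word, doc, count)
SELECT word, doc, count FROM raw;

INSERT INTO t2 (doc, word, count)
SELECT doc, word, count FROM raw;

COMMIT;
```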

The second thing I'd recommend is to learn about the addBatch() method. As your code is written, it makes two round-trips to the database for every row you insert, which adds network overhead.
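A minimal sketch of the batched version of the loop in the question, assuming the same `conn`, `stmtT1`, `stmtT2`, and `rs` (this is a fragment, not a standalone program):

```java
// Queue rows client-side with addBatch(); flush every 10,000 rows so each
// executeBatch() is one round-trip for the whole batch instead of one per row.
int batchCount = 0;
while (rs.next()) {
    String word = rs.getString(1);
    String doc  = rs.getString(2);
    int count   = rs.getInt(3);

    stmtT1.setString(1, word);
    stmtT1.setString(2, doc);
    stmtT1.setInt(3, count);
    stmtT1.addBatch();

    stmtT2.setString(1, doc);
    stmtT2.setString(2, word);
    stmtT2.setInt(3, count);
    stmtT2.addBatch();

    if (++batchCount == 10000) {
        stmtT1.executeBatch();
        stmtT2.executeBatch();
        conn.commit();
        batchCount = 0;
    }
}
// Flush any remaining rows that didn't fill a full batch.
stmtT1.executeBatch();
stmtT2.executeBatch();
conn.commit();
```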

Third, unless the destination tables already hold lots of rows, drop any indexes before your inserts and recreate them afterward. If you leave the indexes in place, they have to be updated for every row, adding to the dirty-block overhead.

And finally: do you need clustered tables? My experience has been that they don't buy you a lot (caveat: that experience was on a single tablespace).

kdgregory
hey thanks for your advice. I'm using Java to monitor the progress of the data transfer (how many rows are left), which a plain INSERT AS SELECT doesn't tell me. I always access the tables the same way, like select * from T1 where word='foo'; I think hash tables are best for that.
chris
Based on past experience with hash-indexed databases (Teradata), I would turn to hash clusters only if you're joining multiple tables on the same key -- and as noted, I didn't see much benefit when I tried that. For your query, a normal B-tree index is probably best.
kdgregory
If you use INSERT INTO .. SELECT FROM syntax, you can query v$session_longops to see how far it has got.
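A sketch of that monitoring query against V$SESSION_LONGOPS, showing only operations still in progress:

```sql
SELECT opname, sofar, totalwork,
       ROUND(100 * sofar / totalwork, 1) AS pct_done
FROM   v$session_longops
WHERE  totalwork > 0
AND    sofar < totalwork;
```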
WW
+1 except that hash clustering used properly is a real saver. One logical read to find all the related records from multiple tables can be an order of magnitude better than the alternatives.
David Aldridge
A: 

Hi,

Unless you have some special reason to handle the data in the application, I would go for a direct INSERT AS SELECT. Using parallel DML can make a tremendous difference.

Also check the INSERT ALL syntax (one read for two writes) if that fits your needs.
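A sketch combining the two ideas, assuming the tables from the question (parallel DML has to be enabled at the session level first, and whether the multitable insert actually runs in parallel depends on your Oracle version and configuration):

```sql
ALTER SESSION ENABLE PARALLEL DML;

INSERT /*+ PARALLEL */ ALL
  INTO t1 (word, doc, count) VALUES (word, doc, count)
  INTO t2 (doc, word, count) VALUES (doc, word, count)
SELECT word, doc, count FROM raw;

COMMIT;
```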

Unless you have I/O problems, an hour should be more than enough...

Regards

+1  A: 

Well, you can't call a table RAW in Oracle -- it's a reserved word so an ORA-00903 error will be raised.

That aside, you would use:

insert all
  into t1 (word, doc, count) values (word, doc, count)
  into t2 (doc, word, count) values (doc, word, count)
select word, doc, count from RAW
/

"Row-by-row equals slow-by-slow" :)

David Aldridge
A: 

Conceptually similar to addBatch(), you could write a PL/SQL procedure that accepts arrays of (word, doc, count) and performs the inserts on the server side. Like batching, it cuts network round-trips by sending many records in one shot, so you may see faster performance. On the other hand, it is more complicated and brittle: it requires writing PL/SQL on the server and extra array-handling logic on the client. Oracle TechNet has a few examples of this.
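A minimal sketch of the server side, using associative arrays and FORALL; the package name and types here are illustrative, not from the original post:

```sql
CREATE OR REPLACE PACKAGE bulk_load AS
  TYPE vc_tab  IS TABLE OF VARCHAR2(4000) INDEX BY PLS_INTEGER;
  TYPE num_tab IS TABLE OF NUMBER         INDEX BY PLS_INTEGER;

  PROCEDURE insert_rows(p_words vc_tab, p_docs vc_tab, p_counts num_tab);
END bulk_load;
/

CREATE OR REPLACE PACKAGE BODY bulk_load AS
  PROCEDURE insert_rows(p_words vc_tab, p_docs vc_tab, p_counts num_tab) IS
  BEGIN
    -- FORALL hands the whole array to the SQL engine in one context switch
    FORALL i IN 1 .. p_words.COUNT
      INSERT INTO t1 (word, doc, count)
      VALUES (p_words(i), p_docs(i), p_counts(i));

    FORALL i IN 1 .. p_words.COUNT
      INSERT INTO t2 (doc, word, count)
      VALUES (p_docs(i), p_words(i), p_counts(i));
  END insert_rows;
END bulk_load;
/
```

The client would then bind arrays of a few thousand rows per call; with the Oracle JDBC driver that means vendor-specific array binding rather than plain java.sql calls.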

Nicholas