views: 71
answers: 4

A couple of years ago I wrote a small utility to move data from an Oracle db to a Postgres db. I used Java and JDBC to accomplish this because I wanted Java to handle the formatting of the data used in a prepared statement to do the insert. The original version of the utility assumed that the table names and column names were the same in both databases. Later versions accepted a mapping file to handle name differences. This utility was a big hit in my organization but, unfortunately, it does not scale. It maxes out at about a million rows moved per hour. We now have tables with 30+ million rows and nobody is willing to wait 30 hours for their data to transfer.

The method below is the heart of the utility and the reason it does not scale. This method is executed once for each column of data, so it gets called (num_rows * num_cols) times. With a profiler I see that this method consumes 58% of the execution time. The getObject() and findColumn() calls alone account for 53% of execution time!

    public void setPlaceholderValue(int placeHolderNum, ResultSet rs, String oracleColumnName, PreparedStatement stmt) throws Exception {
        int columnIndex = rs.findColumn(oracleColumnName);
        int columnType = rs.getMetaData().getColumnType(columnIndex);

        try {
            if (rs.getObject(oracleColumnName) != null) {
                switch (columnType) {
                    case Types.VARCHAR:   stmt.setString(placeHolderNum, rs.getString(columnIndex)); break;
                    case Types.INTEGER:   stmt.setInt(placeHolderNum, rs.getInt(columnIndex)); break;
                    case Types.DATE:      stmt.setDate(placeHolderNum, rs.getDate(columnIndex)); break;
                    case Types.FLOAT:     stmt.setFloat(placeHolderNum, rs.getFloat(columnIndex)); break;
                    case Types.NUMERIC:   stmt.setBigDecimal(placeHolderNum, rs.getBigDecimal(columnIndex)); break;
                    case Types.TIMESTAMP: stmt.setTimestamp(placeHolderNum, rs.getTimestamp(columnIndex)); break;
                    default: throw new SQLException("The result set column type " + rs.getMetaData().getColumnType(columnIndex) + " was not recognized. See the java.sql.Types class at http://java.sun.com/j2se/1.5.0/docs/api/");
                }
            } else {
                stmt.setNull(placeHolderNum, columnType);
            }
        } catch (SQLException e) {
            System.out.println("SQLException: " + e.getMessage() + " for record id=" + rs.getLong("id"));
            throw new SQLException("rethrow");
        }
    }

I'm not sure I can refactor this method to bring the transfer time down sufficiently. I think the column-by-column approach simply does not scale.

Can anyone suggest a better way of doing this? Language is not an issue; I can do it with anything that can handle the job. Ideally, I would like to see a transfer rate of at least 10 million records per hour.

+4  A: 

I'd suggest using the DB-provided export/import tools. Oracle and PostgreSQL both support XML and CSV formats.

If you'd like to stick to JDBC, then place the columns in the SELECT query of the ResultSet in the same order as the values in the INSERT query of the PreparedStatement and just do the following instead of that whole if/switch block:

    preparedStatement.setObject(index, resultSet.getObject(index));

However, I don't expect that this will greatly improve the performance. The DB-provided export/import facilities can do it much more efficiently than one can ever do in Java.
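For illustration (this snippet is not from the answer itself), a minimal sketch of what that per-row loop might look like, assuming the SELECT column order already matches the INSERT placeholder order; the batching calls are an extra optimization, not something the answer mentions:

    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class RowCopier {

        // Copy every row by index with setObject() and let the JDBC drivers do the
        // type mapping. Assumes the SELECT column order matches the INSERT
        // placeholder order.
        public static void copyRows(ResultSet rs, PreparedStatement insert) throws SQLException {
            int columnCount = rs.getMetaData().getColumnCount();
            final int batchSize = 1000;        // illustrative value
            int pending = 0;
            while (rs.next()) {
                for (int i = 1; i <= columnCount; i++) {
                    insert.setObject(i, rs.getObject(i));
                }
                insert.addBatch();             // queue instead of executing row by row
                if (++pending % batchSize == 0) {
                    insert.executeBatch();
                }
            }
            insert.executeBatch();             // flush any remaining rows
        }
    }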

BalusC
Seconded. The best way to accomplish this (IMHO) is to write queries against the Oracle tables that map the data to the PostgreSQL tables and run those to create the CSV files. Then, use the PostgreSQL COPY command to import the data. 10M rows per hour is nothing at that point--I routinely import 28M into a single table, build indexes, and derive two aggregate tables all in under an hour in PostgreSQL.
Matthew Wood
I shall have to investigate this.
Tony Ennis
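As a side note on the COPY suggestion above: the PostgreSQL JDBC driver also exposes COPY programmatically through its CopyManager API, so rows can be streamed from the Oracle result set straight into COPY without writing intermediate CSV files. A rough sketch, assuming the org.postgresql driver is available; CSV quoting, escaping and NULL handling are deliberately simplified:

    import java.io.StringReader;
    import java.sql.Connection;
    import java.sql.ResultSet;

    import org.postgresql.PGConnection;
    import org.postgresql.copy.CopyManager;

    public class CopyStreamer {

        // Stream rows from the source ResultSet into PostgreSQL's COPY command in
        // chunks. Quoting/escaping is naive; real CSV handling needs more care.
        public static void stream(ResultSet rs, Connection pgConnection, String targetTable) throws Exception {
            CopyManager copyManager = pgConnection.unwrap(PGConnection.class).getCopyAPI();
            String copySql = "COPY " + targetTable + " FROM STDIN WITH CSV";
            int columnCount = rs.getMetaData().getColumnCount();
            StringBuilder buffer = new StringBuilder();
            int rowsBuffered = 0;
            while (rs.next()) {
                for (int i = 1; i <= columnCount; i++) {
                    if (i > 1) {
                        buffer.append(',');
                    }
                    Object value = rs.getObject(i);
                    buffer.append(value == null ? "" : value.toString());  // no quoting/escaping
                }
                buffer.append('\n');
                if (++rowsBuffered % 50000 == 0) {      // flush in chunks to bound memory use
                    copyManager.copyIn(copySql, new StringReader(buffer.toString()));
                    buffer.setLength(0);
                }
            }
            if (buffer.length() > 0) {
                copyManager.copyIn(copySql, new StringReader(buffer.toString()));
            }
        }
    }
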
A: 

You might try creating a class of some sort (perhaps an array of some variety) to hold information on the result set columns and types, which are invariant while processing a given result set. Then pull the values out of that structure rather than calling findColumn and getColumnType each time you need them. This should cut the number of those calls down greatly and help improve your runtime.

Good luck.

Bob Jarvis
+1  A: 

This line is likely to be problematic:

    if (rs.getObject(oracleColumnName) != null) {

Instead use:

    if (rs.getObject(columnIndex) != null) {

In Oracle, getObject(String) is O(n) - at least on 10g it was. This method looks like it is called for each column of every row in the result set. You shouldn't fetch the metadata on every call. Move all of the calls that relate to the metadata so they happen once per query, and pass the results through to this method as you move through the result set.

Michael Barker
A: 

The information you retrieve is constant for the whole handling of that table.

In order to execute this only once per table, you could create a data holder containing members for the column name, column index, column type and a hasColumn flag. Pass an instance of this data object instead of the columnName, initialise it the first time, and reuse the data for the rest of the table. Caching data like this will save you 2 * num_rows * num_cols calls retrieving metadata.
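
For illustration, a rough sketch of what such a data holder might look like (class and member names are just placeholders; the hasColumn flag is omitted for brevity):

    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;
    import java.sql.SQLException;

    public class ColumnInfo {

        final String columnName;
        final int columnIndex;
        final int columnType;

        private ColumnInfo(String columnName, int columnIndex, int columnType) {
            this.columnName = columnName;
            this.columnIndex = columnIndex;
            this.columnType = columnType;
        }

        // Build the per-table metadata once, before iterating over the rows.
        public static ColumnInfo[] fromResultSet(ResultSet rs) throws SQLException {
            ResultSetMetaData meta = rs.getMetaData();
            ColumnInfo[] columns = new ColumnInfo[meta.getColumnCount()];
            for (int i = 1; i <= columns.length; i++) {
                columns[i - 1] = new ColumnInfo(meta.getColumnName(i), i, meta.getColumnType(i));
            }
            return columns;
        }
    }

setPlaceholderValue could then take a ColumnInfo instead of the column name and use its cached index and type directly, so findColumn(), getMetaData() and getObject(String) drop out of the per-row path entirely.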

rsp