views:

108

answers:

2

Consider to following method that read data from some data-structure (InteractionNetwork) and writes them to a table in an SQLite database using an SQLite-JDBC dirver:

private void loadAnnotations(InteractionNetwork network) throws SQLException {
    PreparedStatement insertAnnotationsQuery = 
        connection.prepareStatement(
        "INSERT INTO Annotations(GOId, ProteinId, OnthologyId) VALUES(?, ?, ?)");
    PreparedStatement getProteinIdQuery = 
        connection.prepareStatement(
        "SELECT Id FROM Proteins WHERE PrimaryUniProtKBAccessionNumber = ?");
    connection.setAutoCommit(false);
    for(common.Protein protein : network.get_protein_vector()) {
        /* Get ProteinId for the current protein from another table and
           insert the value into the prepared statement. */
        getProteinIdQuery.setString(1, protein.get_primary_id());
        ResultSet result = getProteinIdQuery.executeQuery();
        result.next();
        insertAnnotationsQuery.setLong(2, result.getLong(1));
        /* Extract all the other data and add all the tuples to the batch. */
    }
    insertAnnotationsQuery.executeBatch();
    connection.commit();
    connection.setAutoCommit(true);
}

This code works fine, the program runs in about 30 seconds and takes an average of 80m heap space. Because the code looks ugly, I want to refactor it. The first thing I did was moving the declaration of getProteinIdQuery into the loop:

private void loadAnnotations(InteractionNetwork network) throws SQLException {
    PreparedStatement insertAnnotationsQuery = 
        connection.prepareStatement(
        "INSERT INTO Annotations(GOId, ProteinId, OnthologyId) VALUES(?, ?, ?)");
    connection.setAutoCommit(false);
    for(common.Protein protein : network.get_protein_vector()) {
        /* Get ProteinId for the current protein from another table and
           insert the value into the prepared statement. */
        PreparedStatement getProteinIdQuery = // <--- moved declaration of statement here
            connection.prepareStatement(
            "SELECT Id FROM Proteins WHERE PrimaryUniProtKBAccessionNumber = ?");
        getProteinIdQuery.setString(1, protein.get_primary_id());
        ResultSet result = getProteinIdQuery.executeQuery();
        result.next();
        insertAnnotationsQuery.setLong(2, result.getLong(1));
        /* Extract all the other data and add all the tuples to the batch. */
    }
    insertAnnotationsQuery.executeBatch();
    connection.commit();
    connection.setAutoCommit(true);
}

What happens when I run the code now is that it takes about 130m heap space and takes an eternity to run. Can anyone explain this strange behavior?

+1  A: 

Hi,

I guess it is a matter of taste if the first fragment looks ugly ;-)...

However, the reason why the second code fragment takes longer (IMHO) is that now for every iteration of the for loop, a new instance of PreparedStatement (getProteinIdQuery) gets created whereas in the first fragment, you reused the prepared statement, using it the way it was meant to be: Instantiated and then supplied with proper values.

At least, that's my opinion... Jan

Jan
+1  A: 

Preparing a statement takes time, as you've found out. Whether the code is ugly or not, that decrease in speed is very ugly too so you need to use the faster form.

But what you could do is use an inner class to hold the details and provide a nicer interface:

private class DatabaseInterface {
    private PreparedStatement insertAnnotation, getProteinId;
    public DatabaseInterface() {
        // This is an inner class; 'connection' is variable in outer class
        insertAnnotation = connection.prepareStatement(
            "INSERT INTO Annotations(GOId, ProteinId, OnthologyId) VALUES(?, ?, ?)");
        getProteinId = connection.prepareStatement(
            "SELECT Id FROM Proteins WHERE PrimaryUniProtKBAccessionNumber = ?");
    }
    public long getId(Protein protein) { // Exceptions omitted...
        getProteinId.setString(1, protein.get_primary_id());
        ResultSet result = getProteinId.executeQuery();
        try {
            result.next();
            return result.getLong(1);
        } finally {
            result.close();
        }
    }
    public void insertAnnotation(int GOId, long proteinId, String ontologyId) {
        insertAnnotation.setInt(1, GOId);          // type may be wrong
        insertAnnotation.setLong(2, proteinId);
        insertAnnotation.setString(3, ontologyId); // type may be wrong
        insertAnnotation.executeUpdate();
    }
}
private void loadAnnotations(InteractionNetwork network) throws SQLException {
    connection.setAutoCommit(false);
    DatabaseInterface dbi = new DatabaseInterface();
    for(common.Protein protein : network.get_protein_vector()) {
        dbi.insertAnnotation(..., dbi.getId(protein), ...);
    }
    connection.commit();
    connection.setAutoCommit(true);
}

The aim is that you have one piece of code that knows about mangling things into SQL (and which is easy to adapt if you go to a different database) and another piece of code that knows about how to coordinate these things together.

Donal Fellows
Note also that you could think in terms of sharing the glue class more widely; it's tied to a connection, and not a particular operation.
Donal Fellows
Thank you very much, it works very well.
Space_C0wb0y