I am running queries against an Oracle 10g database with JDBC (using the latest drivers and UCP as the DataSource) in order to retrieve CLOBs (avg. 20k characters). However, the performance seems to be pretty bad: a batch retrieval of 100 LOBs takes 4 s on average. Judging from my observations, the operation is neither I/O nor CPU nor network bound.

My test setup looks like this:

    PoolDataSource dataSource = PoolDataSourceFactory.getPoolDataSource();
    dataSource.setConnectionFactoryClassName("...");
    dataSource.setConnectionPoolName("...");
    dataSource.setURL("...");
    dataSource.setUser("...");
    dataSource.setPassword("...");

    // Prefetch tuning: fetch up to 1000 rows per round trip and prefetch
    // up to 500000 bytes of LOB data along with the locator.
    dataSource.setConnectionProperty("defaultRowPrefetch", "1000");
    dataSource.setConnectionProperty("defaultLobPrefetchSize", "500000");

    final LobHandler handler = new OracleLobHandler();
    JdbcTemplate j = new JdbcTemplate(dataSource);

    j.query("SELECT bigClob FROM ...", new RowCallbackHandler() {
        public void processRow(final ResultSet rs) throws SQLException {
            // Materialize the CLOB; the value itself is discarded,
            // only the retrieval time matters for this test.
            String result = handler.getClobAsString(rs, "bigClob");
        }
    });

I experimented with the fetch sizes, but to no avail. Am I doing something wrong? Is there a way to speed up CLOB retrieval when using JDBC?

+1  A: 

Quoting the asker: "The total size of the result set is in the ten thousands - measured over the span of the whole retrieval the initial costs"

Is there an ORDER BY in the query? 10K rows is quite a lot if they have to be sorted.

Also, retrieving the PK is not a fair test versus retrieving the entire CLOB. Oracle stores table rows with probably many to a block, but each of the CLOBs (if they are > 4K) will be stored out of line, each in a series of blocks. Scanning the list of PKs is therefore going to be fast. Also, there is probably an index on the PK, so Oracle can quickly scan the index blocks and not even access the table.

4 seconds does seem a little high, but it is 2 MB that possibly needs to be read from disk and transported over the network to your Java program. The network could be an issue. If you perform an SQL trace of the session, it will show you exactly where the time is being spent (disk reads or network).
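For example, a minimal sketch of turning on such a trace from the JDBC session itself; the 10046 event syntax is standard Oracle (level 8 includes wait events), but the surrounding connection handling is assumed:

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class TraceExample {
        // Enables SQL tracing on the session behind this connection.
        // The trace file is written on the server, under user_dump_dest.
        static void enableTrace(Connection conn) throws SQLException {
            try (Statement stmt = conn.createStatement()) {
                stmt.execute("ALTER SESSION SET events "
                        + "'10046 trace name context forever, level 8'");
            }
        }
    }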

Stephen ODonnell
+1  A: 

My past experience of using Oracle LOB-type data to store large data has not been good. It is fine while the data is under 4k, since it is then stored inline like a VARCHAR2. Once it is over 4k, you start seeing performance degrade. Things may have improved since I last tried it a couple of years ago, but here is what I found back then, for your information:

Since clients need to get LOBs via the Oracle server, you may run into the following interesting situations.

  • LOB data will compete for the limited SGA cache with other data types if Oracle decides to cache it. Since CLOB data is generally big, it may push out other data.
  • LOB data gets poor disk reads if Oracle decides not to cache it and instead streams the data to the client.
  • Fragmentation is probably something you haven't encountered yet. You will see it if your application deletes LOBs and Oracle tries to reuse the space. I don't know whether Oracle supports online defragmentation of LOB storage (it does for indexes, but that took a long time when we tried it previously).

You mentioned 4 s for 100 LOBs of avg. 20k, so that is 40 ms per LOB. Remember that each LOB needs to be retrieved via a separate LOB locator (it is not in the result set by default). That is an additional round trip for each LOB, I assume (I am not 100% sure on this, since it was a while ago). If that is the case, I assume it costs at least 5 ms extra per round trip, in serial order. If so, your performance is already limited primarily by sequential LOB fetches. You should be able to verify this by tracking the time spent in SQL execution versus LOB content fetching (see the sketch below). Or you can verify it by excluding the LOB column, as suggested by the previous answer in this post, which should tell you whether it is LOB related.
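For illustration, a minimal timing harness along those lines in plain JDBC; the query is the one from the question (table name still elided), and the class and method names are made up:

    import java.sql.Clob;
    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class LobFetchTimer {

        // Splits wall-clock time between advancing the cursor (SQL execution
        // and row fetch) and materializing each CLOB's content.
        static void timeLobFetch(Connection conn) throws Exception {
            long fetchNanos = 0, lobNanos = 0;
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT bigClob FROM ...")) {
                while (true) {
                    long t0 = System.nanoTime();
                    boolean hasRow = rs.next();
                    fetchNanos += System.nanoTime() - t0;
                    if (!hasRow) {
                        break;
                    }
                    long t1 = System.nanoTime();
                    Clob clob = rs.getClob(1);
                    // Content is discarded; only the timing matters here.
                    String content = clob.getSubString(1, (int) clob.length());
                    lobNanos += System.nanoTime() - t1;
                }
            }
            System.out.printf("fetch: %d ms, CLOB reads: %d ms%n",
                    fetchNanos / 1_000_000, lobNanos / 1_000_000);
        }
    }

If the second number dominates and scales with the row count, that points to the per-LOB round trips rather than the query itself.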

Good luck

Oscar Chan
A: 

Thanks for all the helpful suggestions. Despite this being flagged as the answer to the problem, my answer is that there seems to be no good solution. I tried using parallel statements, different storage characteristics, presorted temporary tables and other things. The operation does not seem to be bound by any characteristic visible through traces or explain plans. Even query parallelism seems to be sketchy when CLOBs are involved.

Undoubtedly there would be better options for dealing with large CLOBs (especially compression) in an 11g environment, but at the moment I am stuck with 10g.

I have now opted for an additional round trip to the database in which I'll preprocess the CLOBs into a size-optimized binary RAW. In previous deployments this has always been a very fast option and will likely be worth the trouble of maintaining an offline computed cache. The cache will be invalidated and updated using a persistent process and AQ, until someone comes up with a better idea.
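For illustration, a minimal sketch of the compression side of that preprocessing, using the JDK's Deflater streams; the class and method names are made up, and the actual storage of the resulting bytes in the cache column is not shown:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.DeflaterOutputStream;
    import java.util.zip.InflaterInputStream;

    class ClobCompressor {

        // Compresses CLOB text into a compact byte[] for a binary cache column.
        static byte[] compress(String clobContent) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (DeflaterOutputStream out = new DeflaterOutputStream(buf)) {
                out.write(clobContent.getBytes(StandardCharsets.UTF_8));
            }
            return buf.toByteArray();
        }

        // Inverse operation for reading the cached bytes back into a String.
        static String decompress(byte[] compressed) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (InflaterInputStream in = new InflaterInputStream(
                    new ByteArrayInputStream(compressed))) {
                byte[] chunk = new byte[8192];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
            }
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        }
    }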

yawn