I'm running a query against a table in a PostgreSQL database. The database is on a remote machine. The table has around 30 sub-tables using PostgreSQL's partitioning capability.

The query will return a large result set, something around 1.8 million rows.

In my code I use spring jdbc support, method JdbcTemplate.query, but my RowCallbackHandler is not being called.

My best guess is that the PostgreSQL JDBC driver (I use version 8.3-603.jdbc4) is accumulating the entire result in memory before calling my code. I thought the fetchSize configuration could control this, but I tried it and nothing changed. I did this as the PostgreSQL manual recommends.

This query worked fine when I used Oracle XE, but I'm trying to migrate to PostgreSQL because of the partitioning feature, which is not available in Oracle XE.

My environment:

  • PostgreSQL 8.3
  • Windows Server 2008 Enterprise 64-bit
  • JRE 1.6 64-bit
  • Spring 2.5.6
  • PostgreSQL JDBC Driver 8.3-603
+1  A: 

I'm betting that there's not a single client of your app that needs 1.8M rows all at the same time. You should think of a sensible way to chunk the results into smaller pieces and give users the chance to iterate through them.

That's what Google does. When you do a search there might be millions of hits, but they return results 25 at a time, on the theory that you'll find what you want on the first page.

If it's not a client, and the results are being massaged in some way, I'd recommend letting the database crunch all those rows and simply return the result. It makes no sense to return 1.8M rows just to do a calculation on the middle tier.
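The chunking idea can be sketched as a simple LIMIT/OFFSET pager. This is only an illustration with a hypothetical table and ordering column, not the asker's actual query; string concatenation is used here for brevity (in real code the page size and offset would go through PreparedStatement parameters):

```java
// Minimal paging sketch (assumption: a hypothetical "quotes" table ordered by id).
// Each client request fetches one page instead of the whole 1.8M-row result.
public class Paging {
    // Appends a LIMIT/OFFSET clause for a 1-based page number.
    static String pageQuery(String baseSql, int page, int pageSize) {
        int offset = (page - 1) * pageSize; // rows to skip before this page
        return baseSql + " LIMIT " + pageSize + " OFFSET " + offset;
    }
}
```

For example, page 3 with a page size of 25 skips the first 50 rows and returns rows 51-75.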

If neither of those apply, you've got a real problem. Time to rethink it.

After reading the later responses it sounds to me like this is more of a reporting solution that ought to be crunched in batch or calculated in real time and stored in tables that are not part of your transactional system. There's no way that bringing 1.8M rows to the middle tier for calculating moving averages can scale.

I'd recommend reorienting yourself - start thinking about it as a reporting solution.

duffymo
The database loading is done once on system startup, to massage the data and "convert" it to a more suitable format for client use. There is nothing to configure in the RowCallbackHandler itself, but there is a lot of configuration on the connection, datasource, driver, and database design. It's a little hard to debug this on the production server, where the mass data is. Logging is all I have.
tuler
+1  A: 

In order to use a cursor to retrieve data you have to set the ResultSet type to ResultSet.TYPE_FORWARD_ONLY and set auto-commit to false, in addition to setting a fetch size. That is covered in the doc you linked to, but you didn't explicitly mention that you did those steps.
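Those requirements can be sketched with plain JDBC (assumptions: a hypothetical "quotes" table, Java 7+ try-with-resources for brevity, and a connection you manage yourself rather than one handed out by Spring):

```java
import java.sql.*;

public class StreamingQuery {
    // Mirrors the driver's documented conditions for cursor-based streaming:
    // auto-commit off, forward-only result set, and a positive fetch size.
    static boolean willUseCursor(boolean autoCommit, int resultSetType, int fetchSize) {
        return !autoCommit
                && resultSetType == ResultSet.TYPE_FORWARD_ONLY
                && fetchSize > 0;
    }

    // Streams the result in batches instead of materializing all rows in memory.
    static long countRows(Connection conn) throws SQLException {
        conn.setAutoCommit(false); // must be off on THIS connection
        long rows = 0;
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT * FROM quotes",          // hypothetical table
                ResultSet.TYPE_FORWARD_ONLY,
                ResultSet.CONCUR_READ_ONLY)) {
            ps.setFetchSize(1000);               // fetch 1000 rows per round trip
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    rows++;                      // process each row here
                }
            }
        }
        return rows;
    }
}
```

If any one of those conditions is missing, the driver silently falls back to reading the whole result set into memory, which matches the symptom described in the question.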

Be careful with PostgreSQL's partitioning scheme. It really does very horrible things to the optimizer and can cause massive performance issues where there should be none (depending on the specifics of your data). In any case, is your table only 1.8M rows? There is no reason it would need to be partitioned based on size alone, given that it is appropriately indexed.

Trey
Yes, I did all the steps described in the docs (TYPE_FORWARD_ONLY and autocommit false). 1.8M is my current sample. The data is partitioned by day, one day in each partition. Each day has around 400k rows.
tuler
Afraid I can't be of help then. I haven't had any issue with the pg JDBC driver and fetch size for multi-gigabyte analysis queries, but I have no experience with the Spring library. Just as a test, I'd try it without partitioning. How large do you expect the data set to grow? At 400k rows per day I don't expect your current algorithm to last all that long. Without knowing your exact use case, I'd suggest you start looking at data warehousing techniques like star schemas and data dimensions.
Trey
Thanks for the help Trey. I'm storing historical market data. I'll use 30 days of data, like a 30-day sliding window from today back to 30 days ago. Every day the inheritance will be modified to link new partitions and unlink old ones. The data will be there for a year or so, but only 30 sub-tables (days) will be linked at any given time. Partitioning is being used to make this maintenance more feasible.
tuler
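The sliding-window maintenance described in that comment could be scripted roughly like this; the parent/child table names are hypothetical, and ALTER TABLE ... INHERIT / NO INHERIT is available from PostgreSQL 8.2 on:

```java
// Sketch of the daily sliding-window job: each day's data lives in its own
// child table; a nightly job attaches the new day and detaches the day that
// fell out of the 30-day window. Table names ("quotes", "quotes_YYYYMMDD")
// are assumptions, not the asker's actual schema.
public class PartitionMaintenance {
    // SQL that links a day's child table into the parent.
    static String attachSql(String day) {
        return "ALTER TABLE quotes_" + day + " INHERIT quotes";
    }

    // SQL that unlinks a day that left the window; the child table keeps its
    // data, so history stays queryable directly for up to a year.
    static String detachSql(String day) {
        return "ALTER TABLE quotes_" + day + " NO INHERIT quotes";
    }
}
```

Detaching instead of dropping is what makes the "data stays around for a year, but only 30 days are linked" scheme work.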
A: 

The fetchSize property worked as described in the PostgreSQL manual.

My mistake was that I was setting auto-commit = false on a connection obtained from the connection pool that was not the connection being used by the prepared statement.

Thanks for all the feedback.

tuler