I am working on an API to query a database server (Oracle in my case) to retrieve massive amounts of data. (This is actually a layer on top of JDBC.)

The API I created tries to avoid, as much as possible, loading all of the queried information into memory. I mean that I prefer to iterate over the result set and process the returned rows one by one instead of loading every row into memory and processing them later.
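
For concreteness, here is a minimal sketch of the row-by-row approach in plain JDBC; the connection string, table and column names are placeholders, not my actual schema:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class StreamingQuery {
        public static void main(String[] args) throws SQLException {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password");
                 PreparedStatement stmt = conn.prepareStatement(
                    "SELECT id, payload FROM items");
                 ResultSet rs = stmt.executeQuery()) {
                // Process each row as it arrives; nothing is accumulated in memory.
                while (rs.next()) {
                    process(rs.getLong("id"), rs.getString("payload"));
                }
            }
        }

        private static void process(long id, String payload) {
            // Placeholder for per-row processing (validation, export, etc.).
        }
    }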

But I am wondering if this is the best practice since it has some issues:

  • The result set is kept open during the whole processing; if the processing takes as long as retrieving the data, the result set stays open twice as long.
  • Running another query inside my processing loop means opening another result set while I am already using one, and it may not be a good idea to have too many result sets open simultaneously.

On the other side, it has some advantages:

  • I never have more than one row of data in memory per result set; since my queries tend to return around 100k rows, this may be worth it.
  • Since my framework is heavily based on functional programming concepts, I never rely on multiple rows being in memory at the same time.
  • Starting to process the first rows returned while the database engine is still producing the remaining rows is a great performance boost.

In response to Gandalf, I am adding some more information:

  • I will always have to process the entire result set
  • I am not doing any aggregation of rows

I am integrating with a master data management application and retrieving data in order to either validate it or export it using many different formats (to the ERP, to the web platform, etc.).

A: 

There is no universal answer. I have personally implemented both solutions dozens of times.

It depends on what matters more to you: memory or network traffic.

If you have a fast network connection (LAN) and a poor client machine, then fetch data row by row from the server.

If you work over the Internet, then batch fetching will help you.

You can set the prefetch count in your database layer properties and find a golden mean.
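
With plain JDBC this maps to Statement.setFetchSize; a minimal sketch, assuming an existing Connection conn and an illustrative items table (the value 500 is arbitrary, the right number depends on row size and available memory; if I remember correctly, Oracle's JDBC driver defaults to a fetch size of 10):

    PreparedStatement stmt = conn.prepareStatement("SELECT id, payload FROM items");
    stmt.setFetchSize(500); // rows transferred from Oracle per network round trip
    try (ResultSet rs = stmt.executeQuery()) {
        while (rs.next()) {
            // most calls to rs.next() are served from the prefetched block
        }
    }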

The rule of thumb is: fetch everything that you can hold without noticing it.

If you need a more detailed analysis, there are six factors involved:

  • Row generation response time / rate (how soon Oracle generates the first row / last row)
  • Row delivery response time / rate (how soon you can get the first row / last row)
  • Row processing response time / rate (how soon you can show the first row / last row)

One of them will be the bottleneck.

As a rule, rate and response time are antagonists.

With prefetching, you can control the row delivery response time and the row delivery rate: a higher prefetch count will increase the rate but worsen the response time; a lower prefetch count will do the opposite.

Choose which one is more important to you.

You can also do the following: create separate threads for fetching and processing.

Select just enough rows to keep the user amused in low-prefetch mode (where response time is good), then switch into high-prefetch mode.

It will fetch the rows in the background and you can process them in the background too, while the user browses over the first rows.
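
A sketch of that two-thread arrangement, assuming a hypothetical Row value class and a switch point of 50 rows (both purely illustrative), with a BlockingQueue handing rows from the fetcher thread to the processing thread:

    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class FetchAndProcess {

        record Row(long id, String payload) {}

        private static final Row END = new Row(-1, null); // end-of-stream marker

        static void run(ResultSet rs) throws InterruptedException {
            BlockingQueue<Row> queue = new ArrayBlockingQueue<>(2_000);

            // Fetcher thread: starts with whatever low fetch size the statement was
            // given, then switches to a high fetch size once the first rows are out.
            Thread fetcher = new Thread(() -> {
                try {
                    int fetched = 0;
                    while (rs.next()) {
                        queue.put(new Row(rs.getLong(1), rs.getString(2)));
                        if (++fetched == 50) {
                            rs.setFetchSize(1_000); // switch to high-prefetch mode
                        }
                    }
                    queue.put(END);
                } catch (SQLException | InterruptedException e) {
                    throw new RuntimeException(e);
                }
            });
            fetcher.start();

            // The caller acts as the processing thread, consuming rows as they arrive.
            Row row;
            while ((row = queue.take()) != END) {
                // process the row while the fetcher keeps filling the queue
            }
            fetcher.join();
        }
    }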

Quassnoi
Based on your rule of thumb, I understand that if I had an unlimited amount of memory, I should fetch every record at once. But my problem with this option is that fetching 100k records takes time and delays the start of their processing. Fetching them one by one allows me to start processing as records get fetched and limits CPU usage, since my processing actually takes place between each fetch.
Vincent Robert
Thank you for suggesting the analysis. I will analyze those values and try to make the best decision. Thanks for the suggestions too, but my application is not user-oriented but data-oriented; I need to export as much data as possible, as fast as possible.
Vincent Robert