I am working on an API to query a database server (Oracle in my case) to retrieve massive amounts of data. (This is actually a layer on top of JDBC.)
The API I created tries to avoid, as much as possible, loading all of the queried data into memory. I mean that I prefer to iterate over the result set and process the returned rows one by one, instead of loading all the rows into memory and processing them later.
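Concretely, the row-by-row approach looks something like this in raw JDBC. This is a minimal sketch: the items table, the connection details, and the process method are placeholders, not my actual code.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class StreamingQueryExample {
    public static void main(String[] args) throws SQLException {
        // Placeholder connection details for the sketch.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:oracle:thin:@//localhost:1521/ORCL", "user", "password");
             PreparedStatement stmt = conn.prepareStatement(
                 "SELECT id, name FROM items")) {

            // Ask the driver to fetch rows in batches rather than all at
            // once; Oracle's default fetch size is only 10 rows.
            stmt.setFetchSize(500);

            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    // Only the current fetch batch is ever held in memory.
                    process(rs.getLong("id"), rs.getString("name"));
                }
            }
        }
    }

    private static void process(long id, String name) {
        // Placeholder for the actual per-row processing.
        System.out.println(id + ": " + name);
    }
}
```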
But I am wondering whether this is best practice, since it has some issues:
- The result set is kept open during the whole processing; if the processing takes as long as retrieving the data, the result set stays open twice as long as it otherwise would.
- Doing another query inside my processing loop means opening another result set while I am already using one, and opening too many result sets simultaneously may not be a good idea (see the sketch just after this list).
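To make that second point concrete, here is a sketch of a per-row lookup inside the processing loop; the items and item_details tables are hypothetical. Preparing the inner statement once, outside the loop, at least avoids re-parsing, but every executeQuery still opens a second cursor while the outer result set is open:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class NestedQueryExample {
    static void exportWithDetails(Connection conn) throws SQLException {
        try (PreparedStatement outer = conn.prepareStatement(
                 "SELECT id FROM items");
             // Prepared once, executed once per outer row.
             PreparedStatement inner = conn.prepareStatement(
                 "SELECT detail FROM item_details WHERE item_id = ?")) {
            outer.setFetchSize(500);
            try (ResultSet rows = outer.executeQuery()) {
                while (rows.next()) {
                    inner.setLong(1, rows.getLong("id"));
                    // Second cursor open while the outer one is still in use;
                    // Oracle caps cursors per session via OPEN_CURSORS.
                    try (ResultSet details = inner.executeQuery()) {
                        while (details.next()) {
                            System.out.println(details.getString("detail"));
                        }
                    }
                }
            }
        }
    }
}
```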
On the other hand, it has some advantages:
- I never have more than one row of data in memory per result set; since my queries tend to return around 100k rows, this may be worth it.
- Since my framework is heavily based on functional programming concepts, I never rely on multiple rows being in memory at the same time (see the sketch after this list).
- Starting to process the first rows while the database engine is still returning the remaining ones is a great performance boost.
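To illustrate that functional style, here is a minimal sketch of the kind of wrapper I mean; the forEachRow helper and RowHandler interface are hypothetical names, not my actual API. The result set never escapes the method, so callers can only ever see the current row:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class FunctionalIteration {
    // Callback invoked once per row; a custom interface rather than
    // java.util.function.Consumer so the handler can throw SQLException.
    interface RowHandler {
        void handle(ResultSet row) throws SQLException;
    }

    // The result set is opened, iterated, and closed here; the caller
    // never gets a reference to it and only sees one row at a time.
    static void forEachRow(Connection conn, String sql, RowHandler handler)
            throws SQLException {
        try (PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setFetchSize(500);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    handler.handle(rs); // one row at a time
                }
            }
        }
    }
}
```

A call site then reads as a single expression over the rows, e.g. forEachRow(conn, "SELECT id FROM items", row -> export(row.getLong("id"))), and processing starts as soon as the first fetch batch arrives.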
In response to Gandalf, here is some more information:
- I will always have to process the entire result set
- I am not doing any aggregation of rows
I am integrating with a master data management application, retrieving data in order to either validate it or export it in many different formats (to the ERP, to the web platform, etc.).