We have a part of an application where, say, 20% of the time it needs to read in a huge amount of data that exceeds memory limits. While we could increase those limits, we hesitate to do so, since it requires a high allocation that isn't necessary most of the time.

We are considering using a customized java.util.List implementation that spools to disk when we hit peak loads like this, but remains in memory under lighter circumstances (roughly sketched below).

The data is loaded once into the collection, subsequently iterated over and processed, and then thrown away. It doesn't need to be sorted once it's in the collection.
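
To illustrate, here's a rough sketch of the kind of wrapper we have in mind (untested, all names hypothetical, and minus proper error handling and cleanup):

    import java.io.*;
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    // Sketch: holds elements in memory up to a threshold, then spools the
    // overflow to a temp file via serialization. Built for a load-once,
    // iterate-once, throw-away usage pattern like ours.
    public class SpoolingBuffer<E extends Serializable> implements Iterable<E> {
        private final int memoryThreshold;
        private final List<E> inMemory = new ArrayList<E>();
        private File spillFile;
        private ObjectOutputStream spillOut;
        private int spilledCount = 0;

        public SpoolingBuffer(int memoryThreshold) {
            this.memoryThreshold = memoryThreshold;
        }

        public void add(E element) throws IOException {
            if (inMemory.size() < memoryThreshold) {
                inMemory.add(element);
                return;
            }
            if (spillOut == null) {
                spillFile = File.createTempFile("spool", ".bin");
                spillFile.deleteOnExit();
                spillOut = new ObjectOutputStream(
                        new BufferedOutputStream(new FileOutputStream(spillFile)));
            }
            spillOut.writeObject(element);
            spilledCount++;
        }

        // Iterates the in-memory portion first, then replays the spill file.
        public Iterator<E> iterator() {
            try {
                if (spillOut != null) {
                    spillOut.flush();
                }
                final Iterator<E> memIter = inMemory.iterator();
                final ObjectInputStream diskIn = (spillFile == null) ? null
                        : new ObjectInputStream(new BufferedInputStream(
                                new FileInputStream(spillFile)));
                return new Iterator<E>() {
                    private int readFromDisk = 0;

                    public boolean hasNext() {
                        return memIter.hasNext() || readFromDisk < spilledCount;
                    }

                    @SuppressWarnings("unchecked")
                    public E next() {
                        if (memIter.hasNext()) {
                            return memIter.next();
                        }
                        try {
                            readFromDisk++;
                            return (E) diskIn.readObject();
                        } catch (Exception e) {
                            throw new RuntimeException(e);
                        }
                    }

                    public void remove() {
                        throw new UnsupportedOperationException();
                    }
                };
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }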

Does anyone have pros/cons regarding such an approach?

Is there an open source product that provides some sort of List impl like this?

Thanks!

Updates:

  • Not to be cheeky, but by 'huge' I mean exceeding the amount of memory we're willing to allocate without interfering with other processes on the same hardware. What other details do you need?
  • The application is, essentially, a batch processor that loads data from multiple database tables and applies extensive business logic to it. All of the data in the list is required, since the logic includes aggregate operations.
  • I just came across this post which offers a very good option: http://stackoverflow.com/questions/1068477/stxxl-equivalent-in-java
A: 

Is any sorting or processing going on while the data is being read into the collection? Where is it being read from?

If it's already being read from disk, would it be possible to simply batch-process it directly from disk instead of reading it completely into a list and then iterating? How interdependent is the data?

Amber
Each element of the list is a domain object, with a hierarchy of related objects for each element. It is being read in from several (15-25) database tables in various queries, some cached and some not.
eqbridges
+4  A: 

Do you really need to use a List? Write an implementation of Iterator (it may help to extend AbstractIterator) that steps through your data instead. Then you can make use of helpful utilities like these with that iterator. None of this will cause huge amounts of data to be loaded eagerly into memory -- instead, records are read from your source only as the iterator is advanced.
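
For example, a minimal sketch of such an iterator over a JDBC ResultSet (AbstractIterator is from Guava; Record and Record.fromRow are hypothetical stand-ins for your domain mapping):

    import java.sql.ResultSet;
    import java.sql.SQLException;
    import com.google.common.collect.AbstractIterator;

    // Sketch: streams rows from an open ResultSet one at a time, so the
    // iterator itself never holds more than a single record in memory.
    public class ResultSetIterator extends AbstractIterator<Record> {
        private final ResultSet rs;

        public ResultSetIterator(ResultSet rs) {
            this.rs = rs;
        }

        @Override
        protected Record computeNext() {
            try {
                if (!rs.next()) {
                    return endOfData(); // tells AbstractIterator we're done
                }
                return Record.fromRow(rs); // hypothetical row-to-object mapping
            } catch (SQLException e) {
                throw new RuntimeException(e);
            }
        }
    }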

Kevin Bourrillion
This is definitely an option, and one we're seriously considering. What I'm hoping for is an iterator (or list) implementation that can read from memory or, when an internal buffer is exceeded, from disk.
eqbridges
Well, if it just reads from an InputStream, then BufferedInputStream takes care of the buffer for you.
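For example (the file name is hypothetical):

    InputStream in = new BufferedInputStream(
            new FileInputStream("spool.bin"), 64 * 1024); // 64 KB read buffer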
Kevin Bourrillion
+2  A: 

If you're working with huge amounts of data, you might want to consider using a database instead.

rob
+1. A database combined with a query that returns only a specific result set would likely be a better idea.
BalusC
The data is stored in a database across about 20 different tables and loaded at various points in the application's execution, with each data point having various expiration dates. A specific part of the application is, essentially, a bottleneck (poor design), and large quantities of data end up there.
eqbridges
I was just going to suggest using a database-backed collection (a combination of Kevin Bourrillion's suggestion and mine), but after reading your updated question, it looks like one of the other solutions you've found on your own does just that.
rob
+1  A: 

Back it up to a database and do lazy loading on the items.

An ORM framework may be in order. It depends on your usage. It may be pretty straightforward, or the worst of your nightmares; it's hard to tell from what you've described.

I'm an optimist, and I think that using an ORM framework (such as Hibernate) would solve your problem in about 3-5 days.
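
As a sketch of what that could look like with Hibernate's ScrollableResults (Trade and applyBusinessLogic are hypothetical stand-ins), hydrating one row at a time instead of loading the whole list:

    import org.hibernate.ScrollMode;
    import org.hibernate.ScrollableResults;
    import org.hibernate.Session;

    // Sketch, assuming a mapped Trade entity: scroll the result set
    // so rows are hydrated one at a time rather than all at once.
    public void processAll(Session session) {
        ScrollableResults results = session
                .createQuery("from Trade")
                .scroll(ScrollMode.FORWARD_ONLY);
        try {
            while (results.next()) {
                Trade trade = (Trade) results.get(0);
                applyBusinessLogic(trade); // hypothetical processing step
                session.evict(trade);      // keep the session cache from growing
            }
        } finally {
            results.close();
        }
    }

Evicting each entity after processing keeps the session's first-level cache from growing, which is what the comment below is getting at.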

OscarRyz
BTW, you'll have to remove items as you iterate over them; otherwise they'll remain in memory anyway. :)
OscarRyz
Best idea so far. Let the magic of SQL handle any searching or scanning! Also, depending on how contentious your updates are, this makes the solution scalable across many machines.
James Anderson
A: 

I would also question why you need to load all of the data into memory to process it. Typically, you should be able to do the processing as the data is being loaded and then use the result. That would keep the actual data out of memory.

James Bailey
aggregate operations need to be done on the data.
eqbridges
You can do the aggregation while loading the data:

    Iterator<Integer> iter = ...;
    int sum = 0;
    while (iter.hasNext()) {
        sum += iter.next();
    }
    return sum;
James Bailey
It's more complex than simple sums. There's extensive logic governing which operations to perform on the data, and when. (The domain is financial trade accounting on a wide variety of financial instruments.)
eqbridges