ansaurus

Question

How to store millions of Double during a calculation?

Answer 1

+1 A:

From your description, it appears you will not be able to easily improve your memory usage. The size of a double is fixed, and if you need to retain all results until your final processing, you will not be able to reduce the size of that data.

If you need to reduce your memory usage, but can accept a longer run time, you could replace the Map<String, List<Double>> with a List<Double> and only process a single deal at a time.

If you have to have all the values from all the deals, your only option is to increase your available memory. Your calculation of the memory usage is based on just the size of a value and the number of values. Without a way to decrease the number of values you need, no data structure will be able to help you, you just need to increase your available memory.

Alan Geleynse 2010-10-14 15:16:49

Looks like we were writing similar things at the same time!

Dommer 2010-10-14 15:21:15

Indeed. I should have indicated that I must run all deals at the same time. See my edit.

romaintaz 2010-10-14 15:29:01

If you need all the deals, you aren't going to be able to reduce your memory usage. It isn't overhead of the data structure that is filling your memory, it is the values you need. I updated my answer to include this.

Alan Geleynse 2010-10-14 15:32:44

Answer 2

+2 A:

Just to clarify, do you need ALL of the information in memory at once? It sounds like you are doing financial simulations (maybe credit risk?). Say you are running 30 deals, do you need to store all of the values in memory? Or can you run the first deal (~900,000 * 64bits), then discard the list of double (serialize it to disk or something) and then proceed with the next? I thought this might be okay as you say the deals are independent of one another.

Apologies if this sounds patronising; I'm just trying to get a proper idea of the problem.

Dommer 2010-10-14 15:20:07

You are right, this is for financial calculations. See my edit.

romaintaz 2010-10-14 15:28:25

Answer 3

+1 A:

The flippant answer is to get a bunch more memory. Sun JVM's can (almost happily) handle multi gigabyte heaps and if it's a batch job then longer GC pauses might not be a massive issue.

You may decide that this not a sane solution, the first thing to attempt would be to write a custom list like collection but have it store primitive doubles instead of the object wrapper Double objects. This will help save the per object overhead you pay for each Double object wrapper. I think the Apache common collections project had primitive collection implementations, these might be a starting point.

Another level would be to maintain the list of doubles in a nio Buffer off heap. This has the advantage that the space being used for the data is actually not considered in the GC runs and could in theory could lead you down the road of managing the data structure in a memory mapped file.

Gareth Davis 2010-10-14 15:20:25

Answer 4

+5 A:

Your problem is algorithmic and you are looking for a "reduction in strength" optimization.

Unfortunately, you've been too coy in the the problem description and say "Unfortunately, this algorithm requires the complete list of these values..." which is dubious. The simulation run has already passed a predicate which in itself tells you something about the sets that pass through the sieve.

I expect the data that meets the criteria has a low information content and therefore is amenable to substantial compression.

Without further information, we really can't help you more.

msw 2010-10-14 15:25:06

You are right. I've edited my question to add the pseudo algorithm I use.

romaintaz 2010-10-14 15:43:44

Answer 5

A:

There was a theory that I read awhile ago where you would write the data to disk and only read/write a chunk what you. Of course this describes virtual memory, but the difference here is that the programmer controls the flow and location rathan than the OS. The advantage there is that the OS is only allocated so much virtual memory to use, where you have access to the whole HD.

Or an easier option is just to increase your swap/paged memory, which I think would be silly but would help in your case.

After a quick google it seems like this function might help you if you are running on Windows: http://msdn.microsoft.com/en-us/library/aa366537(VS.85).aspx

Nathan Adams 2010-10-14 15:25:49

Answer 6

+2 A:

Can you get away with using floats instead of doubles? That would save you 100Mb.

JeremyP 2010-10-14 15:29:04

Floating point arithmetic is not exact and would affect the calculations.

Mark Robinson 2010-10-14 15:51:42

You are right Mark. However, in my experience, floats MAY sometimes be acceptable for credit risk calculations, but certainly not for front office systems.

Dommer 2010-10-14 15:55:12

@Mark Robinson: Your statement applies to doubles as well as floats. It is a question of whether the loss of precision matters in this particular scenario. Only the questioner can answer that.

JeremyP 2010-10-14 15:57:06

doubles are floats with more bits, they are no more exact than floats.

Dolphin 2010-10-14 15:57:52

Will have to check this point, it can be indeed a good idea...

romaintaz 2010-10-14 16:12:48

@JeremyP Good point, just saying it something to be aware of.

Mark Robinson 2010-10-14 17:29:10

After some benchmarks, and tests, I will go for a `float[]` instead of a `List<Double>`. This is not a perfect solution, but for the number of elements I have to store, it is quite enough.

romaintaz 2010-10-18 08:58:26

Answer 7

A:

From what you tell us it sounds like you need 10^6 x 30 processors (ie number of simulations multiplied by number of deals) each with a few K RAM. Perhaps, though, you don't have that many processors -- do you have 30 each of which has sufficient memory for the simulations for one deal ?

Seriously: parallelise your program and buy an 8-core computer with 32GB RAM (or 16-core w 64GB or ...). You are going to have to do this sooner or later, might as well do it now.

High Performance Mark 2010-10-14 15:35:37

Answer 8

+3 A:

You mentioned that the "engine" is not connected to a database, but have you considered using a database to store the lists of elements? Possibly an embedded DB such as SQLite?
If you used int or even short instead of string for the key field of your Map, that might save some memory.
If you need a collection object that guarantees order, then consider a Queue or a Stack instead of your List that you are currently using.
Possibly think of a way to run deals sequentially, as Dommer and Alan have already suggested.

I hope that was of some help!

EDIT:

Your comment about only having 30 keys is a good point.

In that case, since you have to calculate all your deals at the same time, then have you considered serializing your Lists to disk (i.e. XML)?
Or even just writing a text file to disk for each List, then after the deals are calculated, loading one file/List at a time to verify that List of conditions?

Of course the disadvantage is slow file IO, but, this would reduced your server's memory requirement.

JohnB 2010-10-14 15:40:52

The Map will only contains 30 keys (one per deal). So I will not save anything by using another type of class...

romaintaz 2010-10-14 15:47:26

@JohnB You suggested to use a database. So I created another subject regarding this particular idea: http://stackoverflow.com/questions/3936044/how-efficient-will-be-to-use-a-in-memory-database-to-store-millions-of-temporary

romaintaz 2010-10-14 17:55:20

Your second idea is quite the same thing that I suggested in my third edit. Maybe I will try this solution, but writing the file only every 100,000 values, not for every value in order to reduce the amount of disk I/O.

romaintaz 2010-10-14 17:57:38

I agree, don't do 100,000 `writeLines()`. Do one IO per `List`.

JohnB 2010-10-14 19:08:03

Answer 9

A:

You say you need access to all the values, but you cannot possibly operate on all of them at once? Can you serialize the data such that you can store it in a single file. Each record set apart either by some delimiter, key value, or simply the byte count. Keep a byte counter either way. Let that be a "circular file" composed of a left file and a right file operating like opposing stacks. As data is popped(read) off the left file it is processed and pushed(write) into the right file. If your next operation requires a previously processed value reverse the direction of the file transfer. Think of your algorithm as residing at the read/write head of your hard drive. You have access as you would with a list just using different methods and at much reduced speed. The speed hit will be significant but if you can optimize your sequence of serialization so that the most likely accessed data is at the top of the file in order of use and possibly put the left and right files on different physical drives and your page file on a 3rd drive you will benefit from increased hard disk performance due to sequential and simultaneous reads and writes. Of course its a bit harder than it sounds. Each change of direction requires finalizing both files. Logically something like, if (current data flow if left to right) {send EOF to right_file; left_file = left_file - right_file;} Practically you would want to leave all data in place where it physically resides on the drive and just manipulate the beginning and ending addresses for the files in the master file table. Literally operating like a pair of hard disk stacks. This will be a much slower, more complicated process than simply adding more memory, but very much more efficient than separate files and all that overhead for 1 file per record * millions of records. Or just put all your data into a database. FWIW, this idea just came to me. I've never actually done it or even heard of it done. But I imagine someone must have thought of it before me. If not please let me know. I could really use the credit on my resume.

slomobile 2010-10-14 17:38:23

ansaurus

tags:

views:

answers:

How to store millions of Double during a calculation?

related questions