Can anyone help me with the following problem? I need to permanently save the data I currently hold in arrays, so I can use it later for calculations. An example is explained below.

1, I generate a long[][] which is far too big for my computer's RAM. It is generated one row after the other.

2, I calculate something from my long[][] and save the results in a double[][], which is also too big for my RAM. I do not need the entire long[][] at once: each calculation uses only a small batch of its rows, and each batch fills one row of the double[][].

3, I need to sort the double[][], and do a lot of other things that are not important here.

4, I repeat steps 2 and 3 over a large number of iterations (>10000), so I care about the speed of both access and sorting.

I know the size of the arrays, but I obviously cannot initialize them: they are too big, and a Java array must be sized by an int anyway (so far, I can only run "small" calculations). Of course I could use Maps etc., but I have failed to get this working, and I do not understand which kind(s) I should use; I have never used maps/collections before. In that case, I could use one of the columns of the arrays as keys, as these columns are identical in both arrays (apart from the type). The key could simply be the row number (expressed as a long).

Preferably, I want to solve this without a database that requires installing a server, as my program will be used by people other than me.

I am more than grateful for any help and advice!

+1  A: 

For storing this data you could use NetCDF or HDF5. Both let you read and write subsets of an array, so the whole thing never has to be in memory at once.
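As a rough, untested sketch of how the write side could look with the netcdf-java library (I'm assuming the 4.x NetcdfFileWriter API here; the file name, variable name, and sizes are made up):

    import ucar.ma2.ArrayDouble;
    import ucar.ma2.DataType;
    import ucar.nc2.NetcdfFileWriter;
    import ucar.nc2.Variable;

    public class NetcdfSketch {
        public static void main(String[] args) throws Exception {
            int rows = 1000000, cols = 10, batch = 1000;

            // Declare the full rows x cols variable up front; only one
            // batch of rows ever has to fit in memory.
            NetcdfFileWriter writer = NetcdfFileWriter.createNew(
                    NetcdfFileWriter.Version.netcdf3, "results.nc");
            writer.addDimension(null, "row", rows);
            writer.addDimension(null, "col", cols);
            Variable v = writer.addVariable(null, "data", DataType.DOUBLE, "row col");
            writer.create();

            ArrayDouble.D2 buf = new ArrayDouble.D2(batch, cols);
            for (int r0 = 0; r0 < rows; r0 += batch) {
                // ... fill buf with the next batch of computed rows ...
                writer.write(v, new int[]{r0, 0}, buf); // batch lands at row offset r0
            }
            writer.close();
        }
    }

Reading works the same way in reverse: Variable.read(origin, shape) pulls just the slice you ask for off the disk.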

DiggyF
Thanks for the advice! Both of them look promising.
EvoMangan
+1  A: 

If the arrays are larger than your computer's RAM can hold, then, obviously, you have to store part or all of each array on disk.

You can use a database for this. Since you don't want to install a server, you can use an embedded database such as HSQLDB, which runs inside your own process. You can configure HSQLDB to delete all data when your application terminates or to retain it for future use.
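Something like this minimal, untested sketch of the in-process mode over plain JDBC (assuming HSQLDB 2.x; the table and column names are made up):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HsqldbSketch {
        public static void main(String[] args) throws Exception {
            // "jdbc:hsqldb:file:..." runs inside your process, nothing to
            // install; SA with an empty password is HSQLDB's default account.
            // (Older HSQLDB versions may first need
            // Class.forName("org.hsqldb.jdbcDriver") to load the driver.)
            Connection c = DriverManager.getConnection(
                    "jdbc:hsqldb:file:data/results", "SA", "");
            Statement st = c.createStatement();

            // CACHED = rows are stored on disk and only a cache stays in RAM,
            // so the table can grow far beyond memory.
            st.execute("CREATE CACHED TABLE results (id BIGINT PRIMARY KEY, val DOUBLE)");

            PreparedStatement ins = c.prepareStatement("INSERT INTO results VALUES (?, ?)");
            for (long id = 0; id < 1000; id++) {
                ins.setLong(1, id);              // row number as the key
                ins.setDouble(2, Math.random()); // stand-in for the real result
                ins.addBatch();
            }
            ins.executeBatch();

            // Sorting becomes a query; the database does the external sort.
            ResultSet rs = st.executeQuery("SELECT id, val FROM results ORDER BY val");
            while (rs.next()) {
                // ... use rs.getLong(1) and rs.getDouble(2) ...
            }

            st.execute("SHUTDOWN"); // flush everything to disk cleanly
            c.close();
        }
    }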

An alternative is a custom Map implementation that flushes data to secondary storage whenever its size grows beyond a threshold you define. Several eviction strategies are available: FIFO, LIFO, LRU, etc. Likewise, whenever you need a certain element of the map, you can load a block of adjacent elements from disk along with it (or, again, use whatever strategy fits your access pattern) to reduce excessive disk I/O.
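Here is a minimal sketch of the LRU flavour using nothing but java.util.LinkedHashMap's access order and its removeEldestEntry hook; writeRowToDisk/readRowFromDisk are hypothetical placeholders for whatever storage you put underneath:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Keeps at most maxRows rows in memory; when that threshold is exceeded,
    // the least recently used row is flushed to disk and evicted.
    class RowCache extends LinkedHashMap<Long, double[]> {
        private final int maxRows;

        RowCache(int maxRows) {
            super(16, 0.75f, true); // true = access order, which gives LRU
            this.maxRows = maxRows;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<Long, double[]> eldest) {
            if (size() > maxRows) {
                writeRowToDisk(eldest.getKey(), eldest.getValue());
                return true; // drop the eldest entry from memory
            }
            return false;
        }

        double[] getRow(long rowNumber) {
            double[] row = get(rowNumber);
            if (row == null) {              // cache miss: load from disk
                row = readRowFromDisk(rowNumber);
                put(rowNumber, row);        // may in turn evict another row
            }
            return row;
        }

        private void writeRowToDisk(long key, double[] row) { /* hypothetical */ }
        private double[] readRowFromDisk(long key) { /* hypothetical */ return null; }
    }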

Bytecode Ninja
Great! I'll test the alternatives to see which is the most efficient way to do it. I guess HSQLDB is the easiest. Thanks a lot!
EvoMangan
A: 

Managing subsets of the data is likely to be the best solution.

However, you should ask yourself whether you are using the right machine for the job. You can buy a new Core 2 Duo 2.5 GHz PC with 4 GB of memory for £225, a quad-core AMD with 8 GB for £380, or 16 GB of memory on its own for £320.

My point is that your time is worth something: you need to trade off how much work it will take you, now and in the future, to save some memory against how much that memory costs.

Peter Lawrey
Well... yes, computers are cheap, especially if you (like me) can accept just a "loose" motherboard, a bunch of cables, and Linux. Still, one will always want to do more and more... In my case, I can test small things with my computers, but as soon as I want to analyse more interesting stuff, RAM will not be enough.
EvoMangan
In that case you need to create a class which looks like an array but manages how much of the "array" is actually in memory. Basically you need an intelligent long get(int x, int y) method; how much you keep in memory is just a caching issue. One way to implement this is with a memory mapped file. If you do this, your data size is limited by your disk space, which is much cheaper than memory (but not as fast!)
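A minimal, untested sketch of that idea with java.nio (the file name and geometry are made up; note that a single MappedByteBuffer is capped at Integer.MAX_VALUE bytes, about 2 GB, so a complete solution would keep an array of mappings, one per chunk of the file):

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    // Looks like a long[rows][cols] but lives in a file; the OS pages the
    // hot parts of the mapping in and out of physical memory for you.
    class MappedLongMatrix {
        private final MappedByteBuffer buf;
        private final int cols;

        MappedLongMatrix(String path, int rows, int cols) throws Exception {
            this.cols = cols;
            RandomAccessFile raf = new RandomAccessFile(path, "rw");
            long bytes = (long) rows * cols * 8; // 8 bytes per long
            // One mapping cannot exceed ~2 GB; chunk the file beyond that.
            buf = raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, bytes);
        }

        long get(int x, int y) {                    // the "intelligent get"
            return buf.getLong((x * cols + y) * 8); // int offset is safe under the 2 GB cap
        }

        void set(int x, int y, long value) {
            buf.putLong((x * cols + y) * 8, value);
        }
    }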
Peter Lawrey