I'm working with a big dense (non-sparse) matrix containing about 10^10 doubles. Of course I cannot keep it all in memory, and I only need one row at a time.

I thought of splitting it into files, one row per file (which requires a lot of files), and just reading a file every time I need a row. Do you know a more efficient way?

+1  A: 

Why do you want to store it in different files? Can't you use a single file?

You could use the methods of the RandomAccessFile class (seek, then read) to jump straight to the row you need in that file.
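To make the single-file idea concrete, here is a minimal sketch. It assumes the matrix was written row by row as raw 8-byte big-endian doubles (what DataOutputStream.writeDouble produces) and that the number of columns is known up front. The class name MatrixRowReader and its methods are made up for illustration, not part of any library.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;

/** Hypothetical helper: reads single rows of a dense, row-major matrix stored in one file. */
final class MatrixRowReader implements AutoCloseable {
    private final RandomAccessFile file;
    private final int cols; // assumption: the column count is known to the caller

    MatrixRowReader(String path, int cols) throws IOException {
        this.file = new RandomAccessFile(path, "r");
        this.cols = cols;
    }

    /** Seeks to the start of the given row and reads it in one call. */
    double[] readRow(long row) throws IOException {
        byte[] raw = new byte[cols * Double.BYTES];
        // row is a long, so the offset is computed in 64-bit arithmetic
        file.seek(row * cols * Double.BYTES);
        file.readFully(raw); // throws EOFException if the row is past the end
        double[] out = new double[cols];
        // ByteBuffer is big-endian by default, matching writeDouble
        ByteBuffer.wrap(raw).asDoubleBuffer().get(out);
        return out;
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

Reading the whole row into a byte array first and converting it afterwards avoids one readDouble call per element, which tends to be slow on an unbuffered RandomAccessFile.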

Aviator
You are right, RandomAccessFile looks like a better solution.
BigG
Thanks. :) Do give it a try.
Aviator
A: 

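So that is 800 KB per file (10^5 doubles per row at 8 bytes each, assuming the matrix is square), which sounds like a reasonable division. Nothing really stops you from using one giant file, of course. A matrix, at least a non-sparse one like yours, can be treated as a file of fixed-length records, which makes random access trivial: row i starts at byte offset i × (row length in bytes).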

If you do store it one file per row, I might suggest making a directory tree corresponding to decimal digits, so 0/0/0/0 through 9/9/9/9.
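If you go the one-file-per-row route, the digit tree could be computed roughly like this. This is only a sketch under my own assumptions: 10^5 rows (so five decimal digits), the first four digits used as directory levels, and a file name of the form rowNNNNN.bin. The class name RowPaths and the naming scheme are hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: maps a row index to a file inside a 0/0/0/0 .. 9/9/9/9 directory tree. */
final class RowPaths {
    private RowPaths() {}

    static Path pathForRow(Path root, int row) {
        if (row < 0 || row > 99_999) {
            // assumption: at most 10^5 rows, i.e. five decimal digits
            throw new IllegalArgumentException("row out of range: " + row);
        }
        String digits = String.format("%05d", row); // zero-padded, e.g. 42 -> "00042"
        Path p = root;
        for (int i = 0; i < 4; i++) {
            // one directory level per leading digit
            p = p.resolve(String.valueOf(digits.charAt(i)));
        }
        return p.resolve("row" + digits + ".bin");
    }
}
```

With this layout, pathForRow(Paths.get("data"), 12345) resolves to data/1/2/3/4/row12345.bin, and each leaf directory ends up holding ten row files.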

Considerations one way or the other...

  • is it being backed up? Do you have high-capacity backup media or something ordinary?
  • does this file ever change?
  • if it does change and it is backed up, does it change all at once or are changes localized?
DigitalRoss
It doesn't change, and I have plenty of free space on my hard drive.
BigG
If it doesn't change, I'm guessing it doesn't need to be backed up either. I think I agree with Aviator, it's looking like one big file is the way to go.
DigitalRoss
A: 

If you are going to be saving it in a file, I believe writing it in binary form will save both space and time over storing it as text.

Writing the doubles in binary stores each one as 8 bytes (plus any serialization overhead), and it means you do not have to convert the doubles to and from Strings when saving or loading the file.
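A small sketch of the difference, using DataOutputStream for the binary form and one decimal string per line for the text form. The class EncodingSize and its method names are made up for illustration; the text format is just one possible choice.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical comparison of binary and text encodings for an array of doubles. */
final class EncodingSize {
    private EncodingSize() {}

    /** Raw binary: exactly 8 bytes per value, no parsing needed when reading back. */
    static byte[] toBinary(double[] values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (double v : values) {
                out.writeDouble(v);
            }
        }
        return bytes.toByteArray();
    }

    /** Text: one decimal string per line; the size depends on the digits of each value. */
    static byte[] toText(double[] values) {
        StringBuilder sb = new StringBuilder();
        for (double v : values) {
            sb.append(v).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.US_ASCII);
    }
}
```

The binary form also keeps every record the same length, which is what makes seeking directly to a given row possible. With text, the row length varies with the values.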

Matt Boehm
Right, I forgot to mention that in my question, sorry!
BigG
A: 

It depends on the algorithms you want to execute, but I guess that in most cases a representation where each file contains some square or rectangular region would be better.

For example, matrix multiplication can be done recursively by breaking a matrix into submatrices.

starblue
No, I just need one row.
BigG
A: 

I'd suggest using a disk-persistent cache like Ehcache. Just configure it to keep as many fragments of your matrix in memory as you like and it will take care of the serialization. All you have to do is decide how to fragment the matrix.

Another approach that comes to mind is using Terracotta (which recently acquired Ehcache, by the way). It gives you a large network-attached heap that can easily manage your 10^10 double values without you having to deal with it in code at all.

sfussenegger