I have a large (3Gb) binary file of doubles which I access (more or less) randomly during an iterative algorithm I have written for clustering data. Each iteration does about half a million reads from the file and about 100k writes of new values.

I create the FileChannel like this...

f = new File(_filename);
_ioFile = new RandomAccessFile(f, "rw");
_ioFile.setLength(_extent * BLOCK_SIZE);
_ioChannel = _ioFile.getChannel();

I then use a private ByteBuffer the size of a double to read from it

private ByteBuffer _double_bb = ByteBuffer.allocate(8);

and my reading code looks like this

public double GetValue(long lRow, long lCol) 
{
    long idx = TriangularMatrix.CalcIndex(lRow, lCol);
    long position = idx * BLOCK_SIZE;
    double d = 0;
    try 
    {
        _double_bb.position(0);
        _ioChannel.read(_double_bb, position);
        d = _double_bb.getDouble(0);
    } 

    ...snip...

    return d;
}

and I write to it like this...

public void SetValue(long lRow, long lCol, double d)
{
    long idx = TriangularMatrix.CalcIndex(lRow, lCol);
    long offset = idx * BLOCK_SIZE;
    try
    {
        _double_bb.putDouble(0, d);
        _double_bb.position(0);
        _ioChannel.write(_double_bb, offset);
    }

    ...snip...

}

The time taken for an iteration of my code increases roughly linearly with the number of reads. I have added a number of optimisations to the surrounding code to minimise the number of reads, but I am now down to the core set of reads that I feel is necessary without fundamentally altering how the algorithm works, which I want to avoid at the moment.

So my question is whether there is anything in the read/write code or JVM configuration I can do to speed up the reads? I realise I can change hardware, but before I do that I want to make sure that I have squeezed every last drop of software juice out of the problem.

Thanks in advance

+1  A: 

Presumably if we can reduce the number of reads then things will go more quickly.

3Gb isn't huge for a 64-bit JVM, hence quite a lot of the file would fit in memory.

Suppose that you treat the file as "pages" which you cache. When you read a value, read the page around it and keep it in memory. Then when you do more reads check the cache first.

Or, if you have the capacity, read the whole thing into memory at the start of processing.
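
A minimal sketch of the paging idea, assuming a hypothetical PageCache class wrapped around the existing _ioChannel; the page size and cache capacity are illustrative only, and writes would additionally need dirty-page tracking before this could replace SetValue:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical read-through page cache over a FileChannel of doubles.
// Each miss pulls in a whole page so that nearby reads become cache hits.
class PageCache {
    private static final int PAGE_DOUBLES = 4096;               // 32 KB pages (illustrative)
    private static final int PAGE_BYTES = PAGE_DOUBLES * 8;
    private static final int MAX_PAGES = 4096;                  // ~128 MB of cache (tune to your heap)

    private final FileChannel channel;
    private final Map<Long, ByteBuffer> pages =
        new LinkedHashMap<Long, ByteBuffer>(16, 0.75f, true) {  // access order => simple LRU
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, ByteBuffer> eldest) {
                return size() > MAX_PAGES;
            }
        };

    PageCache(FileChannel channel) {
        this.channel = channel;
    }

    double get(long idx) throws IOException {
        long pageNo = idx / PAGE_DOUBLES;
        ByteBuffer page = pages.get(pageNo);
        if (page == null) {                                      // miss: one read for the whole page
            page = ByteBuffer.allocate(PAGE_BYTES);
            channel.read(page, pageNo * PAGE_BYTES);
            pages.put(pageNo, page);
        }
        return page.getDouble(((int) (idx % PAGE_DOUBLES)) * 8);
    }
}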

djna
32-bit only I'm afraid, which is a limitation placed on me by customers.
Simon
+2  A: 

Instead of reading into a ByteBuffer, I would use file mapping, see: FileChannel.map().

Also, you don't really explain how your GetValue(row, col) and SetValue(row, col) access the storage. Are row and col more or less random? The idea I have in mind is the following: sometimes, in image processing, when you have to access pixels like row + 1, row - 1, col - 1, col + 1 to average values, one trick is to organize the data in 8 x 8 or 16 x 16 blocks. Doing so helps keep the different pixels of interest in a contiguous memory area (and hopefully in the cache).

You might transpose this idea to your algorithm (if it applies): you map a portion of your file once, so that the different calls to GetValue(row, col) and SetValue(row, col) work on this portion that's just been mapped.
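
If it helps to make that concrete, here is a rough sketch of the mapping approach; MappedRegion is a made-up helper name and the 64 MB window size is only an example. On a 32-bit JVM the whole 3Gb file cannot be mapped at once (and mapped buffers are only released by the garbage collector, so churning through many windows can exhaust address space), which is why this maps one region at a time:

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Map one window of the file at a time and serve reads/writes from it,
// relying on the OS page cache instead of explicit read()/write() calls.
class MappedRegion {
    private static final long REGION_BYTES = 64L * 1024 * 1024;  // 64 MB window (illustrative)

    private final FileChannel channel;
    private MappedByteBuffer region;      // currently mapped window
    private long regionStart = -1;        // file offset of the window, -1 = nothing mapped yet

    MappedRegion(FileChannel channel) {
        this.channel = channel;
    }

    private void ensureMapped(long position) throws IOException {
        long start = (position / REGION_BYTES) * REGION_BYTES;
        if (start != regionStart) {
            long length = Math.min(REGION_BYTES, channel.size() - start);
            region = channel.map(FileChannel.MapMode.READ_WRITE, start, length);
            regionStart = start;
        }
    }

    double getValue(long bytePosition) throws IOException {
        ensureMapped(bytePosition);
        return region.getDouble((int) (bytePosition - regionStart));
    }

    void setValue(long bytePosition, double d) throws IOException {
        ensureMapped(bytePosition);
        region.putDouble((int) (bytePosition - regionStart), d);
    }
}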

Gregory Pakosz
I like the idea, I need to noodle on it a bit to see how it would fit. In fact the file is the bottom left triangle of a very large square of numbers in which (row, col) and (col, row) have identical values. Originally I had the whole thing in memory as a 1-D array of doubles and I index them with some arithmetic which allows me to get at them randomly and without worrying whether I put the row or column first. I have tried to access them contiguously but I like your idea of small meta-rectangles. In the region of code where I do most reads the order is not important so that may work.
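
For readers following along, a plausible reconstruction of what an index calculation like TriangularMatrix.CalcIndex could look like for that layout (an assumption for illustration, not Simon's actual code):

// Hypothetical sketch: map a symmetric (row, col) pair onto a 1-D
// lower-triangular layout so (row, col) and (col, row) hit the same slot.
class TriangularIndex {
    static long calcIndex(long row, long col) {
        if (col > row) {            // order doesn't matter: fold into the lower triangle
            long tmp = row;
            row = col;
            col = tmp;
        }
        return row * (row + 1) / 2 + col;
    }
}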
Simon
why a downvote?
Gregory Pakosz
+1  A: 
  1. Accessing data byte-by-byte always produces poor performance (not only in Java). Try to read/write bigger blocks (e.g. rows or columns) - see the sketch below.

  2. How about switching to a database engine for handling such amounts of data? It would handle all the optimizations for you.

Maybe this article helps you ...
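
A sketch of point 1, assuming the algorithm can consume a whole row at a time and assuming the lower-triangular layout where row r starts at linear index r * (r + 1) / 2 (both assumptions, not taken from the question):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

class RowReader {
    // Read an entire row of doubles with a single positioned read
    // instead of one 8-byte read per element.
    static double[] readRow(FileChannel channel, long row) throws IOException {
        int count = (int) (row + 1);                 // row r of the lower triangle holds r + 1 values
        long startIndex = row * (row + 1) / 2;       // linear index of (row, 0)
        ByteBuffer buf = ByteBuffer.allocate(count * 8);
        channel.read(buf, startIndex * 8);           // one I/O call for the whole row
        buf.flip();
        double[] values = new double[count];
        buf.asDoubleBuffer().get(values);
        return values;
    }
}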

ThinkJet
+1  A: 

You might want to consider using a library which is designed for managing large amounts of data and random reads rather than using raw file access routines.

The HDF file format may be a good fit. It has a Java API, but it is not pure Java. It's licensed under an Apache-style license.

Robert Christie
Looks interesting. What's a typical use case for HDF?
Joel
This link http://www.hdfgroup.org/why_hdf/ may be useful - it's the target of the HDF link above. According to their website, it's used when data is large or complex and needs fast or random I/O, etc.
Robert Christie
Turns out pytables uses this and I use pytables in other projects. I had in fact recently contemplated re-implementing the whole thing in python so I could use numpy, scipy and pytables. The case is getting stronger.
Simon
I'd come across it through PyTables as well - perhaps numpy and pytables is a better fit for this.
Robert Christie
+3  A: 

As long as your file is stored on a regular hard disk, you will get the biggest possible speedup by organizing your data in a way that gives your accesses locality, i.e. causes as many get/set calls in a row as possible to access the same small area of the file.

This is more important than anything else you can do because accessing random spots on a HD is by far the slowest thing a modern PC does - it takes about 10,000 times longer than anything else.

So if it's possible to work on only a part of the dataset (small enough to fit comfortably into the in-memory HD cache) at a time and then combine the results, do that.
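
As a sketch of what "working on a part at a time" could look like for the lower-triangular matrix, assuming the per-pair work can be reordered freely (as Simon's comment above suggests it can in the hot region); TiledTraversal, PairVisitor and the tile size are all made up for illustration:

// Visit (row, col) pairs tile by tile instead of in arbitrary order, so that
// consecutive GetValue/SetValue calls touch the same small region of the file.
class TiledTraversal {
    interface PairVisitor {
        void pair(long row, long col);
    }

    static final long TILE = 1024;  // illustrative tile edge, in matrix elements

    static void processTiled(long n, PairVisitor visitor) {
        for (long rowBase = 0; rowBase < n; rowBase += TILE) {
            for (long colBase = 0; colBase <= rowBase; colBase += TILE) {
                long rowEnd = Math.min(rowBase + TILE, n);
                for (long row = rowBase; row < rowEnd; row++) {
                    long colEnd = Math.min(colBase + TILE, row + 1);  // stay inside the lower triangle
                    for (long col = colBase; col < colEnd; col++) {
                        visitor.pair(row, col);                       // the real per-pair work goes here
                    }
                }
            }
        }
    }
}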

Alternatively, avoid the issue by storing your file on an SSD or (better) in RAM. Even storing it on a simple thumb drive could be a big improvement.

Michael Borgwardt
good answer, thanks.
Simon