ansaurus

Question

Any code tips for speeding up random reads from a Java FileChannel?

Answer 1

+1 A:

Presumably if we can reduce the number of reads then things will go more quickly.

3 GB isn't huge for a 64-bit JVM, so quite a lot of the file would fit in memory.

Suppose that you treat the file as "pages" which you cache. When you read a value, read the page around it and keep it in memory. Then when you do more reads check the cache first.

Or, if you have the capacity, read the whole thing into memory at the start of processing.
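The page-caching idea above could be sketched roughly as follows. This is only an illustration under assumptions that are not in the question: the file is taken to hold raw big-endian doubles, and the class name `PagedDoubleFile`, the 64 KB page size, and the LRU eviction policy are all hypothetical choices.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical LRU page cache over a FileChannel that holds raw doubles. */
class PagedDoubleFile {
    // Assumed page size; a multiple of 8 so that no double straddles two pages.
    private static final int PAGE_BYTES = 64 * 1024;

    private final FileChannel channel;
    private final Map<Long, ByteBuffer> pages;

    PagedDoubleFile(FileChannel channel, final int maxPages) {
        this.channel = channel;
        // accessOrder = true keeps the least recently used page first,
        // so removeEldestEntry evicts it once the cache is full.
        this.pages = new LinkedHashMap<Long, ByteBuffer>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, ByteBuffer> eldest) {
                return size() > maxPages;
            }
        };
    }

    /** Returns the double at the given element index (8 bytes per element). */
    double get(long index) throws IOException {
        long bytePos = index * Double.BYTES;
        long pageNo = bytePos / PAGE_BYTES;
        ByteBuffer page = pages.get(pageNo);
        if (page == null) {
            page = loadPage(pageNo);   // cache miss: one disk read for the whole page
            pages.put(pageNo, page);
        }
        return page.getDouble((int) (bytePos % PAGE_BYTES));
    }

    private ByteBuffer loadPage(long pageNo) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(PAGE_BYTES);
        long pos = pageNo * PAGE_BYTES;
        // A single read may return fewer bytes than requested, so loop until full or EOF.
        while (buf.hasRemaining()) {
            int n = channel.read(buf, pos + buf.position());
            if (n < 0) {
                break;   // the last page of the file may be short; the rest stays zero
            }
        }
        return buf;
    }
}
```

Whether this helps depends on the access pattern: with `maxPages` pages of 64 KB the cache covers only a small part of a 3 GB file, so it pays off only when consecutive reads tend to land near each other.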

djna 2009-12-22 08:53:38

32-bit only, I'm afraid, which is a limitation placed on me by customers.

Simon 2009-12-22 15:15:18

Answer 2

+2 A:

Instead of reading into a ByteBuffer, I would use file mapping, see: FileChannel.map().

Also, you don't really explain how your GetValue(row, col) and SetValue(row, col) access the storage. Are row and col more or less random? The idea I have in mind is the following: in image processing, when you have to access neighbouring pixels such as row + 1, row - 1, col - 1, col + 1 to average values, one trick is to organize the data in 8 x 8 or 16 x 16 blocks. Doing so helps keep the pixels of interest in a contiguous memory area (and hopefully in the cache).

You might transpose this idea to your algorithm (if it applies): you map a portion of your file once, so that the different calls to GetValue(row, col) and SetValue(row, col) work on this portion that's just been mapped.
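A minimal sketch of mapping one portion of the file at a time with FileChannel.map() might look like this. The class name `MappedWindow`, the fixed window size, and the assumption that the file holds raw big-endian doubles are illustrative only; they are not taken from the question.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

/** Hypothetical window of doubles mapped from a larger file, remapped on demand. */
class MappedWindow {
    private final FileChannel channel;   // must be open for reading and writing
    private final long windowBytes;      // window size, rounded down to whole doubles
    private long windowStart = -1;
    private MappedByteBuffer window;

    MappedWindow(FileChannel channel, long windowBytes) {
        if (windowBytes < Double.BYTES || windowBytes > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("window size out of range");
        }
        this.channel = channel;
        this.windowBytes = windowBytes - (windowBytes % Double.BYTES);
    }

    double get(long index) throws IOException {
        return windowFor(index).getDouble(offset(index));
    }

    void set(long index, double value) throws IOException {
        windowFor(index).putDouble(offset(index), value);
    }

    private int offset(long index) {
        return (int) (index * Double.BYTES - windowStart);
    }

    /** Returns the current window, remapping first if the index lies outside it. */
    private MappedByteBuffer windowFor(long index) throws IOException {
        long bytePos = index * Double.BYTES;
        if (window == null || bytePos < windowStart
                || bytePos >= windowStart + window.capacity()) {
            if (bytePos < 0 || bytePos + Double.BYTES > channel.size()) {
                throw new IndexOutOfBoundsException("index " + index);
            }
            windowStart = (bytePos / windowBytes) * windowBytes;
            long size = Math.min(windowBytes, channel.size() - windowStart);
            window = channel.map(FileChannel.MapMode.READ_WRITE, windowStart, size);
        }
        return window;
    }
}
```

Remapping is not cheap, and a mapping stays in place until its buffer is garbage-collected, so this only pays off if many GetValue/SetValue calls hit the same window before moving on. On a 32-bit JVM the window also has to be much smaller than the file, since the address space cannot hold all of it.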

Gregory Pakosz 2009-12-22 09:00:47

I like the idea, I need to noodle on it a bit to see how it would fit. In fact the file is the bottom left triangle of a very large square of numbers in which (row, col) and (col, row) have identical values. Originally I had the whole thing in memory as a 1-D array of doubles and I index them with some arithmetic which allows me to get at them randomly and without worrying whether I put the row or column first. I have tried to access them contiguously but I like your idea of small meta-rectangles. In the region of code where I do most reads the order is not important so that may work.
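For readers unfamiliar with this kind of index arithmetic, one common way to do it is sketched below. It assumes the triangle is stored row by row and includes the diagonal; the comment does not say which layout is actually used, and the name `TriangleIndex` is hypothetical.

```java
/** Hypothetical index arithmetic for a symmetric matrix stored as its lower triangle. */
final class TriangleIndex {
    private TriangleIndex() {}

    /**
     * Maps (row, col) to a position in the 1-D array. The larger coordinate is
     * treated as the row, so (row, col) and (col, row) give the same position.
     * Rows 0..r-1 hold 1 + 2 + ... + r = r * (r + 1) / 2 elements in total.
     */
    static long index(long row, long col) {
        long r = Math.max(row, col);
        long c = Math.min(row, col);
        return r * (r + 1) / 2 + c;
    }
}
```

For example, `TriangleIndex.index(3, 1)` and `TriangleIndex.index(1, 3)` both give 7. With this layout, all values of one row sit next to each other in the file, which is what makes row-wise or block-wise access cheaper than jumping between rows.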

Simon 2009-12-22 15:12:52

why a downvote?

Gregory Pakosz 2009-12-22 16:14:13

Answer 3

+1 A:

Accessing data byte by byte always gives poor performance (not only in Java). Try to read and write bigger blocks (e.g. whole rows or columns).
How about switching to a database engine to handle such amounts of data? It would handle all the optimizations for you.
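As an illustration of the first suggestion (reading bigger blocks), a run of consecutive values can be fetched with one positional read instead of one read per value. The helper below is a sketch that assumes the file holds raw big-endian doubles; `BlockReader` and `readDoubles` are hypothetical names.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

/** Hypothetical helper that reads a run of doubles in one go. */
final class BlockReader {
    private BlockReader() {}

    /** Reads count consecutive doubles starting at element firstIndex. */
    static double[] readDoubles(FileChannel channel, long firstIndex, int count)
            throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(count * Double.BYTES);
        long pos = firstIndex * Double.BYTES;
        // A single read may return fewer bytes than requested, so loop until full.
        while (buf.hasRemaining()) {
            int n = channel.read(buf, pos + buf.position());
            if (n < 0) {
                throw new EOFException("file ends before the requested block");
            }
        }
        buf.flip();
        double[] out = new double[count];
        buf.asDoubleBuffer().get(out);   // bulk-convert the bytes to doubles
        return out;
    }
}
```

Called once per row, e.g. `BlockReader.readDoubles(channel, rowStart, rowLength)`, this replaces `rowLength` separate reads with a single one.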

Maybe this article will help you ...

ThinkJet 2009-12-22 09:03:06

Answer 4

+1 A:

You might want to consider using a library which is designed for managing large amounts of data and random reads rather than using raw file access routines.

The HDF file format may be a good fit. It has a Java API but is not pure Java. It's licensed under an Apache-style license.

Robert Christie 2009-12-22 09:20:39

Looks interesting. What's a typical use case for HDF?

Joel 2009-12-22 09:30:05

This link http://www.hdfgroup.org/why_hdf/ may be useful - it's the target of the HDF link above. According to their website, it's used when data is large or complex, needs fast or random I/O, etc.

Robert Christie 2009-12-22 09:39:26

Turns out pytables uses this and I use pytables in other projects. I had in fact recently contemplated re-implementing the whole thing in python so I could use numpy, scipy and pytables. The case is getting stronger.

Simon 2009-12-22 15:14:46

I'd come across it through PyTables as well - perhaps numpy and pytables is a better fit for this.

Robert Christie 2009-12-22 15:27:58

Answer 5

+3 A:

As long as your file is stored on a regular hard disk, you will get the biggest possible speedup by organizing your data in a way that gives your accesses locality, i.e. causes as many consecutive get/set calls as possible to access the same small area of the file.

This is more important than anything else you can do because accessing random spots on a HD is by far the slowest thing a modern PC does - it takes about 10,000 times longer than anything else.

So if it's possible to work on only a part of the dataset (small enough to fit comfortably into the in-memory HD cache) at a time and then combine the results, do that.

Alternatively, avoid the issue by storing your file on an SSD or (better) in RAM. Even storing it on a simple thumb drive could be a big improvement.

Michael Borgwardt 2009-12-22 09:48:25

good answer, thanks.

Simon 2009-12-22 15:13:27
