views:

1269

answers:

6

I am sequentially processing a large file and I'd like to keep a large chunk of it in memory; I have 16 GB of RAM available on a 64-bit system.

A quick and dirty way to do this is to simply wrap the input stream in a buffered input stream. Unfortunately, this only gives me a 2 GB buffer. I'd like to have more of it in memory; what alternatives do I have?
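Roughly, the current approach looks like this (the file name and buffer size are just placeholders); the buffer size parameter is an int, which is where the 2 GB ceiling comes from:

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;

    public class BufferedRead {
        public static void main(String[] args) throws IOException {
            // BufferedInputStream takes its buffer size as an int, so the largest
            // possible buffer is a bit under Integer.MAX_VALUE bytes (~2 GB).
            int bufferSize = 1 << 30; // 1 GB; needs a heap big enough to hold it (-Xmx)
            BufferedInputStream in =
                    new BufferedInputStream(new FileInputStream("bigfile.dat"), bufferSize);
            try {
                int b;
                while ((b = in.read()) != -1) {
                    // process the byte; the buffer amortizes the underlying I/O calls
                }
            } finally {
                in.close();
            }
        }
    }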

+3  A: 

How about letting the OS deal with the buffering of the file? Have you checked what the performance impact of not copying the whole file into the JVM's memory is?

EDIT: You could then use either RandomAccessFile or FileChannel to efficiently read the necessary parts of the file into the JVM's memory.
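As a rough sketch of the FileChannel route (the file name, offset, and chunk size are placeholders, not anything from your setup):

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    public class ChannelRead {
        // Reads 'length' bytes starting at 'position' into a ByteBuffer.
        static ByteBuffer readChunk(FileChannel channel, long position, int length)
                throws IOException {
            ByteBuffer buf = ByteBuffer.allocate(length);
            channel.position(position);
            while (buf.hasRemaining()) {
                if (channel.read(buf) == -1) {
                    break; // end of file
                }
            }
            buf.flip();
            return buf;
        }

        public static void main(String[] args) throws IOException {
            FileChannel channel = new FileInputStream("bigfile.dat").getChannel();
            try {
                ByteBuffer chunk = readChunk(channel, 0L, 64 * 1024 * 1024); // first 64 MB
                // process chunk here
            } finally {
                channel.close();
            }
        }
    }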

Alexander
Initially it was not buffered; Java would read a little, process it, then read a little more. With each read it would send an I/O request and wait for the I/O to complete. Buffering the input increased the speed linearly: the bigger the buffer, the faster the processing.
Achille
Have a look at Java NIO; it can perform much more efficient asynchronous file I/O operations. See the link in my comment to your question -- it has quite a good comparison of various methods.
Alexander
+3  A: 

Have you considered the MappedByteBuffer in java.nio? It's over my head but maybe it is what you are looking for.

Josh
I thought so, too, but it looks like the backing buffer of a ByteBuffer is *still* a normal buffer, so it has the same limitations as a raw buffer.
Derek Park
There are two sorts of buffer: one uses a byte array and the other (direct) uses a fixed location outside the Java heap. Unfortunately, neither can go above 2 GB. This is not currently fixed in "more NIO features" (JDK 7, probably). Vote for it. http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6347833
Tom Hawtin - tackline
A: 

I think there are 64-bit JVMs that will support nonstandard limits.

You might try buffering chunks.

davenpcj
+1  A: 

I doubt that buffering more than 2 GB at a time is going to be a huge win anyway. Depending on the amount of processing you're doing, you might be able to read nearly as fast as you process. To speed it up, you might try a two-threaded producer-consumer model (one thread reads the file and hands the data off to the other thread for processing).
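A minimal sketch of that two-threaded split, assuming a BlockingQueue hand-off, an 8 MB chunk size, and a placeholder file name:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class ProducerConsumer {
        private static final byte[] POISON = new byte[0]; // signals end of file

        public static void main(String[] args) throws Exception {
            final BlockingQueue<byte[]> queue = new ArrayBlockingQueue<byte[]>(16);

            // Producer: reads fixed-size chunks and hands them to the consumer.
            Thread reader = new Thread(new Runnable() {
                public void run() {
                    try {
                        InputStream in = new FileInputStream("bigfile.dat");
                        try {
                            byte[] chunk = new byte[8 * 1024 * 1024]; // 8 MB chunks
                            int n;
                            while ((n = in.read(chunk)) != -1) {
                                byte[] copy = new byte[n];
                                System.arraycopy(chunk, 0, copy, 0, n);
                                queue.put(copy);
                            }
                            queue.put(POISON);
                        } finally {
                            in.close();
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
            reader.start();

            // Consumer: processes chunks while the reader keeps the queue full.
            byte[] chunk;
            while ((chunk = queue.take()) != POISON) {
                // process chunk here
            }
            reader.join();
        }
    }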

Michael Myers
+1  A: 

The OS is going to cache as much of the file as it can, so trying to outsmart the cache manager probably isn't going to get you very much.

From a performance perspective, you will be much better served by keeping the bytes outside the JVM (transferring huge chunks of data between the OS and the JVM is relatively slow). You can achieve this goal by using a MappedByteBuffer backed by a direct memory block.
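Here's a minimal sketch of that approach (the file name and window size are placeholders); since a single mapping is capped at 2 GB, it maps the file in sliding windows:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class MappedRead {
        public static void main(String[] args) throws IOException {
            RandomAccessFile file = new RandomAccessFile("bigfile.dat", "r");
            try {
                FileChannel channel = file.getChannel();
                long fileSize = channel.size();
                long windowSize = 512L * 1024 * 1024; // 512 MB windows
                for (long pos = 0; pos < fileSize; pos += windowSize) {
                    long len = Math.min(windowSize, fileSize - pos);
                    MappedByteBuffer window =
                            channel.map(FileChannel.MapMode.READ_ONLY, pos, len);
                    while (window.hasRemaining()) {
                        byte b = window.get();
                        // process the byte; the OS pages the data in as needed
                    }
                }
            } finally {
                file.close();
            }
        }
    }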

Here's a pertinent how-to type of article: article

Kevin Day
A: 

Take a look at ScatteringByteChannel and GatheringByteChannel. These allow you to create a number of buffers and fill them all with a single read.

Even the NIO buffers use ints as an index, so you won't be able to work against the standard APIs if you buffer more than 2 GB of data. If you're sure that you want to do this (I advise letting the OS take care of buffering for you), you'll have to write your own abstraction. I'd recommend using ByteBuffer as a model, and simply declaring the "int" parameters as "long". Under the covers, you can do the computation to find out which real ByteBuffer is being indexed, and get/put the desired location.
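A rough sketch of that kind of abstraction, with a hypothetical LargeByteBuffer class (not an existing API); allocating several gigabytes of direct buffers may also require raising -XX:MaxDirectMemorySize:

    import java.nio.ByteBuffer;

    // Long-indexed buffer that spreads its contents across several
    // ByteBuffers, each smaller than the 2 GB int limit.
    public class LargeByteBuffer {
        private final ByteBuffer[] segments;
        private final int segmentSize;

        public LargeByteBuffer(long capacity, int segmentSize) {
            this.segmentSize = segmentSize;
            int count = (int) ((capacity + segmentSize - 1) / segmentSize);
            segments = new ByteBuffer[count];
            long remaining = capacity;
            for (int i = 0; i < count; i++) {
                int size = (int) Math.min(segmentSize, remaining);
                segments[i] = ByteBuffer.allocateDirect(size);
                remaining -= size;
            }
        }

        // The usual ByteBuffer operations, but indexed with a long.
        public byte get(long index) {
            return segments[(int) (index / segmentSize)].get((int) (index % segmentSize));
        }

        public void put(long index, byte value) {
            segments[(int) (index / segmentSize)].put((int) (index % segmentSize), value);
        }

        // Exposes the underlying buffers so they can be filled with scattering reads.
        public ByteBuffer[] segments() {
            return segments;
        }
    }

Since FileChannel implements ScatteringByteChannel, repeated calls to channel.read(large.segments()) can fill the segments directly from the file.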

erickson