views:

3139

answers:

8

I have a method that creates a MessageDigest (a hash) from a file, and I need to do this to a lot of files (>= 100,000). How big should I make the buffer used to read from the files to maximize performance?

Most everyone is familiar with the basic code (which I'll repeat here just in case):

MessageDigest md = MessageDigest.getInstance( "SHA" );
FileInputStream ios = new FileInputStream( "myfile.bmp" );
byte[] buffer = new byte[4 * 1024]; // what should this value be?
int read = 0;
while( ( read = ios.read( buffer ) ) > 0 )
    md.update( buffer, 0, read );
ios.close();
md.digest();

What is the ideal size of the buffer to maximize throughput? I know this is system dependent, and I'm pretty sure it depends on the OS, file system, and HDD, and there may be other hardware/software in the mix.

(I should point out that I'm somewhat new to Java, so this may just be some Java API call I don't know about.)

Edit: I do not know ahead of time the kinds of systems this will be used on, so I can't assume a whole lot. (I'm using Java for that reason.)

Edit: The code above omits things like try..catch to keep the post short.

A: 

Yes, it's probably dependent on various things - but I doubt it will make very much difference. I tend to opt for 16K or 32K as a good balance between memory usage and performance.

Note that you should have a try/finally block in the code to make sure the stream is closed even if an exception is thrown.
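A minimal sketch of that advice, assuming Java 7 or later so that try-with-resources can stand in for an explicit try/finally. The class and method names (SafeFileHash, hashFile) are made up for illustration and are not from the question:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class, named here only for illustration.
class SafeFileHash {
    // Hashes one file. The stream is closed even if read() or update() throws.
    static byte[] hashFile(String path, int bufferSize)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] buffer = new byte[bufferSize];
        try (InputStream in = new FileInputStream(path)) {
            int read;
            // read() returns -1 at end of stream
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read);
            }
        }
        return md.digest();
    }
}
```

On older Java versions the same effect needs a finally block that calls close().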

Jon Skeet
I edited the post about the try..catch. In my real code I have one, but I left it out to make the post shorter.
ARKBAN
+2  A: 

In the ideal case we would have enough memory to read each file in a single read operation. That would perform best because it lets the system manage the file system, allocation units, and HDD at will. In practice, if you are fortunate enough to know the file sizes in advance, just use the average file size rounded up to a multiple of 4K (the default allocation unit on NTFS). And best of all: create a benchmark to test multiple options.
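A rough sketch of such a benchmark, not a rigorous one: it takes the best of a few runs per buffer size, and after the first pass the file will probably be served from the OS file cache, so it mostly measures the warm case. The class name, method names, and the list of candidate sizes are all assumptions made for illustration:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical benchmark helper, named here only for illustration.
class BufferSizeBenchmark {
    // Hashes the file once with the given buffer size; returns elapsed nanoseconds.
    static long timeHash(String path, int bufferSize)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] buffer = new byte[bufferSize];
        long start = System.nanoTime();
        try (InputStream in = new FileInputStream(path)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read);
            }
        }
        md.digest();
        return System.nanoTime() - start;
    }

    // Returns the best-of-runs time in nanoseconds for each candidate size.
    static long[] bestTimes(String path, int[] sizes, int runs)
            throws IOException, NoSuchAlgorithmException {
        timeHash(path, 8192); // warm-up pass: class loading, JIT, OS file cache
        long[] best = new long[sizes.length];
        for (int i = 0; i < sizes.length; i++) {
            best[i] = Long.MAX_VALUE;
            for (int run = 0; run < runs; run++) {
                best[i] = Math.min(best[i], timeHash(path, sizes[i]));
            }
        }
        return best;
    }
}
```

Calling bestTimes(path, new int[] {4096, 8192, 16384, 32768, 65536}, 5) on a representative file and printing the results gives a first impression; numbers from one machine should not be assumed to carry over to another.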

Ovidiu Pacurar
+2  A: 

You could use the buffered streams/readers and rely on their buffer sizes.

I believe the BufferedXStreams use 8192 as the buffer size, but as Ovidiu said, you should probably run a test on a whole bunch of options. It's really going to depend on the filesystem and disk configuration as to what the best size is.

John Gardner
A: 

Make the buffer big enough for most of the files to be read in one shot. Be sure to reuse the same buffer and the same MessageDigest for reading different files.
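One way this reuse might look, as a sketch with a made-up class name (ReusingHasher). It relies on the documented behavior that MessageDigest.digest() resets the digest afterwards; an instance like this is not safe to share between threads:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical class: one buffer and one MessageDigest shared across many files.
class ReusingHasher {
    private final MessageDigest md;
    private final byte[] buffer;

    ReusingHasher(int bufferSize) throws NoSuchAlgorithmException {
        md = MessageDigest.getInstance("SHA-1");
        buffer = new byte[bufferSize];
    }

    byte[] hash(String path) throws IOException {
        md.reset(); // drop partial state left behind if an earlier call failed
        try (InputStream in = new FileInputStream(path)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read);
            }
        }
        return md.digest(); // digest() also resets md for the next file
    }
}
```

Create one ReusingHasher up front and call hash(path) in the loop over all files, rather than allocating a new buffer and digest per file.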

Unrelated to the question: read Sun's code conventions, especially on spacing around parentheses and the use of redundant curly braces. Avoid the = operator in a while or if statement.

ngn
While it's normally a good idea to avoid side effects in an if/while condition, looping round while reading from a stream (or similar) is *such* a common case that I think it makes sense to make an exception for it.
Jon Skeet
As I said in the comment, the code has been compressed for purposes of the post.
ARKBAN
+1  A: 

In most cases, it really doesn't matter that much. Just pick a good size such as 4K or 16K and stick with it. If you're positive that this is the bottleneck in your application, then you should start profiling to find the optimal buffer size. If you pick a size that's too small, you'll waste time doing extra I/O operations and extra function calls. If you pick a size that's too big, you'll start seeing a lot of cache misses which will really slow you down. Don't use a buffer bigger than your L2 cache size.

Adam Rosenfield
+1  A: 

As already mentioned in other answers, use BufferedInputStreams.

After that, I guess the buffer size does not really matter. Either the program is I/O bound, in which case growing the buffer beyond the BufferedInputStream default will not have any big impact on performance.

Or the program is CPU bound inside MessageDigest.update(), in which case the majority of the time is not spent in the application code, so tweaking it will not help.

(Hmm... with multiple cores, threads might help.)
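A sketch of what the threading idea could look like, assuming a fixed-size thread pool and one MessageDigest per task (MessageDigest instances are not thread-safe). The class name, method names, and the 8192-byte buffer are illustrative choices. Whether this helps at all depends on whether the work is CPU bound or disk bound, which only a measurement can tell:

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.security.MessageDigest;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical class, named here only for illustration.
class ParallelHasher {
    static byte[] hashFile(String path) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1"); // one per task
        byte[] buffer = new byte[8192];
        try (InputStream in = new FileInputStream(path)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read);
            }
        }
        return md.digest();
    }

    // Hashes all paths on a pool of the given size; results keep input order.
    static Map<String, byte[]> hashAll(List<String> paths, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Map<String, Future<byte[]>> pending = new LinkedHashMap<>();
            for (String path : paths) {
                Callable<byte[]> task = () -> hashFile(path);
                pending.put(path, pool.submit(task));
            }
            Map<String, byte[]> result = new LinkedHashMap<>();
            for (Map.Entry<String, Future<byte[]>> e : pending.entrySet()) {
                result.put(e.getKey(), e.getValue().get()); // waits for the task
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }
}
```

A reasonable starting point for the pool size is Runtime.getRuntime().availableProcessors(), though on a single spinning disk more threads may just add seek contention.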

Maglob
A: 

Reading files using Java NIO's FileChannel and MappedByteBuffer will most likely result in a solution that will be much faster than any solution involving FileInputStream. Basically, memory-map large files, and use direct buffers for small ones.
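A minimal sketch of the memory-mapped variant, assuming Java 7 or later for FileChannel.open. The class name and the 256 MB chunk size are arbitrary choices for illustration; a single mapping cannot exceed Integer.MAX_VALUE bytes, so large files are mapped in pieces. The speed claim above is not verified by this sketch and would need to be measured:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical class, named here only for illustration.
class MappedFileHash {
    // Assumed chunk size; each mapping must stay below Integer.MAX_VALUE bytes.
    private static final long CHUNK = 256L * 1024 * 1024;

    static byte[] hashMapped(String path)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        try (FileChannel ch = FileChannel.open(Paths.get(path), StandardOpenOption.READ)) {
            long size = ch.size();
            long pos = 0;
            while (pos < size) {
                long len = Math.min(CHUNK, size - pos);
                MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
                md.update(map); // consumes the buffer from position to limit
                pos += len;
            }
        }
        return md.digest();
    }
}
```

Note that a mapping stays valid until the buffer is garbage collected, which on some platforms can delay deleting or rewriting the file.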

Alexander
+8  A: 

Optimum buffer size is related to a number of things: file system block size, CPU cache size and cache latency.

Most file systems are configured to use block sizes of 4096 or 8192. In theory, if you configure your buffer size so you are reading a few bytes more than the disk block, the operations with the file system can be extremely inefficient (i.e. if you configured your buffer to read 4100 bytes at a time, each read would require 2 block reads by the file system). If the blocks are already in cache, then you wind up paying the price of RAM -> L3/L2 cache latency. If you are unlucky and the blocks are not in cache yet, then you pay the price of the disk->RAM latency as well.
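The arithmetic behind the 4100-byte example can be sketched as follows. This only counts how many file-system blocks each read overlaps; it is a simplified model with made-up names and it deliberately ignores any caching by the OS or the drive:

```java
// Hypothetical helper for the block-alignment arithmetic; not a real I/O model.
class BlockSpan {
    // Number of blocks overlapped by the byte range [offset, offset + length).
    static long blocksTouched(long offset, int length, int blockSize) {
        long first = offset / blockSize;
        long last = (offset + length - 1) / blockSize;
        return last - first + 1;
    }

    // Sum of blocks touched by sequential reads of bufferSize bytes each.
    static long totalBlockTouches(long fileSize, int bufferSize, int blockSize) {
        long total = 0;
        for (long off = 0; off < fileSize; off += bufferSize) {
            int len = (int) Math.min(bufferSize, fileSize - off);
            total += blocksTouched(off, len, blockSize);
        }
        return total;
    }
}
```

For a 1 MiB file with 4096-byte blocks, totalBlockTouches gives 256 with a 4096-byte buffer and 511 with a 4100-byte buffer: the misaligned buffer touches nearly every block twice.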

This is why you see most buffers sized as a power of 2, and generally larger than (or equal to) the disk block size. This means that one of your stream reads could result in multiple disk block reads - but those reads will always use a full block - no wasted reads.

Now, this is offset quite a bit in a typical streaming scenario because the block that is read from disk is going to still be in memory when you hit the next read (we are doing sequential reads here, after all) - so you wind up paying the RAM -> L3/L2 cache latency price on the next read, but not the disk->RAM latency. In terms of order of magnitude, disk->RAM latency is so slow that it pretty much swamps any other latency you might be dealing with.

So, I suspect that if you ran a test with different buffer sizes (haven't done this myself), you will probably find a big impact of buffer size up to the size of the file system block. Above that, I suspect that things would level out pretty quickly.

There are a ton of conditions and exceptions here - the complexities of the system are actually quite staggering (just getting a handle on L3 -> L2 cache transfers is mind bogglingly complex, and it changes with every CPU type).

This leads to the 'real world' answer: If your app is like 99% out there, set the buffer size to 8192 and move on (even better, choose encapsulation over performance and use BufferedInputStream to hide the details). If you are in the 1% of apps that are highly dependent on disk throughput, craft your implementation so you can swap out different disk interaction strategies, and provide the knobs and dials to allow your users to test and optimize (or come up with some self-optimizing system).
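One possible shape for the "swappable strategy with knobs" idea, as a sketch only. The interface, class names, and the system property name hash.bufferSize are all invented for illustration; the 8192 default follows the suggestion above:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical strategy interface: callers depend on this, not on how bytes are read.
interface FileHasher {
    byte[] hash(String path) throws IOException, NoSuchAlgorithmException;
}

// One strategy: plain stream reads with a configurable buffer size.
class StreamFileHasher implements FileHasher {
    private final int bufferSize;

    StreamFileHasher(int bufferSize) {
        this.bufferSize = bufferSize;
    }

    @Override
    public byte[] hash(String path) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] buffer = new byte[bufferSize];
        try (InputStream in = new FileInputStream(path)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read);
            }
        }
        return md.digest();
    }
}

class FileHashers {
    // The "knob": -Dhash.bufferSize=N (an assumed property name), default 8192.
    static FileHasher fromSystemProperty() {
        return new StreamFileHasher(Integer.getInteger("hash.bufferSize", 8192));
    }
}
```

Other strategies (memory-mapped, buffered, multi-threaded) could implement the same interface, so users can switch between them and measure without touching the calling code.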

Kevin Day