
What is an efficient way, in a multithreaded Java application, for many threads to read the exact same file (> 1 GB in size) and expose it as an input stream? I've noticed that with many threads (> 32) the system starts to contend over I/O and spends a lot of time in I/O waits.

I've considered loading the file into a byte array that's shared by all the threads - each thread would create a ByteArrayInputStream, but allocating a 1GB byte array just won't work well.

I've also considered using a single FileChannel, with each thread creating an InputStream on top of it using Channels.newInputStream(); however, it's the FileChannel that maintains the position state for those streams, so concurrent readers interfere with each other.
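To make the problem concrete, here is a minimal sketch (the file name is just a placeholder) of why that breaks down: streams created with Channels.newInputStream() all read from the channel's single shared position, so one reader advances the cursor out from under the others.

    import java.io.InputStream;
    import java.nio.channels.Channels;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class SharedChannelDemo {
        public static void main(String[] args) throws Exception {
            // One channel shared by two stream views; the path is only an example.
            try (FileChannel channel = FileChannel.open(Paths.get("big-file.bin"),
                                                        StandardOpenOption.READ)) {
                InputStream a = Channels.newInputStream(channel);
                InputStream b = Channels.newInputStream(channel);

                a.read(new byte[16]); // advances the channel's single position...
                // ...so b no longer starts at offset 0.
                System.out.println("channel position is now " + channel.position());
            }
        }
    }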

+7  A: 

It seems to me that you're going to have to load the file into memory if you want to avoid IO contention. The operating system will do some buffering, but if you're finding that's not enough, you're going to have to do it yourself.

Do you really need 32 threads though? Presumably you don't have nearly that many cores - so use fewer threads and you'll get less context switching etc.

Do your threads all process the file from start to finish? If so, could you effectively split the file into chunks? Read the first (say) 10MB of data into memory, let all the threads process it, then move on to the next 10MB etc.

If that doesn't work for you, how much memory do you have compared with the size of the file? If you have plenty of memory but you don't want to allocate one huge array, you could read the whole file into memory, but into lots of separate smaller byte arrays. You'd then have to write an input stream which spans all of those byte arrays, but that should be doable.
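A minimal sketch of that last idea (the class name and chunk size are placeholders): read the file once into a list of fixed-size byte arrays, then hand each thread its own SequenceInputStream over per-chunk ByteArrayInputStreams. ByteArrayInputStream wraps the shared arrays without copying, so the per-thread streams are cheap.

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.RandomAccessFile;
    import java.io.SequenceInputStream;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    /** Loads a file into fixed-size chunks once, then hands each thread its own cheap stream view. */
    public class ChunkedFileLoader {
        private static final int CHUNK_SIZE = 16 * 1024 * 1024; // 16 MB per chunk (example value)
        private final List<byte[]> chunks = new ArrayList<>();

        public ChunkedFileLoader(String path) throws IOException {
            try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
                long remaining = file.length();
                while (remaining > 0) {
                    byte[] chunk = new byte[(int) Math.min(CHUNK_SIZE, remaining)];
                    file.readFully(chunk);
                    chunks.add(chunk);
                    remaining -= chunk.length;
                }
            }
        }

        /** Each caller gets an independent stream; the shared byte arrays are not copied. */
        public InputStream newStream() {
            List<InputStream> views = new ArrayList<>();
            for (byte[] chunk : chunks) {
                views.add(new ByteArrayInputStream(chunk));
            }
            return new SequenceInputStream(Collections.enumeration(views));
        }
    }

Each thread just calls newStream() on the shared loader and reads independently.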

Jon Skeet
@jon, would it be possible to use the nio tools to map a Java structure to the file on disk so all that is needed is to write out the java structure and let the JVM/OS figure out how to handle the actual reading details?
Thorbjørn Ravn Andersen
@Thorbjorn: Well Java supports memory mapped files, but if you have more information than the OS does about how you're going to use the file, you may be able to do better.
Jon Skeet
+1  A: 

A few ideas:

  1. Write a custom InputStream implementation that acts as a view onto a FileChannel. Write it so that it does not rely on any state in the FileChannel (i.e., each instance keeps track of its own position, and reading uses absolute reads on the underlying FileChannel); a sketch follows below this list. This at least gets you around the trouble you had with Channels.newInputStream(), but it may not solve your I/O contention issues.

  2. Write a custom InputStream implementation that acts as a view onto a MappedByteBuffer. Memory mapping shouldn't be as bad as actually reading the whole thing into memory at once, but you'll still eat up 1GB of virtual address space.

  3. Same as #1, but with some sort of shared caching layer. I wouldn't try this unless #1 turns out not to be efficient enough and #2 isn't feasible. Really, the OS should already be doing some caching for you in #1, so here you're essentially trying to be smarter than the OS filesystem cache.
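A minimal sketch of option 1 (the class name is a placeholder): each stream keeps its own offset and uses FileChannel.read(ByteBuffer, long), the absolute-read overload that never touches the channel's shared position, so any number of these views can safely share one FileChannel.

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    /**
     * An InputStream view over a shared FileChannel that keeps its own position
     * and only uses absolute reads, so concurrent streams never interfere.
     */
    public class ChannelBackedInputStream extends InputStream {
        private final FileChannel channel;
        private long position;

        public ChannelBackedInputStream(FileChannel channel) {
            this.channel = channel;
        }

        @Override
        public int read() throws IOException {
            byte[] single = new byte[1];
            int n = read(single, 0, 1);
            return n == -1 ? -1 : single[0] & 0xFF;
        }

        @Override
        public int read(byte[] buffer, int offset, int length) throws IOException {
            ByteBuffer target = ByteBuffer.wrap(buffer, offset, length);
            int read = channel.read(target, position); // absolute read: does not touch channel.position()
            if (read > 0) {
                position += read;
            }
            return read;
        }
    }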

Laurence Gonsalves
+3  A: 

You can open the file multiple times in read-only mode and access it in any way you want; just leave the caching to the OS. If that turns out to be too slow, you might consider some kind of chunk-based caching that all threads can share.
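A minimal sketch of that approach (the buffer size is just an example): each thread opens its own read-only stream on the same path, so there is no shared stream state, and the OS page cache serves repeated reads of the same blocks from memory.

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    public class PerThreadReader {
        /** Each thread opens its own independent read-only stream on the same file. */
        public static InputStream openOwnStream(String path) throws IOException {
            return new BufferedInputStream(new FileInputStream(path), 1 << 16); // 64 KB buffer (example size)
        }
    }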

Patrick Cornelissen
A: 

That's a very big file. Can you get it delivered as a smaller set of files? Just delivering this file will be a big job, even on a corporate network.

Sometimes it is easier to change the process than the program.

You may even be better off writing something to split the file into a number of chunks and process them separately, as sketched below.
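If splitting is the route you take, here is a rough sketch (the part-file naming and class name are just examples) that uses FileChannel.transferTo() to cut the file into smaller part files:

    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class FileSplitter {
        /** Splits the input into numbered part files of at most chunkSize bytes each. */
        public static void split(Path input, long chunkSize) throws IOException {
            try (FileChannel in = FileChannel.open(input, StandardOpenOption.READ)) {
                long size = in.size();
                long offset = 0;
                int part = 0;
                while (offset < size) {
                    Path partPath = Paths.get(input.toString() + ".part" + part++);
                    try (FileChannel out = FileChannel.open(partPath,
                            StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                            StandardOpenOption.TRUNCATE_EXISTING)) {
                        long toCopy = Math.min(chunkSize, size - offset);
                        long copied = 0;
                        while (copied < toCopy) {
                            // transferTo may copy fewer bytes than requested, so loop until done.
                            copied += in.transferTo(offset + copied, toCopy - copied, out);
                        }
                        offset += toCopy;
                    }
                }
            }
        }
    }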

Fortyrunner