You definitely do NOT want to load a 300MB file into a single large buffer with Java. The way you're doing things is supposed to be more efficient for large files than just using normal I/O, but when you run a Matcher against an entire file mapped into memory as you are, you can very easily exhaust memory.
First, your code maps the file into memory. This consumes 300 MB of your virtual address space as the file is mmaped into it, although this is outside the heap. (Note that the 300 MB of virtual address space is tied up until the MappedByteBuffer is garbage collected. See below for discussion. The JavaDoc for map warns you about this.) Next, you create a ByteBuffer backed by this mmaped file. This should be fine, as it's just a "view" of the mmaped file and should thus take minimal extra memory. It will be a small object in the heap with a "pointer" to a large object outside the heap. Next, you decode this into a CharBuffer, which means you make a copy of the 300 MB buffer, but you make a 600 MB copy (on the heap) because a char is 2 bytes.
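To make the doubling concrete, here is a minimal sketch. It uses a small in-memory ByteBuffer in place of the mapped file and assumes a single-byte encoding (ISO-8859-1); the class and method names are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

public class DecodeCost {
    /**
     * Returns the number of bytes of char data produced by decoding
     * the given buffer. Hypothetical helper, for illustration only.
     */
    static long decodedHeapBytes(ByteBuffer encoded) {
        // decode() allocates a brand-new CharBuffer on the heap;
        // duplicate() leaves the caller's position untouched.
        CharBuffer chars = StandardCharsets.ISO_8859_1.decode(encoded.duplicate());
        // Every Java char is 2 bytes, whatever the source encoding was.
        return (long) chars.remaining() * Character.BYTES;
    }

    public static void main(String[] args) {
        ByteBuffer encoded = ByteBuffer.wrap(new byte[1024]);
        // 1024 single-byte characters become 2048 bytes of char data.
        System.out.println(decodedHeapBytes(encoded));
    }
}
```

Scaled up, the same arithmetic turns a 300 MB single-byte file into roughly 600 MB of char data on the heap. With a multi-byte encoding such as UTF-8 the ratio would be different.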
To respond to a comment, and looking at the JDK source code to be sure: when you call map() as the OP does, you do in fact map the entire file into memory. Looking at the OpenJDK 6 b14 Windows native code in sun.nio.ch.FileChannelImpl.c, it first calls CreateFileMapping, then calls MapViewOfFile. Looking at this source, if you ask to map the whole file into memory, this method will do exactly as you ask. To quote MSDN:
Mapping a file makes the specified portion of a file visible in the
address space of the calling process.
For files that are larger than the address space, you can only map a small portion
of the file data at one time. When the first view is complete, you can unmap it and
map a new view.
The way the OP is calling map, the "specified portion" of the file is the entire file. This won't contribute to heap exhaustion, but it can contribute to virtual address space exhaustion, which is still an OOM error. This can kill your application just as thoroughly as running out of heap.
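The MSDN advice to map one view at a time can be approximated from Java by passing a position and size to map(). The following is only a sketch: the class name, method name, and window size are assumptions, and it counts occurrences of a byte rather than running a regex.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class WindowedMap {
    /**
     * Counts occurrences of target in the file, mapping at most
     * windowSize bytes at a time instead of the whole file.
     * Hypothetical helper, for illustration only.
     */
    static long countByte(Path file, byte target, long windowSize) throws IOException {
        long count = 0;
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            for (long pos = 0; pos < size; pos += windowSize) {
                long length = Math.min(windowSize, size - pos);
                MappedByteBuffer window =
                        channel.map(FileChannel.MapMode.READ_ONLY, pos, length);
                while (window.hasRemaining()) {
                    if (window.get() == target) {
                        count++;
                    }
                }
                // The window is unreachable after this iteration, but its
                // address space is only released once it is garbage collected.
            }
        }
        return count;
    }
}
```

Each window still stays mapped until it is garbage collected, so this limits the size of any single mapping rather than guaranteeing a low total. A pattern that can match across a window boundary would need overlapping windows, which this sketch does not attempt.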
Finally, when you make a Matcher, the Matcher potentially makes more copies of this 600 MB CharBuffer, depending on how you use it. Ouch. That's a lot of memory used by a small number of objects! Given a Matcher, every time you call toMatchResult(), you'll make a String copy of the entire CharBuffer. Also, every time you call replaceAll(), at best you will make a String copy of the entire CharBuffer. At worst you will make a StringBuffer that will slowly be expanded to the full size of the replaceAll result (applying a lot of memory pressure on the heap), and then make a String from that.
Thus, if you call replaceAll on a Matcher against a 300 MB file, and your match is found, then you'll first make a series of ever-larger StringBuffers until you get one that is 600 MB. Then you'll make a String copy of this StringBuffer. This can quickly and easily lead to heap exhaustion.
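One way to sidestep the growing StringBuffer, sketched here under assumptions, is to drive the Matcher with find() and write each piece straight to a Writer. The class and method names are hypothetical, and the replacement is treated as a literal string (no $1 style group references).

```java
import java.io.IOException;
import java.io.Writer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StreamingReplace {
    /**
     * Writes text to out with every match of pattern replaced by the
     * literal replacement, and returns the number of replacements.
     * Hypothetical helper, for illustration only.
     */
    static int replaceTo(CharSequence text, Pattern pattern,
                         String replacement, Writer out) throws IOException {
        Matcher matcher = pattern.matcher(text);
        int last = 0;
        int count = 0;
        while (matcher.find()) {
            // Copy only the unmatched gap before this match.
            out.append(text, last, matcher.start());
            out.write(replacement);
            last = matcher.end();
            count++;
        }
        // Copy whatever follows the final match.
        out.append(text, last, text.length());
        return count;
    }
}
```

The result is never accumulated in memory by this method; what the Writer does with it is up to the caller, so a buffered file writer would keep the footprint small. A long stretch without matches can still be copied as one large temporary String by some Writer implementations.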
Here's the bottom line: Matchers are not optimized for working on very large buffers. You can very easily, and without planning to, make a number of very large objects. I discovered this when doing something similar enough to what you're doing and encountering memory exhaustion, then looking at the source code for Matcher.
NOTE: There is no unmap call. Once you call map, the virtual address space outside the heap tied up by the MappedByteBuffer is stuck there until the MappedByteBuffer is garbage collected. As a result, you will be unable to perform certain operations on the file (delete, rename, ...) until the MappedByteBuffer is garbage collected. If you call map enough times on different files, but don't have sufficient memory pressure in the heap to force a garbage collection, you can run out of memory outside the heap. For a discussion, see Bug 4724038.
As a result of all of the discussion above, if you will be using memory mapped I/O to make a Matcher on large files, and you will be using replaceAll on the Matcher, then memory mapped I/O is probably not the way to go. It will simply create too many large objects on the heap as well as using up a lot of your virtual address space outside the heap. Under 32-bit Windows, you have only 2GB (or if you have changed settings, 3GB) of virtual address space for the JVM, and this will apply significant memory pressure both inside and outside the heap.
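As one possible alternative, here is a sketch that streams the file line by line with ordinary I/O and runs replaceAll on each line. It assumes the pattern never needs to match across a line break; the class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineReplace {
    /**
     * Copies in to out, applying replaceAll to each line separately.
     * Line terminators are normalized to "\n" and the last line always
     * gets one. Hypothetical helper, for illustration only.
     */
    static void replaceLines(Reader in, Writer out,
                             Pattern pattern, String replacement) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        BufferedWriter writer = new BufferedWriter(out);
        // One Matcher, reset for every line, so no per-line Matcher allocation.
        Matcher matcher = pattern.matcher("");
        String line;
        while ((line = reader.readLine()) != null) {
            // Only a single line is ever copied by replaceAll here.
            writer.write(matcher.reset(line).replaceAll(replacement));
            writer.write('\n');
        }
        writer.flush();
    }
}
```

With this shape the largest objects on the heap are about the size of the longest line, not the whole file. Passing a FileReader and FileWriter (or their charset-aware equivalents) would apply it to files on disk.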
I apologize for the length of this answer, but I wanted to be thorough. If you think any part of the above is wrong, please comment and say so. I will not do retaliatory downvotes. I am very positive that all of the above is accurate, but if something is wrong, I want to know.