views: 693
answers: 4

Hi,

My task is to open a large file in read/write mode, find a portion of text in it by searching for its start and end points, write that portion out to a new file, and then delete it from the original file.

I will repeat this process many times, so I thought it would be easiest to load the file into memory as a CharBuffer and search it with the Matcher class. But I am getting a heap space exception while reading, even after increasing the heap to 900 MB by running java -Xms128m -Xmx900m readLargeFile. My code is:

FileChannel fc = new FileInputStream(fFile).getChannel();
CharBuffer chrBuff = Charset.forName("8859_1").newDecoder()
        .decode(fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size()));

For the above code, everyone has told me that loading everything into memory is a bad idea, and that a 300 MB file will take 600 MB in memory because of the character set (2 bytes per char).

That is my task; please suggest some efficient ways to do it. Note that my files may be even larger, and I have to do this in Java only.

Thanks in advance.

+2  A: 

Does your search pattern ever match more than one line? If not, then the easiest solution is to read the file line by line, as in the sketch below. Simple, really.

But if the search pattern can match across multiple lines, then let us know, because searching line by line will not work in that case.
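
For the single-line case, here is a rough sketch; the pattern, the file names and the charset are placeholders, and error handling is left out.

// Read line by line: matching lines go to the new file, everything else
// is written to a replacement for the original file.
Pattern pattern = Pattern.compile("START.*END");   // placeholder pattern
BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream("big.txt"), "8859_1"));
PrintWriter matched = new PrintWriter(new FileWriter("matched.txt"));
PrintWriter kept = new PrintWriter(new FileWriter("kept.txt"));

String line;
while ((line = in.readLine()) != null) {
    if (pattern.matcher(line).find()) {
        matched.println(line);   // the extracted portion
    } else {
        kept.println(line);      // what remains of the original
    }
}
in.close();
matched.close();
kept.close();

Since you cannot delete a region from the middle of a file in place, the usual trick is to write the lines you keep to a temporary file and then rename it over the original when you are done.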

hhafez
A: 

You definitely do NOT want to load a 300MB file into a single large buffer with Java. The way you're doing things is supposed to be more efficient for large files than just using normal I/O, but when you run a Matcher against an entire file mapped into memory as you are, you can very easily exhaust memory.

First, your code memory-maps the file: this consumes 300 MB of your virtual address space as the file is mmapped into it, although it is outside the heap. (Note that the 300 MB of virtual address space is tied up until the MappedByteBuffer is garbage collected. See below for discussion. The JavaDoc for map warns you about this.) Next, you have a ByteBuffer backed by this mmapped file. That part is fine: it is just a "view" of the mmapped file and takes minimal extra memory, a small object in the heap with a "pointer" to a large object outside the heap. Next, you decode this into a CharBuffer. That means copying the contents of the 300 MB buffer, and the copy is 600 MB (on the heap) because a char is 2 bytes.
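
To make that concrete, here is the OP's code split into its steps, with the approximate cost of each step noted; the sizes assume a 300 MB ISO-8859-1 file.

FileChannel fc = new FileInputStream(fFile).getChannel();

// Step 1: map the whole file. This ties up roughly file-size bytes of the
// process's virtual address space, outside the Java heap.
MappedByteBuffer mapped = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());

// Step 2: decode. The decoder allocates a brand-new CharBuffer on the heap:
// one char per byte for ISO-8859-1 at 2 bytes per char, so ~600 MB of heap
// for a 300 MB file, in one contiguous object.
CharBuffer chrBuff = Charset.forName("8859_1").newDecoder().decode(mapped);

// Step 3: any Matcher built over chrBuff can then create further large
// copies (toMatchResult, replaceAll), as described below.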

To respond to a comment, and having looked at the JDK source code to be sure: when you call map() as the OP does, you do in fact map the entire file into memory. In the OpenJDK 6 b14 Windows native code (sun.nio.ch.FileChannelImpl.c), map() first calls CreateFileMapping and then MapViewOfFile, so if you ask to map the whole file into memory, that is exactly what you get. To quote MSDN:

Mapping a file makes the specified portion of a file visible in the address space of the calling process.

For files that are larger than the address space, you can only map a small portion of the file data at one time. When the first view is complete, you can unmap it and map a new view.

The way the OP is calling map, the "specified portion" of the file is the entire file. This won't contribute to heap exhaustion, but it can contribute to virtual address space exhaustion, which is still an OOM error. This can kill your application just as thoroughly as running out of heap.

Finally, when you make a Matcher, the Matcher potentially makes more copies of this 600 MB CharBuffer, depending on how you use it. Ouch. That's a lot of memory used by a small number of objects! Given a Matcher, every time you call toMatchResult(), you'll make a String copy of the entire CharBuffer. Also, every time you call replaceAll(), at best you will make a String copy of the entire CharBuffer. At worst you will make a StringBuffer that will slowly be expanded to the full size of the replaceAll result (applying a lot of memory pressure on the heap), and then make a String from that.

Thus, if you call replaceAll on a Matcher against a 300 MB file, and your match is found, then you'll first make a series of ever-larger StringBuffers until you get one that is 600 MB. Then you'll make a String copy of this StringBuffer. This can quickly and easily lead to heap exhaustion.
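
If all you actually need is to pull the matched region out into another file (rather than build a replaced copy of the whole input), you can avoid those giant intermediate buffers by using find() and the match offsets and writing the pieces out yourself. A rough sketch follows; the Pattern, the CharSequence named input and the file names are placeholders, error handling is omitted, and this does not by itself solve the problem of holding input in memory.

// Use the match offsets directly instead of replaceAll(), so no full-size
// String or StringBuffer copy of the input is ever built.
Matcher m = pattern.matcher(input);
Writer extracted = new BufferedWriter(new FileWriter("extracted.txt"));
Writer kept = new BufferedWriter(new FileWriter("kept.txt"));

int last = 0;
while (m.find()) {
    kept.append(input, last, m.start());          // text before the match
    extracted.append(input, m.start(), m.end());  // the matched region
    last = m.end();
}
kept.append(input, last, input.length());         // text after the last match
extracted.close();
kept.close();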

Here's the bottom line: Matchers are not optimized for working on very large buffers. You can very easily, and without planning to, create a number of very large objects. I discovered this when I was doing something very similar to what you're doing, hit memory exhaustion, and then read the source code for Matcher.

NOTE: There is no unmap call. Once you call map, the virtual address space outside the heap tied up by the MappedByteBuffer is stuck there until the MappedByteBuffer is garbage collected. As a result, you will be unable to perform certain operations on the file (delete, rename, ...) until the MappedByteBuffer is garbage collected. If you call map enough times on different files, but don't have sufficient memory pressure in the heap to force a garbage collection, you can run out of memory outside the heap. For a discussion, see Bug 4724038.
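
An unsupported workaround that is often mentioned in connection with that bug is to invoke the buffer's Cleaner by hand. It relies on JDK-internal classes, so it is specific to the Sun/Oracle JDK and may break on other or future VMs; a sketch:

// Force the mapping to be released now instead of waiting for garbage
// collection. The buffer must never be touched again after this call,
// or the JVM can crash.
static void unmap(MappedByteBuffer buffer) {
    sun.misc.Cleaner cleaner = ((sun.nio.ch.DirectBuffer) buffer).cleaner();
    if (cleaner != null) {
        cleaner.clean();
    }
}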

The upshot of all of the above: if you will be using memory-mapped I/O to feed a Matcher over large files, and especially if you will be calling replaceAll on that Matcher, then memory-mapped I/O is probably not the way to go. It will simply create too many large objects on the heap, as well as using up a lot of your virtual address space outside the heap. Under 32-bit Windows you have only 2 GB (or, if you have changed settings, 3 GB) of virtual address space for the JVM, and this will apply significant memory pressure both inside and outside the heap.

I apologize for the length of this answer, but I wanted to be thorough. If you think any part of the above is wrong, please comment and say so. I will not do retaliatory downvotes. I am very positive that all of the above is accurate, but if something is wrong, I want to know.

Eddie
It should be noted that FileChannel.map() returns a Direct Byte Buffer. While yes, it is mapped, it is not necessarily loaded into physical memory, and hence does not necessarily contribute to memory exhaustion, nor count toward your 1.5GB claim.
Stu Thompson
@Stu Thompson: Look at the JDK source code. It is in fact loaded into memory -- at least initially into the virtual address space of the process, and as it is accessed, it certainly occupies real memory. However, it is most likely outside the heap.
Eddie
A: 

Claims that FileChannel.map will load the entire file into memory are faulty, with reference to the MappedByteBuffer that FileChannel.map() returns. It is a 'Direct Byte Buffer', so it will not exhaust your memory (Direct Byte Buffers use the OS virtual memory subsystem to page data in and out of memory as required, allowing one to address much larger chunks of memory as if they were physical RAM). But then again, a single MBB will only work for files up to ~2GB.

Try this:

FileChannel fc = new FileInputStream(fFile).getChannel();
MappedByteBuffer mbb = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());

CharBuffer chrBuff = mbb.asCharBuffer();

It will not load the entire file into memory, and chrBuff is only a view of the backing MappedByteBuffer, not a copy.

I'm not sure how to handle the decoding, though.
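
One possibility (a rough, untested sketch, using the mbb from above) would be to run a CharsetDecoder over the mapped buffer in fixed-size chunks, so that only a small CharBuffer ever lives on the heap:

// Decode the mapped file a chunk at a time instead of all at once;
// only 64K chars are on the heap at any moment.
CharsetDecoder decoder = Charset.forName("8859_1").newDecoder();
CharBuffer chunk = CharBuffer.allocate(64 * 1024);

while (mbb.hasRemaining()) {
    chunk.clear();
    decoder.decode(mbb, chunk, false);   // fills 'chunk' or drains 'mbb'
    chunk.flip();
    // ... search this chunk here (e.g. pattern.matcher(chunk).find())
    // and/or write it out. A match that straddles two chunks will be
    // missed unless some overlap is carried between iterations.
}
// For a single-byte charset like ISO-8859-1 nothing more is needed; a
// multi-byte charset would also need a final decode(..., true) and flush().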

Stu Thompson
You're not exactly correct. Look at the JDK source code. The entire file will be mapped into the Java process's virtual address space. In 32-bit Windows, you only have up to 2 GB of virtual address space unless you have made some changes. If you allocate a large heap, you leave less room for stuff outside the heap, and you can easily run into problems. We don't know what else the OP's program is doing, and what other native resources are allocated outside the heap.
Eddie
In addition, your suggestion won't work at all. When you call mbb.asCharBuffer(), you are not doing any decoding. You are just treating the ByteBuffer as a CharBuffer of half the number of characters. That is, when you get the first character, it will read TWO BYTES from the ByteBuffer and treat them as one char. This will work only if the file in question is using a 2-byte encoding scheme, which is NOT the case for ISO-8859-1, the encoding used by the OP. Look at the source code and read MSDN. This doesn't do what you thought it did.
Eddie
By the way, the reason that a single MBB will only work for files up to about 2 GB is that that's the limit of your virtual address space under Windows by default. A mapped file is absolutely mapped into your virtual address space, but outside the heap. Also, note that there is no unmap() call. You have to wait for the DirectByteBuffer to be garbage collected. See http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4724038
Eddie
1) At no point did I or the OP mention 32-bit Windows, so your Windows-specific points may be valid, but not for the ambiguous topic here. Think bigger. 2) That limit on 32-bit Windows applies to *all* memory allocation by the JVM (any single process, as I understand it, on Windows) and not just to a single MappedByteBuffer. 3) Individual MappedByteBuffers are limited to ~2GB on ALL platforms because FileChannel.map's size is capped at a Java int (Integer.MAX_VALUE), even on a 64-bit OS with >2GB of addressable space. 4) MappedByteBuffers are NOT allocated on the heap, so your heap argument is moot.
Stu Thompson
@Eddie comment #2: I specifically did not claim to solve the decoding problem. And I don't do MSDN, as I don't develop on or for Windows.
Stu Thompson
@Eddie comment #3: 1) Again, think bigger than just Windows; 2) I repeat: FileChannel.map is capped at a Java int for its size, and that is a cross-platform limit for any single MBB; 3) The unmap bug has potential workarounds, if one is actually needed. My colleagues and I use MBBs in our applications without issue. The bug does not necessarily negate the utility and performance gains that can be had with MBBs.
Stu Thompson
@Eddie: And, lastly, this bug is Windows specific. We do not know the platform the Questioner is on, we do not know if he is on a 64-bit OS. Assume nothing.
Stu Thompson
A: 

In my case, adding -Djava.compiler=NONE after the classpath solved the problem.

Alan