views:

1203

answers:

9

I have the following code, which reads in a file, appends \r\n to the end of each line and puts the result in a string buffer:

public InputStream getInputStream() throws Exception {
    StringBuffer holder = new StringBuffer();
    try {
        FileInputStream reader = new FileInputStream(inputPath);
        BufferedReader br = new BufferedReader(new InputStreamReader(reader));
        String strLine;
        // Read the file line by line
        boolean start = true;
        while ((strLine = br.readLine()) != null) {
            if (!start) {
                holder.append("\r\n");
            }
            holder.append(strLine);
            start = false;
        }
        // Close the input stream (closing the reader also closes the
        // underlying FileInputStream)
        br.close();
    } catch (Throwable e) { // this is where the heap error is caught, up to 2Gb
        System.err.println("Error: " + e.getMessage());
    }
    return new StringBufferInputStream(holder.toString());
}

I tried reading in a 400Mb file and changed the max heap space to 2Gb, yet it still gives the out-of-memory heap exception. Any ideas?

+12  A: 

To begin with, Java strings are UTF-16 (i.e. 2 bytes per character), so assuming your input file is ASCII or a similar one-byte-per-character format, holder will be ~2x the size of the input data, plus the extra \r\n per line and any additional overhead. That's ~800MB straight away, assuming very low storage overhead in the StringBuffer.

I could also believe that the contents of your file are buffered twice - once at the I/O level and once in the BufferedReader.

However, to know for sure, it's probably best to look at what's actually on the heap - use a tool like HPROF to see exactly where your memory has gone.

In terms of solving this, I suggest you process a line at a time, writing out each line after you have added the line termination; a sketch follows below. That way your memory usage should be proportional to the length of a line, instead of the entire file.
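
A minimal sketch of the idea, assuming the caller can supply a destination stream (the method name and signature here are hypothetical, not from the original code):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

// Hypothetical sketch: stream each line straight to the destination
// instead of accumulating the whole file in a StringBuffer, so memory
// use is bounded by the longest line rather than by the file size.
public static void copyWithCrLf(String inputPath, OutputStream out) throws IOException {
    BufferedReader br = new BufferedReader(new FileReader(inputPath));
    try {
        Writer writer = new OutputStreamWriter(out);
        String line;
        boolean first = true;
        while ((line = br.readLine()) != null) {
            if (!first) {
                writer.write("\r\n");
            }
            writer.write(line);
            first = false;
        }
        writer.flush();
    } finally {
        br.close();
    }
}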

Dave Rigby
I already considered that, but it still doesn't explain why it went past 2Gb (and possibly more; I haven't tested past 2Gb).
erotsppa
Your app has a lot less heap available than 2Gb. E.g. on Windows, the address space of a single process is just 2Gb by default. Within that 2Gb you have to fit the mappings for all .dlls, the Java VM probably reserves some space for itself, etc. Within the remaining part you'll have memory fragmentation, which can prevent a BIG object such as your array from being reallocated (reallocation needs to copy the whole thing and then free the original), because there's no contiguous space left for such a big thing - just small holes of free space where small things can fit.
nos
+13  A: 

It's an interesting question, but rather than stress over why Java is using so much memory, why not try a design that doesn't require your program to load the entire file into memory?

Chris W. Rea
I'm surprised I got down-voted on this response. Really, sometimes we developers waste time trying to figure out why a specific way of doing things isn't working as we'd hoped, when we ought to perhaps step back and try a different approach. I think any time one is dealing with very large files and loading the whole thing into memory, the first question should be "why?"
Chris W. Rea
When a developer asks for a solution, there's obviously a reason. Don't assume every question asked is from a high school student.
erotsppa
@erotsppa: So...what's the reason?
Andy Mikula
@erotsppa: Agreed. That's why I asked "why not", as opposed to stating imperatively "you should." I myself was asking a question as to why another approach wasn't considered. Don't assume every answer condescends :-)
Chris W. Rea
You don't have to be a high school student to get bogged down in the details and miss the bigger picture/alternate solutions..
Andrew Coleson
@Andreas_D: Disagree. Can solve a problem without directly answering the question. Often the question is the problem!
Chris W. Rea
@Andreas_D: Strongly disagree. I think cwrea's response is valid, and your down vote should be rescinded.
duffymo
You can bet that there wouldn't be a down vote for cwrea's answer if Jon Skeet posted the same thing.
duffymo
@duffymo: Nah, please no rescinding ... I gladly accept some criticism, that's part of what makes the community work :-)
Chris W. Rea
I think it's a valid point but it doesn't offer much / any help and so 6 up-votes seems rather excessive.
Adamski
@Andreas_D: Depends if you view the question as "why specifically am I seeing this exception" as opposed to "how can I avoid this exception". If the question is the latter, then an answer suggesting to re-design the program to avoid large memory consumption is useful. Offering explanations on Java internals won't help the OP with the fact that no matter what they *micro-tune*, the _fundamental_ approach of loading the file into memory has a bad side effect: the program won't scale and will hit a wall, eventually, despite any micro-tuning.
Chris W. Rea
This is not an answer but a veeeery useful comment. It should go in the comments section rather than in the answers section, and should not be upvoted (since it doesn't address the question): http://bit.ly/MohSi
OscarRyz
@cwrea: I'd argue that it's difficult to judge whether this approach is fundamentally wrong (and that the program will hit a wall) without knowing more about the application. It could be that the app is only ever reading / storing one file in memory, the host machine could have 256Gb of memory, the file size will never exceed X, etc.
Adamski
@Adamski: Agreed, which is again why I asked "why not [...]?" I didn't just phrase my answer in the form of a question because I've been watching too much Jeopardy! :-)
Chris W. Rea
Look at the method's return value - the approach is fundamentally wrong with almost 100% certainty and this is the only sensible answer.
Michael Borgwardt
The answer addresses the problem, perhaps not the specific question, but who cares if it solves the issue at hand? As far as this "this is not the point of SO" nonsense, well, apparently the SO users disagree as this is the top voted answer.
Ed Swangren
While I think answers that offer another approach to a problem have value, I think it is often of use to answer the actual question. It's much better to understand why one approach is better than another, instead of just taking another approach "because it works". I think the OP may need to consider a different design, but fundamentally is trying to understand things about memory in Java. The lessons from the posted code will prove useful in the future. I don't think answers like this are completely out of place, but I certainly hope it doesn't become the accepted answer. @Ed Swangren: not anymore :-).
Tom
It may not become the ultimate accepted or top answer, but it will be the one with the most comments, LOL!
Chris W. Rea
Let me put it this way; if I ask a question, and some SO user says "hey, you're doing it all wrong to begin with, try this!", and I do, and it works great, I am happy.
Ed Swangren
+12  A: 

It may be due to how the StringBuffer resizes when it reaches capacity - this involves creating a new char[] double the size of the previous one and then copying the contents across into the new array. Together with the points already made about characters in Java being stored as 2 bytes, this will definitely add to your memory usage.

To resolve this you could create a StringBuffer with sufficient capacity to begin with, given that you know the file size (and hence the approximate number of characters to read in) - see the sketch at the end of this answer. However, be warned that a large array allocation and copy will also occur if you then attempt to convert this large StringBuffer into a String.

Another point: You should typically favour StringBuilder over StringBuffer as the operations on it are faster.

You could consider implementing your own "CharBuffer", using for example a LinkedList of char[] to avoid expensive array allocation / copy operations. You could make this class implement CharSequence and perhaps avoid converting to a String altogether. Another suggestion for a more compact representation: if you're reading in English text containing large numbers of repeated words, you could read and store each word, using the String.intern() function to significantly reduce storage.
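
Putting the pre-sizing and StringBuilder points together, a minimal sketch (assuming an ASCII-like one-byte-per-character encoding, where the file length in bytes is an upper bound on the character count; the class and method names are made up for illustration):

import java.io.File;

public class PresizedBuilderExample {
    // Sketch only: size the builder up front so resize/copy cycles are
    // avoided. The extra slack covers the inserted "\r\n" pairs; tune
    // it to taste.
    static StringBuilder presizedBuilderFor(String inputPath) {
        long bytes = new File(inputPath).length();
        return new StringBuilder((int) Math.min(bytes + 4096, Integer.MAX_VALUE));
    }
}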

Adamski
When it creates the new char[] that is double the previous size, is all the memory allocated at once? Suppose the previous char[] is 1Gb - will it attempt to allocate memory for 2Gb immediately, or only when it is actually filled?
erotsppa
It will only allocate the new array when the old one is full.
Adamski
So the old array is 1GB; the old array gets full; a new 2GB array is created and the 1GB array is copied into it (at this point you've got 3GB of memory on your hands); the 1GB array loses its reference and awaits garbage collection; the 2GB array becomes the new storage and its remaining space (1GB, since the first 1GB was copied from the old array) starts getting used.
Sekhat
Exactly - It's a real killer.
Adamski
So the answer would be to use an initial capacity of file.size(), if possible?
OscarRyz
It would seem to be file.size() * 2 at least, plus the number of new lines (for the extra \r being inserted).
Yishai
Yes, if you know the size in advance (which would have to account for the added characters, however), allocating the full size in advance is a good idea.
Michael Borgwardt
@Adamski, @Yishai: Why `file.size() * 2`? The capacity of a `StringBuffer` is counted in characters, not bytes, and there could hardly be more characters in the file than there are bytes (assuming no "exotic" encodings are used). An initial capacity of `file.size() + expectedLineCount * 2` would be more economic.
gustafc
@Gustafc - Apologies; you're correct. I'm going to delete my comment to avoid it causing confusion.
Adamski
@Adamski: you shouldn't typically favor StringBuilder over StringBuffer because it's faster. More specifically, StringBuffer is slower because it is threadsafe. StringBuilder is not threadsafe. If you are not dealing with multiple threads, you should use StringBuilder because it's faster.
Tom
@Tom: Thanks - I meant to write "is faster, as it performs no synchronization.".
Adamski
+3  A: 

It's the StringBuffer. The empty constructor creates a StringBuffer with an initial capacity of 16 characters. Now if you append something and the capacity is not sufficient, it does an array copy of the internal char array to a new, larger buffer.

So in fact, with each line appended the StringBuffer has to create a copy of the complete internal Array which nearly doubles the required memory when appending the last line. Together with the UTF-16 representation this results in the observed memory demand.

Edit

Michael is right when saying that the internal buffer is not incremented in small portions - it roughly doubles in size each time you need more memory. But still, in the worst case, say the buffer needs to expand capacity just with the very last append: it creates a new array twice the size of the current one, so for a moment you need roughly three times the amount of memory.

Anyway, I've learned the lesson: StringBuffer (and Builder) may cause unexpected OutOfMemory errors, and I'll always initialize it with a size, at least when I have to store large Strings. Thanks for the question :)

Andreas_D
-1 not true; StringBuffer will double in size when the current size is insufficient, not in little increments.
Michael Borgwardt
@Andreas, I only have JDK 1.5 with me, but the public java doc says that capacity is increased to at least double, so I don't think they are changing that. Check the ensureCapacity method. It could be that you are misreading it.
Yishai
No, the difference is between the *length* of the abstract sequence of characters, which is of course increased exactly by the number of characters appended, and the *size* of the underlying array, which may be much larger and is expanded in large steps to reduce the amount of copying.
Michael Borgwardt
+1  A: 

At the last insert into the StringBuffer, you need three times the memory allocated, because the StringBuffer always expands to (size + 1) * 2 (which is already double because of Unicode). So a 400Mb file could require an allocation of 800Mb * 3 == 2.4Gb at the end of the inserts. It may be somewhat less; that depends on exactly when the threshold is reached.

The suggestion to concatenate Strings rather than using a Buffer or Builder is in order here. There will be a lot of garbage collection and object creation (so it will be slow), but a much lower memory footprint.

[At Michael's prompting, I investigated this further, and concat wouldn't help here, as it copies the char buffer, so while it wouldn't require triple, it would require double the memory at the end.]

You could continue to use the Buffer (or better yet, the Builder in this case) if you know the maximum size of the file, initialize the size of the Buffer on creation, and are sure this method will only get called from one thread at a time.

But really such an approach of loading such a large file into memory at once should only be done as a last resort.

Yishai
Wow, this question is generating a lot of downvotes on answers. But if you downvote, at least state why.
Yishai
Using string concatenation will take an INSANELY long time. Quite possibly years. No, I am not exaggerating.
Michael Borgwardt
+10  A: 

You have a number of problems here:

  • Unicode: characters take twice as much space in memory as on disk (assuming a 1 byte encoding)
  • StringBuffer resizing: could double (permanently) and triple (temporarily) the occupied memory, though this is the worst case
  • StringBuffer.toString() temporarily doubles the occupied memory since it makes a copy

All of these combined mean that you could temporarily require up to 8 times your file's size in RAM, i.e. 3.2G for a 400M file. Even if your machine physically has that much RAM, it has to be running a 64-bit OS and JVM to actually get that much heap for the JVM.

All in all, it's simply a horrible idea to keep such a huge String in memory - and it's totally unnecessary as well. Since your method returns an InputStream, all you really need is a FilterInputStream that adds the line breaks on the fly, as sketched below.
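
A rough sketch of such a filter (the class name is made up, it assumes the underlying file uses bare '\n' line endings, and mark/reset is not supported). The array-based read() is built on top of the single-byte read(), as discussed in the comments below:

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch: expand every '\n' read from the wrapped stream into "\r\n".
public class CrLfInputStream extends FilterInputStream {
    private boolean pendingLf = false; // a '\n' still owed after emitting '\r'

    public CrLfInputStream(InputStream in) {
        super(in);
    }

    // All the logic lives here.
    @Override
    public int read() throws IOException {
        if (pendingLf) {
            pendingLf = false;
            return '\n';
        }
        int b = in.read();
        if (b == '\n') {
            pendingLf = true; // hand out the '\n' on the next call
            return '\r';
        }
        return b;
    }

    // The array variant just drains the single-byte read().
    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int count = 0;
        for (int i = 0; i < len; i++) {
            int b = read();
            if (b == -1) {
                return count == 0 ? -1 : count;
            }
            buf[off + i] = (byte) b;
            count++;
        }
        return count;
    }
}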

Michael Borgwardt
How should one go about implementing a FilterInputStream subclass that adds the line breaks on the fly?
erotsppa
Just extend FilterInputStream and override its read() methods to detect line breaks and return \r\n before continuing with the rest of the underlying stream. It would get a bit complicated if you want to support mark/reset, but you probably don't need that.
Michael Borgwardt
Another question: what do you actually want to achieve? Normalize line breaks? That seems to be all that this method is actually doing.
Michael Borgwardt
StringBuffer.toString() does not always make a copy. It's copy-on-write, which means the copy is delayed until the next time you modify the StringBuffer.
finnw
My JDK 1.6.0u12 sources disagree with you.
Michael Borgwardt
Michael Borgwardt: which read method should I override? There are a lot. Can you provide sample code?
erotsppa
You'll have to override all of them, but you can have the array-based ones call the parameterless one and have that last one contain all your logic.
Michael Borgwardt
read() returns a single int, so how would I be able to return "\r\n"?
erotsppa
By remembering (in an object field) whether you've just encountered a newline and then returning these characters in consecutive calls.
Michael Borgwardt
Ok lastly, how should I implement the logic of the array-based one on top of the parameterless one?
erotsppa
Nevermind, I copied from the java source code. Not sure if this is the best way to go about it.
erotsppa
+1  A: 

I would suggest you use the OS file cache instead of copying the data into Java memory via characters and back to bytes again. If you re-read the file as required (perhaps transforming it as you go), it will be faster and very likely simpler.

You need over 2 GB because 1-byte letters use a char (2 bytes) in memory, and when your StringBuffer resizes you need double that (to copy the old array to the larger new array). The new array is typically double the size, so you need up to 6x the original file size. If the performance wasn't bad enough, you are using StringBuffer instead of StringBuilder, which synchronizes every call when it is clearly not needed. (This only slows you down, but uses the same amount of memory.) A sketch of the OS-file-cache idea follows.
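
One way to lean on the OS file cache, as a sketch (memory-mapping is my illustration of the idea, not necessarily the only approach; the class and method names are made up):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedFileExample {
    // Sketch: a read-only memory mapping is backed by the OS page cache,
    // not the JVM heap, so the file's bytes never count against -Xmx.
    public static MappedByteBuffer mapReadOnly(String inputPath) throws Exception {
        RandomAccessFile raf = new RandomAccessFile(inputPath, "r");
        try {
            FileChannel channel = raf.getChannel();
            // The mapping remains valid even after the file is closed.
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        } finally {
            raf.close();
        }
    }
}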

Peter Lawrey
+1  A: 

Others have explained why you're running out of memory. As to how to solve this problem, I'd suggest writing a custom FilterInputStream subclass. This class would read one line at a time, append the "\r\n" characters and buffer the result. Once the line has been read by the consumer of your FilterInputStream, you'd read another line. This way you'd only ever have one line in memory at a time.

David
A: 

I also recommend checking out the Commons IO FileUtils class for this. Specifically: org.apache.commons.io.FileUtils#readFileToString. You can also specify the encoding if you know you are only using ASCII.
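
For example (illustrative only - note this still loads the whole file into memory, so it shares the heap concerns discussed in the other answers):

import java.io.File;
import java.io.IOException;
import org.apache.commons.io.FileUtils;

public class ReadFileExample {
    public static void main(String[] args) throws IOException {
        // Reads the entire file into a String using the given encoding.
        String contents = FileUtils.readFileToString(new File(args[0]), "US-ASCII");
        System.out.println(contents.length());
    }
}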

joeslice