I have a file from which I read data. All the text from this file is stored in a single String variable (a very big one). In another part of my app I then want to walk through this string and extract useful information step by step (parsing the string).

In the meantime my memory fills up, and an OutOfMemoryError keeps me from processing further. I think it would be better to process the data directly while reading the input stream from the file, but for organizational reasons I would like to pass the String to another part of my application.

What should I do to keep the memory from overflowing?

+6  A: 

You should be using a BufferedReader and processing the input as you go instead of storing it all in one large String.

If what you want to parse happens to be on one line, then StringTokenizer will work quite nicely; otherwise you have to devise a way to read the file in units that form complete statements, then apply StringTokenizer to each statement.
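
A minimal sketch of that approach, assuming the data is line-oriented; the file name data.txt and the handleToken method are made up for illustration:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.StringTokenizer;

    public class LineTokenizerDemo {
        public static void main(String[] args) throws IOException {
            // Read and parse one line at a time instead of building one huge String.
            try (BufferedReader reader = new BufferedReader(new FileReader("data.txt"))) {
                parse(reader);
            }
        }

        // The rest of the app receives the Reader (a "cursor"), not a giant String.
        static void parse(BufferedReader reader) throws IOException {
            String line;
            while ((line = reader.readLine()) != null) {
                StringTokenizer tokens = new StringTokenizer(line);
                while (tokens.hasMoreTokens()) {
                    handleToken(tokens.nextToken());
                }
            }
        }

        static void handleToken(String token) {
            // Placeholder: extract whatever information you need from each token.
        }
    }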

Zombies
+1. Anthony: the general idea is that you pass CURSORS (as in a DB). They can be Readers in the case of text, Streams in the case of bytes, iterators in the case of a sequence of items, or whatever. You can transform one type into another (transforming each item of the sequence, for example one line of the file into some domain object), but what one area of the app provides to another is a cursor: a handle to consume the input one step at a time, without any knowledge of reading files or of whatever transformation you implement in the middle.
helios
+4  A: 

If you can loosen your requirements a bit, you could implement a java.lang.CharSequence backed by your file.

CharSequence is accepted in many places in the JDK (a String is a CharSequence), so this is a good alternative to a Reader-based implementation.
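
A rough sketch of what such a wrapper could look like, using a memory-mapped file. It assumes a single-byte encoding such as ASCII and a file under 2 GB; the FileCharSequence name is made up:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    // Exposes a file as a CharSequence without loading it onto the heap.
    // Assumes a single-byte encoding; a multi-byte encoding would need
    // a real decoding strategy.
    public class FileCharSequence implements CharSequence {
        private final MappedByteBuffer buffer;
        private final int start;
        private final int end;

        public FileCharSequence(String fileName) throws IOException {
            try (RandomAccessFile file = new RandomAccessFile(fileName, "r")) {
                // The mapping stays valid after the channel is closed.
                this.buffer = file.getChannel()
                        .map(FileChannel.MapMode.READ_ONLY, 0, file.length());
            }
            this.start = 0;
            this.end = buffer.limit();
        }

        private FileCharSequence(MappedByteBuffer buffer, int start, int end) {
            this.buffer = buffer;
            this.start = start;
            this.end = end;
        }

        public int length() {
            return end - start;
        }

        public char charAt(int index) {
            // No bounds checking in this sketch.
            return (char) (buffer.get(start + index) & 0xFF);
        }

        public CharSequence subSequence(int from, int to) {
            // Shares the mapping; no copying of file contents.
            return new FileCharSequence(buffer, start + from, start + to);
        }
    }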

Thomas Jung
A: 

You must review your algorithm for dealing with large data. Either process the data chunk by chunk, or use random file access without storing the data in memory. For example, you can use StringTokenizer or StreamTokenizer, as @Zombies said. Look at parser/lexer techniques: when the parser parses some expression, it asks the lexer to read the next lexeme (token), rather than reading the whole input stream at once.
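
For illustration, a minimal sketch of that pull model using the JDK's StreamTokenizer (the file name is made up):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.StreamTokenizer;

    public class PullParserDemo {
        public static void main(String[] args) throws IOException {
            try (BufferedReader reader = new BufferedReader(new FileReader("input.txt"))) {
                StreamTokenizer lexer = new StreamTokenizer(reader);
                // The parser pulls one token at a time; the whole file is
                // never held in memory.
                while (lexer.nextToken() != StreamTokenizer.TT_EOF) {
                    switch (lexer.ttype) {
                        case StreamTokenizer.TT_NUMBER:
                            System.out.println("number: " + lexer.nval);
                            break;
                        case StreamTokenizer.TT_WORD:
                            System.out.println("word: " + lexer.sval);
                            break;
                        default:
                            System.out.println("char: " + (char) lexer.ttype);
                    }
                }
            }
        }
    }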

+1  A: 

Others have suggested reading and processing portions of your file at a time. If possible, one of those approaches would be better.

However, if this is not possible and you are able to load the String into memory initially as you indicate, but it is the later parsing of this string that creates problems, you may be able to use substrings. In Java, a substring maps on top of the original char array and only takes memory for the base object plus start and length int fields.

So, when you find a portion of the string that you want to keep separately, use something like:

String piece = largeString.substring(foundStart, foundEnd);

If you instead do this, or use code that internally does this, then memory use will increase dramatically:

new String(largeString.substring(foundStart, foundEnd));

Note that you must use String.substring() with care for this very reason. You could take a substring of a very large string and then discard your reference to the original string; the problem is that the substring still references the original large char array, and the GC will not release that array until the substring is also gone. In cases like this it's useful to actually use new String(...) to ensure the unused large array can be discarded by the GC (this is one of the few cases where you should ever use new String(...)).
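
For example (a sketch that assumes the shared-array substring behavior described above):

    // Copy out just the piece you need, then drop the big string so its
    // large backing char[] becomes eligible for garbage collection.
    String piece = new String(largeString.substring(foundStart, foundEnd));
    largeString = null;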

Another technique, if you expect to have lots of little strings around that are likely to have the same values but come from an external source (like a file), is to call .intern() after creating the new string.
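
For example (a sketch, reusing the names from above):

    // Interning deduplicates repeated values read from the file, so equal
    // strings share a single instance.
    String symbol = new String(largeString.substring(foundStart, foundEnd)).intern();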

Note: This does depend on the implementation of String, which you really shouldn't have to be aware of, but in practice for large applications you sometimes do have to rely on that knowledge. Be aware that future versions of Java may change this (in fact, Java 7 update 6 changed substring() to copy its characters, so the sharing described above no longer applies there).

Kevin Brock