I'm having some problems reading a file with Java. It is absolutely huge (2.5 GB) and increasing the memory available to the JVM doesn't help. The data is all on a single line, so I can't read it one line at a time. What I would like to do is read the file until I find a certain string, for example "<|start|>" or "<|end|>", and then print the data in between these strings so that the memory is freed and I can continue reading the rest of the file. So basically I am looking for a type of reader that starts reading at a certain start string and stops reading at a stop string. Can anyone help me?
You need to open up a Reader (e.g. a BufferedReader wrapping an InputStreamReader wrapping a FileInputStream) and read chunks at a time with read(char[], int, int) or read(char[]). It's up to you to take care of finding the token - including in the case where it starts in one chunk and ends on another. Also be aware that read() may not fill the buffer; you need to use the return value to see how much data it's actually written to the array.
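To make the chunked approach concrete, here is a minimal sketch. The class name ChunkedTokenReader, the method extract, the 8192-char chunk size, and the UTF-8 charset are all assumptions for illustration, not anything from the question. It keeps the last few characters of each chunk around in case a token is split across two reads, and it only trusts the count returned by read().

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class ChunkedTokenReader {
    /**
     * Copies the text found between start and end tokens to out,
     * writing a newline after each completed segment.
     * Assumes both tokens are non-empty and chunkSize > 0.
     */
    static void extract(Reader in, String start, String end,
                        Appendable out, int chunkSize) throws IOException {
        char[] chunk = new char[chunkSize];
        StringBuilder pending = new StringBuilder(); // chars not yet decided on
        boolean inside = false;
        int n;
        while ((n = in.read(chunk, 0, chunk.length)) != -1) {
            pending.append(chunk, 0, n); // only the first n chars are valid
            while (true) {
                String token = inside ? end : start;
                int idx = pending.indexOf(token);
                if (idx < 0) {
                    // No complete token yet. Keep the last token.length() - 1
                    // chars: they may be the head of a token that continues
                    // in the next chunk. Everything before that is safe.
                    int keep = Math.min(pending.length(), token.length() - 1);
                    int safe = pending.length() - keep;
                    if (inside) {
                        out.append(pending, 0, safe);
                    }
                    pending.delete(0, safe);
                    break;
                }
                if (inside) {
                    out.append(pending, 0, idx);
                    out.append('\n');
                }
                pending.delete(0, idx + token.length()); // drop the token itself
                inside = !inside;
            }
        }
        if (inside) {
            out.append(pending); // unterminated last segment: emit what is left
        }
    }

    public static void main(String[] args) throws IOException {
        // Charset is an assumption; use whatever the file is really encoded in.
        try (Reader in = new BufferedReader(new InputStreamReader(
                new FileInputStream(args[0]), StandardCharsets.UTF_8))) {
            extract(in, "<|start|>", "<|end|>", System.out, 8192);
        }
        System.out.flush();
    }
}
```

Memory use stays around one chunk plus one token length, regardless of how large the file or an individual segment is, because data inside a segment is written out as soon as it cannot be part of the end token.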
Try this pseudo code:
char[] start = {'<','|','s','t','a','r','t','|','>'};
char[] end   = {'<','|','e','n','d','|','>'};
boolean inside = false;
while (hasMoreInput()) {
    if (nextCharsAre(start)) {        // look ahead without consuming
        inside = true;
        skip(start.length);           // consume the start token
    } else if (nextCharsAre(end)) {   // look ahead without consuming
        inside = false;
        skip(end.length);             // consume the end token
    } else {
        char c = readChar();
        if (inside) {
            print(c);
        }
    }
}
The idea is to read until you find the start token and then raise a flag. While the flag is on, you print each character you read; when you find the end token, you clear the flag.
There should be a number of ways to code the previous pseudo-code. I'll update this answer later.
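One possible way to code the flag idea in real Java, as a rough sketch only: the class and method names (TokenFlagScanner, printBetween) are made up for this example, and instead of look-ahead it holds back the characters that could still turn out to be the beginning of a token. It assumes both tokens are non-empty.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Paths;

class TokenFlagScanner {
    /**
     * Reads one char at a time. The 'inside' flag is raised after the start
     * token and cleared after the end token; chars seen while it is raised
     * are written to out, with a newline after each completed segment.
     */
    static void printBetween(Reader in, String start, String end,
                             Appendable out) throws IOException {
        boolean inside = false;
        StringBuilder held = new StringBuilder(); // possible beginning of a token
        int r;
        while ((r = in.read()) != -1) {
            String token = inside ? end : start;
            held.append((char) r);
            // Drop chars from the front until what is held is a prefix of the
            // token again (the empty string always is, so this terminates).
            while (!token.startsWith(held.toString())) {
                if (inside) {
                    out.append(held.charAt(0)); // it was ordinary data after all
                }
                held.deleteCharAt(0);
            }
            if (held.length() == token.length()) { // whole token matched
                if (inside) {
                    out.append('\n');
                }
                inside = !inside;
                held.setLength(0);
            }
        }
        if (inside) {
            out.append(held); // unterminated last segment: emit what is left
        }
    }

    public static void main(String[] args) throws IOException {
        // Files.newBufferedReader decodes as UTF-8; that charset is an assumption.
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]))) {
            printBetween(in, "<|start|>", "<|end|>", System.out);
        }
        System.out.flush();
    }
}
```

Reading single chars is only reasonable on a buffered reader; calling read() directly on an unbuffered InputStreamReader would be much slower.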
I would have a look to see if Scanner is suitable for your data. You can use the useDelimiter method to change the patterns it uses to tokenize the input.
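A small sketch of what that could look like, assuming the end token is used as the delimiter and each token is then cut at the start string. ScannerTokenDemo and forEachSegment are hypothetical names. Note that Scanner buffers a whole token in memory, so this only helps if the individual segments (plus any junk before them) are of a manageable size.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Scanner;
import java.util.function.Consumer;
import java.util.regex.Pattern;

class ScannerTokenDemo {
    /**
     * Splits the input on the end token, then hands the part of each piece
     * that follows the start token to the handler. Pieces without a start
     * token are ignored. A final piece with a start token but no end token
     * is still passed on as a segment.
     */
    static void forEachSegment(Readable source, String start, String end,
                               Consumer<String> handler) {
        try (Scanner sc = new Scanner(source)) {
            // Pattern.quote makes the delimiter literal ('|' is a regex operator).
            sc.useDelimiter(Pattern.quote(end));
            while (sc.hasNext()) {
                String piece = sc.next();
                int idx = piece.indexOf(start);
                if (idx >= 0) {
                    handler.accept(piece.substring(idx + start.length()));
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Files.newBufferedReader decodes as UTF-8; that charset is an assumption.
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]))) {
            forEachSegment(in, "<|start|>", "<|end|>", System.out::println);
        }
    }
}
```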