I'm having trouble reading a file with Java. It is huge (2.5 GB), and adjusting my memory settings doesn't help. The data is all on a single line, so I can't read it one line at a time. What I would like to do is read the file until I find a start marker such as "<|start|>", print the data up to the matching end marker such as "<|end|>", and then discard it so that the memory is freed and I can continue reading the rest of the file. Basically, I'm looking for a type of reader that starts reading at a given start string and stops reading at a given stop string. Can anyone help me?

+8  A: 

You need to open a Reader (e.g. a BufferedReader wrapping an InputStreamReader wrapping a FileInputStream) and read a chunk at a time with read(char[], int, int) or read(char[]). It's up to you to find the token, including the case where it starts in one chunk and ends in the next. Also be aware that read() may not fill the buffer; you need to use the return value to see how many characters it actually wrote to the array.
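A minimal sketch of that chunked approach, under some assumptions: the class name `DelimitedChunkReader`, the `extract` method and its `Consumer<String>` callback are made up for illustration, and each record (the text between one start/end pair) is assumed to fit in memory. Unresolved characters are carried over in a `StringBuilder`, so a token split across two chunks is still found.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;

class DelimitedChunkReader {

    /** Passes the text between each start/end pair to sink, reading chunkSize chars at a time. */
    static void extract(Reader in, String start, String end, int chunkSize,
                        Consumer<String> sink) throws IOException {
        char[] chunk = new char[chunkSize];
        StringBuilder pending = new StringBuilder(); // chars read but not yet resolved
        boolean inside = false;
        int n;
        while ((n = in.read(chunk, 0, chunk.length)) != -1) {
            pending.append(chunk, 0, n); // only the first n chars of chunk are valid
            while (true) {
                String token = inside ? end : start;
                int idx = pending.indexOf(token);
                if (idx < 0) {
                    if (!inside) {
                        // Outside a record: drop everything except a tail that could
                        // be the beginning of a start token split across chunks.
                        int keep = Math.min(pending.length(), token.length() - 1);
                        pending.delete(0, pending.length() - keep);
                    }
                    break; // need more input
                }
                if (inside) {
                    sink.accept(pending.substring(0, idx)); // one complete record
                }
                pending.delete(0, idx + token.length()); // consume up to and including the token
                inside = !inside;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // In-memory sample; the tiny chunk size forces tokens to straddle chunks.
        Reader in = new StringReader("xx<|start|>abc<|end|>yy<|start|>de<|end|>");
        extract(in, "<|start|>", "<|end|>", 4, System.out::println);
    }
}
```

For a real file you would pass the BufferedReader/InputStreamReader/FileInputStream chain and a much larger chunk size (a few thousand chars, say). Note that this sketch rescans `pending` from the beginning on every chunk while inside a record, so very long records would call for remembering the search position.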

Jon Skeet
I get distracted for one second and Jon Skeet steals my glory :(
Anthony Forloney
A: 

Try this pseudo code:

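 char[] start = {'<','|','s','t','a','r','t','|','>'};
 char[] end   = {'<','|','e','n','d','|','>'};
 boolean inside = false;

 // Pseudocode helpers: readChar() and readChars( n ) return the char(s) at the
 // current position without advancing; skip( n ) advances by n chars.
 while( hasMoreChars() ) {
     char c = readChar();
     if( c == '<' ) {
         if( Arrays.equals( readChars( 9 ), start ) ) {
             inside = true;
             skip( 9 ); // start
             continue;
         } else if( Arrays.equals( readChars( 7 ), end ) ) {
             inside = false;
             skip( 7 ); // end
             continue;
         }
     }
     if( inside ) {
         print( c );
     }
     skip( 1 );
 }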

The idea is to read until you find the start token and then set a flag. While the flag is set you print each character, and when you find the end token you clear the flag.

There should be a number of ways to code the previous pseudo-code. I'll update this answer later.
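One possible way to code it, as a sketch rather than the definitive version: the class and method names (`FlagCopier`, `copyBetween`) are made up, and writing a newline after each record is my own choice. It reads one char at a time and keeps a small sliding window no longer than the current token, so only a handful of chars are held in memory and a '<' that is not part of a token is handled as ordinary data.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

class FlagCopier {

    /** Copies everything between start and end tokens to out, one record per line. */
    static void copyBetween(Reader in, Writer out, String start, String end)
            throws IOException {
        StringBuilder window = new StringBuilder(); // last few chars, not yet decided
        boolean inside = false;                     // the flag from the pseudocode
        int c;
        while ((c = in.read()) != -1) {
            window.append((char) c);
            String token = inside ? end : start;
            if (window.length() < token.length()) {
                continue; // not enough chars yet to compare against the token
            }
            if (token.contentEquals(window)) {
                // The window is exactly the token: consume it and flip the flag.
                window.setLength(0);
                if (inside) {
                    out.write('\n'); // end of a record
                }
                inside = !inside;
            } else {
                // The oldest char cannot begin a token, so it is plain data.
                if (inside) {
                    out.write(window.charAt(0));
                }
                window.deleteCharAt(0);
            }
        }
        if (inside) {
            out.write(window.toString()); // unterminated last record: flush what is left
        }
    }

    public static void main(String[] args) throws IOException {
        StringWriter out = new StringWriter();
        copyBetween(new StringReader("a<b<|start|>x<y<|end|>zz<|start|>q<|end|>"),
                    out, "<|start|>", "<|end|>");
        System.out.print(out);
    }
}
```

With a file, wrap the input in a BufferedReader so the single-char `read()` calls are served from its buffer, and pass a buffered Writer (or one wrapping System.out) as `out`.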

OscarRyz
There is a problem with this approach: a token may be split between two reads of the file. For example, one buffer may end with "blabla<|st" and the next may begin with "art|>", so it really won't work.
Kico Lobo
@Kico Lobo: I don't understand the problem (given < doesn't appear elsewhere in the string)...
pgras
Well, actually, it does appear elsewhere in the string.
@Kico Lobo: That's why I wrote `readChars( 9 )` and said *pseudocode*: how you read 9 chars is not described here.
OscarRyz
+2  A: 

I would have a look to see if Scanner is suitable for your data. You can use the useDelimiter method to change the patterns it uses to tokenize the input.
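A rough sketch of how that could look, assuming each record fits comfortably in memory (Scanner grows its internal buffer until it holds a whole token). The class and method names are made up. The delimiter is set to the end marker only, and everything up to and including the start marker is cut off each token:

```java
import java.io.StringReader;
import java.util.Scanner;
import java.util.function.Consumer;
import java.util.regex.Pattern;

class ScannerRecords {

    /** Hands the text between each start/end pair to sink. */
    static void forEachRecord(Readable in, String start, String end,
                              Consumer<String> sink) {
        try (Scanner sc = new Scanner(in)) {
            // Pattern.quote makes '|' literal instead of regex alternation.
            sc.useDelimiter(Pattern.quote(end));
            while (sc.hasNext()) {
                String token = sc.next();       // "junk<|start|>payload"
                int idx = token.indexOf(start);
                if (idx >= 0) {
                    sink.accept(token.substring(idx + start.length()));
                }
                // Tokens without a start marker (e.g. trailing junk) are skipped.
            }
        }
    }

    public static void main(String[] args) {
        forEachRecord(new StringReader("xx<|start|>abc<|end|>yy<|start|>de<|end|>"),
                      "<|start|>", "<|end|>", System.out::println);
    }
}
```

Keep in mind that Scanner swallows IOExceptions from the underlying source; you can check `ioException()` afterwards if that matters for your case.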

McDowell