I'm reading and processing a stream of input from a regular filehandle in Perl, which may be STDIN. (Originally I was reading from the ARGV filehandle, i.e. the while(<>) construct; see the edit below.) However, I need to analyze a significant portion of the input in order to detect which of four different but extremely similar formats it is encoded in (different ASCII encodings of FASTQ quality scores; see here). Once I've decided which format the data is in, I need to go back and parse those lines a second time to actually read the data.

So I need to read the first 500 lines or so of the stream twice. Or, to look at it another way, I need to read the first 500 lines, and then "put them back" so I can read them again. Since I may be reading from STDIN, I can't just seek back to the beginning. And the files are huge, so I can't just read everything into memory (although reading those first 500 lines into memory is ok). What's the best way to do this?

Alternatively, can I duplicate the input stream somehow?

Edit: Wait a minute. I just realized that I can't process the input as one big stream anymore, because I have to detect each file's format independently. So I can't use ARGV. The rest of the question still stands, though.

A: 

There is a CPAN module that provides an unread method for the IO::Handle class. However, its warnings make one somewhat cautious. I would evaluate its suitability carefully.

If you really only need to save away 500 lines, each reasonably short, that module might suffice; its example does use STDIN.

However, I'm nervous about magic ARGV. If your <> operator causes several distinct files to be opened and read, then I don't know that you're going to be able to back up to a different file than the one currently open.

So you might end up just writing the pushback logic yourself. Either that, or imposing some sort of restriction on ARGV processing related to multiple input files and/or the nature of STDIN.

Most of my programs with magic ARGV processing have guards at their start that read something like:

if (@ARGV == 0 && -t STDIN) {
    # pick one of the following two options:

    # opt 1: emit warning 
    warn "$0: reading stdin from /dev/tty\n";

    # opt 2: populate @ARGV
    @ARGV = grep { -f && -T } <*>;  # glob plain textfiles

}

In the second case above, where it defaults to all the plain textfiles in the current directory, one should also decide what to do if grep produces the empty list.
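For instance (a sketch only; the error message is just a placeholder), you might refuse to run rather than silently process nothing:

@ARGV = grep { -f && -T } <*>;  # glob plain textfiles
die "$0: no plain text files in the current directory\n" unless @ARGV;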

For some programs that expect or at least admit directory arguments, I'll occasionally initialize an empty @ARGV to "." instead, so that the program defaults to the process's current working directory.
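A sketch of that variant, falling back to the current directory only when no arguments were supplied:

@ARGV = ('.') unless @ARGV;   # default to the process's current working directory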

tchrist
See my edit. Cancel the ARGV.
Ryan Thompson
A: 

As you said, if the filehandle might be STDIN, you can't use seek to rewind it. But it's still pretty simple. I wouldn't bother with a module:

my @lines;

while (<$file>) {
  push @lines, $_;
  last if @lines == 500;
}

... # examine @lines to determine format

while (defined( $_ = @lines ? shift @lines : <$file> )) {
  ... # process line
}

Remember that you need an explicit defined in this case, because the special case that adds an implicit defined to some while loops doesn't apply to this more complex expression.
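If the same pattern is needed in more than one place, the buffered fallback can be wrapped in a small helper. This is only a sketch; read_line is an illustrative name, not an existing function:

sub read_line {
    my ($fh, $saved) = @_;
    return shift @$saved if @$saved;   # replay the buffered lines first
    return scalar <$fh>;               # then read from the real handle
}

while (defined(my $line = read_line($file, \@lines))) {
    # process $line as before
}

The explicit defined is still required here, for the same reason as above.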

cjm