ansaurus

Question

How can I set the file-read buffer size in Perl to optimize it for large files?

Answer 1

+2 A:

No, there isn't a way to set it (short of recompiling a modified perl), but you can read the whole file into memory and then work through it line by line:

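use File::Slurp;
# Read the entire file into a scalar in one go.
my $buffer = read_file("filename");
# Open an in-memory filehandle on that scalar and iterate over its lines.
open my $in_handle, "<", \$buffer
    or die "could not open in-memory handle: $!";
while ( my $line = readline($in_handle) ) {
    # process $line here
}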

Note that perl before 5.8 defaulted to using stdio buffers in most places (often cheating by accessing the buffers directly rather than through the stdio library), whereas 5.8 and later default to perl's own PerlIO layer system. PerlIO seems to use a 4k buffer by default, but writing a layer that makes this configurable should be trivial (once you figure out how to write a layer: see perldoc perliol).

ysth 2009-08-09 10:21:01
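
To see which I/O layers a handle actually uses, something like the following should work. This is only a minimal sketch: the temporary file and its contents are assumptions made for illustration, and the list that PerlIO::get_layers returns depends on the platform and on how perl was built.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small throwaway file (its contents are arbitrary).
my ($out, $path) = tempfile(UNLINK => 1);
print {$out} "line $_\n" for 1 .. 3;
close $out or die "could not close $path: $!";

open my $fh, '<', $path or die "could not open $path: $!";

# PerlIO::get_layers is built into perl; it returns the names of the
# layers stacked on the handle, lowest layer first.
my @layers = PerlIO::get_layers($fh);
print "layers on read handle: @layers\n";

close $fh;
```

If "perlio" shows up in the list, reads on that handle go through PerlIO's own buffer rather than stdio's.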

Answer 2

+1 A:

Warning: the following code has only been lightly tested. The code below is a first shot at a function that will let you process a file line by line (hence the function name) with a user-definable buffer size. It takes up to four arguments:

an open filehandle (default is STDIN)
a buffer size (default is 4k)
a reference to a variable to store the line in (default is $_)
an anonymous subroutine to call on the file (the default prints the line).

The arguments are positional with the exception that the last argument may always be the anonymous subroutine. Lines are auto-chomped.

Probable bugs:

may not work on systems where line feed is not the end of line character
will likely fail when combined with a lexical $_ (introduced in Perl 5.10)

You can see from an strace that it reads the file with the specified buffer size. If I like how testing goes, you may see this on CPAN soon.

#!/usr/bin/perl

use strict;
use warnings;
use Scalar::Util qw/reftype/;
use Carp;

sub line_by_line {
    local $_;
    my @args = \(
        my $fh      = \*STDIN,
        my $bufsize = 4*1024,
        my $ref     = \$_,
        my $coderef = sub { print "$_\n" },
    );
    croak "bad number of arguments" if @_ > @args;

    for my $arg_val (@_) {
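        # reftype returns undef for non-references, so guard the comparison
        if ((reftype($arg_val) || '') eq "CODE") {
            ${$args[-1]} = $arg_val;
            last;
        }
        my $arg = shift @args;
        $$arg = $arg_val;
    }

    my $buf;
    my $overflow ='';
    OUTER:
    while(sysread $fh, $buf, $bufsize) {
        # the capturing split keeps each "\n" as its own element
        my @lines = split /(\n)/, $buf;
        while (@lines) {
            my $line  = $overflow . shift @lines;
            # no separator follows: the line is incomplete, carry it over
            unless (defined $lines[0]) {
                $overflow = $line;
                next OUTER;
            }
            $overflow = shift @lines;
            if ($overflow eq "\n") {
                $overflow = "";
            } else {
                next OUTER;
            }
            $$ref = $line;
            $coderef->();
        }
    }
    # emit a final line that has no trailing newline
    if (length $overflow) {
        $$ref = $overflow;
        $coderef->();
    }
}

# buffer size comes from the command line; fall back to 4k if none is given
my $bufsize = shift || 4*1024;

open my $fh, "<", $0
    or die "could not open $0: $!";

# the coderef must come last: arguments after it are ignored
my $count = 0;
line_by_line $fh, $bufsize, sub {
    $count++ if /lines/;
};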

print "$count\n";

Chas. Owens 2009-08-09 13:55:09

I started playing with `sysread` in response to this question, but I couldn't get happy about how to parse *lines* after that. This looks promising, but I wonder if it won't still turn out to be slower than Perl's built-in implementation (buffering notwithstanding).

Telemachus 2009-08-09 13:58:31

Hey, I never claimed it was going to be *fast*, just that it would read the files with the specified buffer size. That said, I am going to benchmark it against the common idiom and the results will be part of the docs.

Chas. Owens 2009-08-09 14:33:17
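
A comparison along those lines could be sketched with the core Benchmark module. This is not the benchmark promised above, just a rough, self-contained illustration: the test file, its size, the iteration count, and the helper names count_readline and count_sysread are all assumptions, and the sysread variant is a simplified line splitter rather than the line_by_line function from the answer.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use File::Temp qw(tempfile);

# Build a throwaway test file (size and contents are arbitrary).
my ($out, $path) = tempfile(UNLINK => 1);
print {$out} "some sample text on line $_\n" for 1 .. 50_000;
close $out or die "could not close $path: $!";

# Common idiom: let perl's own buffering split the input into lines.
sub count_readline {
    open my $fh, '<', $path or die "could not open $path: $!";
    my $n = 0;
    while (my $line = <$fh>) { $n++ }
    close $fh;
    return $n;
}

# sysread with a caller-chosen buffer size, splitting lines by hand.
sub count_sysread {
    my ($bufsize) = @_;
    open my $fh, '<', $path or die "could not open $path: $!";
    my ($n, $tail) = (0, '');
    while (sysread $fh, my $buf, $bufsize) {
        $buf = $tail . $buf;
        # Text after the last newline is an incomplete line: carry it over.
        my $end = rindex $buf, "\n";
        if ($end < 0) { $tail = $buf; next }
        $tail = substr $buf, $end + 1;
        # Limit -1 keeps empty lines; the final empty field is an artifact
        # of the trailing newline, so drop it.
        my @lines = split /\n/, substr($buf, 0, $end + 1), -1;
        pop @lines;
        $n += @lines;
    }
    $n++ if length $tail;    # last line without a trailing newline
    close $fh;
    return $n;
}

# Run each variant 10 times and print a comparison table.
cmpthese(10, {
    readline    => \&count_readline,
    sysread_4k  => sub { count_sysread(4 * 1024) },
    sysread_64k => sub { count_sysread(64 * 1024) },
});
```

The numbers will vary a lot with the file, the disk cache, and the perl version, so treat any single run as a hint rather than a verdict.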

Answer 3

+5 A:

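You can affect the buffering, assuming that you're running on an O/S that supports setvbuf and that your perl makes it available. Be aware that the IO::Handle documentation warns that setvbuf is not available by default on perl 5.8.0 and later, because it is specific to the stdio library and those perls prefer PerlIO. See the documentation for IO::Handle. You don't have to explicitly create an IO::Handle object as in the documentation if you're using perl 5.10; all handles are implicitly IO::Handles since that release.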

use 5.010;
use strict;
use warnings;

use autodie;

use IO::Handle '_IOLBF';

open my $handle, '<:utf8', 'foo';

my $buffer;
$handle->setvbuf($buffer, _IOLBF, 0x10000);

while ( my $line = <$handle> ) {
    ...
}

Elliot Shank 2009-08-09 17:47:04

It would be nice to post a link to some more information about Perl 5.10 handles.

Brad Gilbert 2009-08-10 00:04:31

The only thing different from earlier versions is that handles are blessed into the IO::Handle package. That is the /only/ difference. In particular, merely opening a file doesn't mean you can invoke any methods on the handle. You have to "use IO::Handle" so that the methods get defined.

Elliot Shank 2009-08-11 02:03:53

That isn't new in 5.10; filehandles have been blessed into IO::Handle for a long time (or, for backwards compatibility, into FileHandle if that was loaded). But as Elliot says, the methods aren't defined unless you use IO::Handle.

ysth 2009-09-06 22:37:34
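
The point about loading IO::Handle can be illustrated with a short sketch. The in-memory file and its contents are assumptions chosen to keep the example self-contained, and the class name printed for the handle differs between perl versions.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Handle;    # loading this is what defines getline, opened, etc.

# An in-memory file keeps the example self-contained.
my $data = "first line\nsecond line\n";
open my $fh, '<', \$data or die "could not open in-memory file: $!";

# The class the underlying IO object is blessed into (version dependent).
print "handle class: ", ref(*{$fh}{IO}), "\n";

# Method-call syntax on a plain lexical filehandle.
print "opened: ", ($fh->opened ? "yes" : "no"), "\n";
my $first = $fh->getline;
print "first line: $first";

close $fh;
```

Keeping the explicit "use IO::Handle" is the portable choice, since it makes the methods available regardless of which class the handle is blessed into.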
