I understand that both Java and Perl try quite hard to find a one-size-fits-all default buffer size when reading in files, but I find their choices increasingly antiquated, and I'm having a problem changing the default choice when it comes to Perl.

In the case of Perl, which I believe uses 8K buffers by default (similar to Java's choice), I can't find a reference via the perldoc website's search engine (really just Google) on how to increase the default file input buffer size to, say, 64K.

From the above link, to show how 8K buffers don't scale:

If lines typically have about 60 characters each, then the 10,000-line file has about 610,000 characters in it. Reading the file line-by-line with buffering only requires 75 system calls and 75 waits for the disk, instead of 10,001.

So for a 50,000,000-line file with 60 characters per line (including the trailing newline), that's 3,000,000,000 bytes, or about 2.8GiB; with an 8K buffer, reading it takes 3,000,000,000 / 8192 ≈ 366,211 system calls. As an aside, you can confirm this behaviour by watching the disk I/O read delta in the Task Manager process list (on Windows at least; I'm sure top in *nix shows the same thing somehow too) as your Perl program takes 10 minutes to read in a text file :)

Someone asked about increasing the Perl input buffer size on perlmonks, and someone replied here that you could increase the size of "$/" and thus increase the buffer size; however, from the perldoc:

Setting $/ to a reference to an integer, scalar containing an integer, or scalar that's convertible to an integer will attempt to read records instead of lines, with the maximum record size being the referenced integer.

So I assume that this does not actually increase the buffer size that Perl uses to read ahead from the disk when using the typical:

while(<>) {
    #do something with $_ here
    ...
}

"line-by-line" idiom.

Now it could be that a different "read a record at a time and then parse it into lines" version of the above code would be faster in general, and would bypass the underlying problem with the standard idiom, namely not being able to change the default buffer size (if that's indeed impossible): you could set the "record size" to anything you wanted, parse each record into individual lines yourself, and hope that Perl does the right thing and ends up doing one system call per record. But that adds complexity, and all I really want is an easy performance gain from increasing the buffer used in the above example to a reasonably large size, say 64K, or even tuning that buffer size to the optimal size for long reads using a test script on my system, without needing extra hassle.
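
For what it's worth, a minimal sketch of that record-at-a-time approach might look like the following ($filename is a stand-in, the 64K record size is arbitrary, and there's no guarantee PerlIO turns each record into exactly one system call):

open my $fh, '<', $filename or die "could not open $filename: $!";
my $partial = '';                             # carries a partial line across records
{
    local $/ = \65536;                        # read 64K records instead of lines
    while (my $record = <$fh>) {
        $record   = $partial . $record;
        my @lines = split /\n/, $record, -1;  # -1 keeps a trailing empty field
        $partial  = pop @lines;               # partial last line (or '')
        for my $line (@lines) {
            # do something with $line here
        }
    }
}
# a final line with no trailing newline
if (length $partial) {
    # do something with $partial here
}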

Things are much better in Java as far as straightforward support for increasing the buffer size goes.

In Java, I believe the current default buffer size used by java.io.BufferedReader is also 8192 bytes, although up-to-date references in the JDK docs are equivocal; e.g., the 1.5 docs say only:

The buffer size may be specified, or the default size may be accepted. The default is large enough for most purposes.

Luckily with Java you do not have to trust the JDK developers to have made the right decision for your application and can set your own buffer size (64K in this example):

import java.io.BufferedReader;
import java.io.InputStreamReader;
[...]
reader = new BufferedReader(
        new InputStreamReader(fileInputStream, "UTF-8"), 65536);
[...]
while (true) {
    String line = reader.readLine();
    if (line == null) {
        break;
    }
    /* do something with the line here */
    foo(line);
}

There's only so much performance you can squeeze out of parsing one line at a time, even with a huge buffer and modern hardware. I'm sure there are ways to get every ounce of performance out of reading in a file by reading big many-line records, breaking each into tokens, and then doing stuff with those tokens once per record, but they add complexity and edge cases (although if there's an elegant solution in pure Java, using only the features present in JDK 1.5, that would be cool to know about). Increasing the buffer size in Perl would solve 80% of the performance problem for Perl at least, while keeping things straightforward.

My question is:

Is there a way to adjust that buffer size in Perl for the above typical "line-by-line" idiom, similar to how the buffer size was increased in the Java example?

+2  A: 

No, there's not (short of recompiling a modified perl), but you can read the whole file into memory, then work line by line from that:

use File::Slurp;
my $buffer = read_file("filename");
open my $in_handle, "<", \$buffer;
while ( my $line = readline($in_handle) ) {
    # work on $line here
}
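
If File::Slurp isn't installed, the same slurp can be done with core Perl alone; a sketch, with "filename" again standing in for the real path:

open my $raw, "<", "filename"
    or die "could not open filename: $!";
my $buffer = do { local $/; <$raw> };   # undef $/ slurps the rest of the file
close $raw;

open my $in_handle, "<", \$buffer;      # in-memory filehandle over the string
while ( my $line = readline($in_handle) ) {
    # work on $line here
}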

Note that perl before 5.8 defaulted to using stdio buffers in most places (though it often cheated and accessed the buffers directly rather than going through the stdio library), but 5.8 and later default to perl's own perlio layer system. The latter seems to use a 4k buffer by default, but writing a layer that allows configuring this should be trivial (once you figure out how to write a layer: see perldoc perliol).
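
As a rough starting point before writing a real C-level layer, a pure-Perl layer via PerlIO::via can at least demonstrate the mechanism. This is an untested sketch (the BigBuf name is made up), and whether it actually reduces system calls depends on how the layers below it do their own buffering:

package PerlIO::via::BigBuf;

# FILL is called whenever perl's read buffer for this layer is empty;
# $fh is the handle for the layer below.  Returning undef signals EOF.
sub PUSHED { my ($class) = @_; bless {}, $class }
sub FILL {
    my ($self, $fh) = @_;
    my $count = read $fh, my $chunk, 65536;   # ask for 64K at a time
    return $count ? $chunk : undef;
}

1;

# usage: open my $in, '<:via(BigBuf)', 'filename' or die $!;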

ysth
+1  A: 

Warning: the following code has only been lightly tested. It is a first shot at a function that lets you process a file line by line (hence the function name) with a user-definable buffer size. It takes up to four arguments:

  1. an open filehandle (default is STDIN)
  2. a buffer size (default is 4k)
  3. a reference to a variable to store the line in (default is $_)
  4. an anonymous subroutine to call for each line (the default prints the line).

The arguments are positional with the exception that the last argument may always be the anonymous subroutine. Lines are auto-chomped.

Probable bugs:

  • may not work on systems where line feed is not the end-of-line character (e.g., lines will keep their carriage returns on Windows, since sysread bypasses the :crlf layer)
  • will likely fail when combined with a lexical $_ (introduced in Perl 5.10)

You can see from an strace that it reads the file with the specified buffer size. If I like how testing goes, you may see this on CPAN soon.

#!/usr/bin/perl

use strict;
use warnings;
use Scalar::Util qw/reftype/;
use Carp;

sub line_by_line {
    local $_;
    # defaults, in positional order; @args holds a reference to each slot
    my @args = \(
        my $fh      = \*STDIN,
        my $bufsize = 4*1024,
        my $ref     = \$_,
        my $coderef = sub { print "$_\n" },
    );
    croak "bad number of arguments" if @_ > @args;

    for my $arg_val (@_) {
        # the code ref may always come last, no matter how many positional
        # arguments precede it (reftype returns undef for non-references,
        # hence the || '' to avoid an uninitialized-value warning)
        if ((reftype($arg_val) || '') eq "CODE") {
            ${$args[-1]} = $arg_val;
            last;
        }
        my $arg = shift @args;
        $$arg = $arg_val;
    }

    my $buf;
    my $overflow = '';
    OUTER:
    while (sysread $fh, $buf, $bufsize) {
        # keep the newlines as separate fields so a complete line can be
        # told apart from a partial one at the end of the buffer
        my @lines = split /(\n)/, $buf;
        while (@lines) {
            my $line = $overflow . shift @lines;
            unless (defined $lines[0]) {
                # the buffer ended mid-line; save the fragment for the
                # next sysread
                $overflow = $line;
                next OUTER;
            }
            $overflow = shift @lines;
            if ($overflow eq "\n") {
                $overflow = "";
            } else {
                next OUTER;
            }
            $$ref = $line;
            $coderef->();
        }
    }
    # emit a final line that had no trailing newline
    if (length $overflow) {
        $$ref = $overflow;
        $coderef->();
    }
}

# buffer size may be given on the command line (default 4K); note that the
# coderef must come after the buffer size, or the size would be ignored
my $bufsize = shift || 4*1024;

open my $fh, "<", $0
    or die "could not open $0: $!";

my $count = 0;
line_by_line $fh, $bufsize, sub {
    $count++ if /lines/;
};

print "$count\n";
Chas. Owens
I started playing with `sysread` in response to this question, but I couldn't get happy about how to parse *lines* after that. This looks promising, but I wonder if it won't still turn out to be slower than Perl's built-in implementation (buffering notwithstanding).
Telemachus
Hey, I never claimed it was going to be __fast__, just that it would read the files with the specified buffer size. That said, I am going to benchmark it against the common idiom and the results will be part of the docs.
Chas. Owens
+5  A: 

You can affect the buffering, assuming that you're running on an O/S that supports setvbuf. See the documentation for IO::Handle. You don't have to explicitly create an IO::Handle object as in the documentation if you're using perl 5.10; all handles are implicitly IO::Handles since that release.

use 5.010;
use strict;
use warnings;

use autodie;

use IO::Handle '_IOLBF';

open my $handle, '<:utf8', 'foo';

my $buffer;
$handle->setvbuf($buffer, _IOLBF, 0x10000);

while ( my $line = <$handle> ) {
    # do something with $line here
}
Elliot Shank
It would be nice to post a link to some more information about Perl 5.10 handles.
Brad Gilbert
The only thing different from earlier versions is that handles are blessed into the IO::Handle package. That is the /only/ difference. In particular, merely opening a file doesn't mean you can invoke any methods on the handle. You have to "use IO::Handle" so that the methods get defined.
Elliot Shank
That isn't new in 5.10; filehandles have been blessed into IO::Handle for a long time (or, for backwards compatibility, into FileHandle if that was loaded). But as Elliot says, the methods aren't defined unless you use IO::Handle.
ysth