In perlfaq5, there's an answer for How do I count the number of lines in a file?. The current answer suggests a sysread and a tr/\n//. I wanted to try a few other things to see how much faster tr/\n// would be, and also try it against files with different average line lengths. I created a benchmark to try various ways to do it. I'm running this on Mac OS X 10.5.8 and Perl 5.10.1 on a MacBook Air:

  • Shelling out to wc (fastest except for short lines)
  • tr/\n// (next fastest, except for long average line lengths)
  • s/\n//g (usually speedy)
  • while( <$fh> ) { $count++ } (almost always a slow poke, except when tr/// bogs down)
  • 1 while( <$fh> ); $. (virtually tied with the previous one)

Let's ignore that wc, which even with all the IPC stuff really turns in some attractive numbers.
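(As a side note, the backticks call in the benchmark below passes the filename through a shell; roughly the same number can be had without one. This is only a sketch, assuming $file holds the filename:)

# Sketch: ask wc -l for the count without involving a shell.
open my $wc, '-|', 'wc', '-l', $file or die "Could not run wc: $!";
my( $lines ) = <$wc> =~ /(\d+)/;
close $wc;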

At first blush, it looks like the tr/\n// is very good when the line lengths are small (say, 100 characters), but its performance drops off when they get large (1,000 characters in a line). The longer the lines get, the worse tr/\n// does. Is there something wrong with my benchmark, or is there something else going on in the internals that makes tr/// degrade? Why doesn't s/// degrade similarly?

First, the results:

                         Rate very_long_lines-tr very_long_lines-$count very_long_lines-$. very_long_lines-s very_long_lines-wc
very_long_lines-tr     1.60/s                 --                   -10%               -12%              -39%               -72%
very_long_lines-$count 1.78/s                11%                     --                -2%              -32%               -69%
very_long_lines-$.     1.82/s                13%                     2%                 --              -31%               -68%
very_long_lines-s      2.64/s                64%                    48%                45%                --               -54%
very_long_lines-wc     5.67/s               253%                   218%               212%              115%                 --
                    Rate long_lines-tr long_lines-$count long_lines-$. long_lines-s long_lines-wc
long_lines-tr     9.56/s            --               -5%           -7%         -30%          -63%
long_lines-$count 10.0/s            5%                --           -2%         -27%          -61%
long_lines-$.     10.2/s            7%                2%            --         -25%          -60%
long_lines-s      13.6/s           43%               36%           33%           --          -47%
long_lines-wc     25.6/s          168%              156%          150%          88%            --
                     Rate short_lines-$count short_lines-s short_lines-$. short_lines-wc short_lines-tr
short_lines-$count 60.2/s                 --           -7%           -11%           -34%           -42%
short_lines-s      64.5/s                 7%            --            -5%           -30%           -38%
short_lines-$.     67.6/s                12%            5%             --           -26%           -35%
short_lines-wc     91.7/s                52%           42%            36%             --           -12%
short_lines-tr      104/s                73%           61%            54%            14%             --
                      Rate varied_lines-$count varied_lines-s varied_lines-$. varied_lines-tr varied_lines-wc
varied_lines-$count 48.8/s                  --            -6%             -8%            -29%            -36%
varied_lines-s      51.8/s                  6%             --             -2%            -24%            -32%
varied_lines-$.     52.9/s                  8%             2%              --            -23%            -30%
varied_lines-tr     68.5/s                 40%            32%             29%              --            -10%
varied_lines-wc     75.8/s                 55%            46%             43%             11%              --

Here's the benchmark. I do have a control in there, but it's so fast I just don't bother with it. The first time you run it, the benchmark creates the test files and prints some stats about their line lengths:

use strict;
use warnings;
use Benchmark qw(cmpthese);
use Statistics::Descriptive;

my @files = create_files();

open my( $outfh ), '>', 'bench-out' or die "Could not open bench-out: $!\n";

foreach my $file ( @files )
    {
    cmpthese(
        100, {
#              "$file-io-control" => sub { 
#                   open my( $fh ), '<', $file; 
#                   print "Control found 99999 lines\n";
#                   },
               "$file-\$count" => sub { 
                    open my( $fh ), '<', $file; 
                    my $count = 0;
                    while(<$fh>) { $count++ } 
                    print $outfh "\$count found $count lines\n";
                    },
               "$file-\$."     => sub { 
                    open my( $fh ), '<', $file; 
                    1 while(<$fh>); 
                    print $outfh "\$. found $. lines\n";
                    },
               "$file-tr"      => sub { 
                    open my( $fh ), '<', $file; 
                    my $lines = 0;
                    my $buffer;
                    while (sysread $fh, $buffer, 4096) {
                        $lines += ($buffer =~ tr/\n//);
                        }
                    print $outfh "tr found $lines lines \n";
                    },
               "$file-s"       => sub { 
                    open my( $fh ), '<', $file; 
                    my $lines = 0;
                    my $buffer;
                    while (sysread $fh, $buffer, 4096) {
                        $lines += ($buffer =~ s/\n//g);
                        }
                    print $outfh "s found $lines line\n";
                    },
               "$file-wc"       => sub { 
                    my $lines = `wc -l $file`;   # output is "<count> <filename>"
                    chomp( $lines );
                    print $outfh "wc found $lines lines\n";
                    },
            }
        );
    }

sub create_files
    {
    my @names;
    my @files = (
        [ qw( very_long_lines 10000  4000 5000 ) ],
        [ qw( long_lines   10000 700 800 ) ],
        [ qw( short_lines  10000  60  80 ) ],
        [ qw( varied_lines 10000  10 200 ) ],
        );

    foreach my $tuple ( @files )
        {
        push @names, $tuple->[0];
        next if -e $tuple->[0];
        my $stats = create_file( @$tuple );
        printf "%10s: %5.2f  %5.f \n", $tuple->[0], $stats->mean, sqrt( $stats->variance );
        }

    return @names;
    }


sub create_file
    {
    my( $name, $lines, $min, $max ) = @_;

    my $stats = Statistics::Descriptive::Full->new();

    open my( $fh ), '>', $name or die "Could not open $name: $!\n";

    foreach ( 1 .. $lines )
        {
        my $line_length = $min + int rand( $max - $min );
        $stats->add_data( $line_length );
        print $fh 'a' x $line_length, "\n";
        }

    return $stats;
    }
A: 

Your very long lines are about 65 times longer than your short ones, and your numbers indicate that tr/\n// runs about 65 times slower. This is as expected.
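(Working from the file parameters in the question: the short lines average roughly (60 + 80) / 2 = 70 characters and the very long lines roughly (4000 + 5000) / 2 = 4500, a ratio of about 64; the measured tr rates, 104/s versus 1.60/s, give a ratio of about 65.)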

wc evidently scales better for long lines. I don't really know why; perhaps because it is tuned to just count newlines, especially when you use the -l option.

Marcelo Cantos
I'd expect everything to slow down the same way, though, and that's not the case. Line length shouldn't really matter in the tr case, because it reads in 4096-byte chunks and has to examine the entire string either way. How do you explain the other results?
brian d foy
Since you read the file immediately after writing it, I/O should be pretty much irrelevant in the above timings. The fact that you have to examine the entire string is precisely the reason performance should be inversely proportional to line length (assuming the rate in the results table refers to lines rather than bytes).
Marcelo Cantos
@Marcelo: no, the rate refers to number of times the subroutine ran per second, so how many times the file had its newlines counted. Remember, though, that tr/// should almost always be examining chunks of 4096 characters, so it shouldn't care about line length.
brian d foy
+9  A: 

I wondered whether the benchmarks we've been using have too many moving parts: we are crunching data files of different sizes, using different line lengths, and trying to gauge the speed of tr relative to its competitors -- with an underlying (but untested) assumption that tr is the method whose performance varies with line length.

Also, as brian has pointed out in a few comments, we are feeding tr buffers of data that are always the same size (4096 bytes). If any of the methods should be insensitive to line size, it should be tr.

And then it struck me: what if tr were the stable reference point and the other methods were the ones varying with line size? When you look out your spaceship window, is it you or that Klingon bird-of-prey that's moving?

So I developed a benchmark that held the size of the data files constant: line length varies, but the total number of bytes stays the same. As the results show:

  • tr is the approach least sensitive to variation in line length. Since the total N of bytes processed is constant for all three line lengths tested (short, medium, long), this means that tr is quite efficient at editing the string it is given. Even though the short-line data file requires many more edits, the tr approach is able to crunch the data file almost as fast as it handles the long-line file.
  • The methods that rely on <> speed up as the lines become longer, although at a diminishing rate. This makes sense: since each call to <> requires some work, it should be slower to process a given N of bytes using shorter lines (at least over the range tested).
  • The s/// approach is also sensitive to line length. Like tr, this approach works by editing the string it is given. Again, shorter line length means more edits. Apparently, the ability of s/// to make such edits is much less efficient than that of tr.

Here are the results on Solaris with Perl 5.8.8:

#   ln = $.      <>, then check $.
#   nn = $n      <>, counting lines
#   tr = tr///   using sysread
#   ss = s///    using sysread

#   S = short lines  (50)
#   M = medium lines (500)
#   L = long lines   (5000)

       Rate nn-S
nn-S 1.66/s   --
ln-S 1.81/s   9%
ss-S 2.45/s  48%
nn-M 4.02/s 142%
ln-M 4.07/s 145%
ln-L 4.65/s 180%
nn-L 4.65/s 180%
ss-M 5.85/s 252%
ss-L 7.04/s 324%
tr-S 7.30/s 339%    # tr
tr-L 7.63/s 360%    # tr
tr-M 7.69/s 363%    # tr

The results on Windows with ActiveState Perl 5.10.0 were roughly comparable.

Finally, the code:

use strict;
use warnings;
use Set::CrossProduct;
use Benchmark qw(cmpthese);

# Args: file size (in million bytes)
#       N of benchmark iterations
#       true/false (whether to regenerate files)
#
# My results were run with 50 10 1
main(@ARGV);

sub main {
    my ($file_size, $benchmark_n, $regenerate) = @_;
    $file_size *= 1000000;
    my @file_names = create_files($file_size, $regenerate);
    my %methods = (
        ln => \&method_ln,  # $.
        nn => \&method_nn,  # $n
        tr => \&method_tr,  # tr///
        ss => \&method_ss,  # s///
    );
    my $combo_iter = Set::CrossProduct->new([ [keys %methods], \@file_names ]);
    open my $log_fh, '>', 'log.txt';
    my %benchmark_args = map {
        my ($m, $f) = @$_;
        "$m-$f" => sub { $methods{$m}->($f, $log_fh) }
    } $combo_iter->combinations;
    cmpthese($benchmark_n, \%benchmark_args);
    close $log_fh;
}

sub create_files {
    my ($file_size, $regenerate) = @_;
    my %line_lengths = (
        S =>    50,
        M =>   500,
        L =>  5000,
    );
    for my $f (keys %line_lengths){
        next if -f $f and not $regenerate;
        create_file($f, $line_lengths{$f}, $file_size);
    }
    return keys %line_lengths;
}

sub create_file {
    my ($file_name, $line_length, $file_size) = @_;
    my $n_lines = int($file_size / $line_length);
    warn "Generating $file_name with $n_lines lines\n";
    my $line = 'a' x ($line_length - 1);
    chop $line if $^O eq 'MSWin32';  # presumably so CRLF translation doesn't add a byte per line
    open(my $fh, '>', $file_name) or die $!;
    print $fh $line, "\n" for 1 .. $n_lines;
    close $fh;
}

sub method_nn {
    my ($data_file, $log_fh) = @_;
    open my $data_fh, '<', $data_file;
    my $n = 0;
    $n ++ while <$data_fh>;
    print $log_fh "$data_file \$n $n\n";
    close $data_fh;
}

sub method_ln {
    my ($data_file, $log_fh) = @_;
    open my $data_fh, '<', $data_file;
    1 while <$data_fh>;
    print $log_fh "$data_file \$. $.\n";
    close $data_fh;
}

sub method_tr {
    my ($data_file, $log_fh) = @_;
    open my $data_fh, '<', $data_file;
    my $n = 0;
    my $buffer;
    while (sysread $data_fh, $buffer, 4096) {
        $n += ($buffer =~ tr/\n//);
    }
    print $log_fh "$data_file tr $n\n";
    close $data_fh;
}

sub method_ss {
    my ($data_file, $log_fh) = @_;
    open my $data_fh, '<', $data_file;
    my $n = 0;
    my $buffer;
    while (sysread $data_fh, $buffer, 4096) {
        $n += ($buffer =~ s/\n//g);
    }
    print $log_fh "$data_file s/ $n\n";
    close $data_fh;
}

Update in response to Brad's comment. I tried all three variants, and they behaved roughly like s/\n//g -- slower for the data files with shorter lines (with the additional qualification that s/(\n)/$1/ was even slower than the others). The interesting part was that m/\n/g was basically the same speed as s/\n//g, suggesting that the slowness of the regex approach (both s/// and m//) does not hinge directly on the matter of editing the string.
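To poke at that point in isolation, here is a minimal sketch (mine, not FM's test code; the buffer size and line length are arbitrary) comparing the three counting operations on a single in-memory buffer, with file I/O removed entirely:

use strict;
use warnings;
use Benchmark qw(cmpthese);

# One in-memory buffer of short lines: about 4,000 bytes, 80 newlines.
my $buffer = ( 'a' x 49 . "\n" ) x 80;

# Each sub copies the buffer first, so s/// shrinking its copy
# doesn't affect later iterations.
cmpthese( -2, {
    'tr' => sub { my $s = $buffer; my $n = ( $s =~ tr/\n// )  },
    's'  => sub { my $s = $buffer; my $n = ( $s =~ s/\n//g )  },
    'm'  => sub { my $s = $buffer; my $n = () = $s =~ m/\n/g  },
    } );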

FM
I think this is probably pretty close to what might be going on. When I get some more free time, I want to tweak some more inputs to tease out some of the variables. For instance, just check tr/// against itself with different numbers of substitutions.
brian d foy
@brian I'll be interested to learn what you find. In a sense, my benchmarks (`tr-S`, `tr-M`, and `tr-L`) already test what you are proposing, if I understand you correctly. All three benchmarks process the same number of bytes, but `tr-S` makes 100 times more edits than `tr-L`; nonetheless, `tr-S` processes the file almost as rapidly.
FM
The `s///` version might be faster if it didn't change the length of the string: `s/(\n)/$1/`, `s/(?>\n)//`, `m/\n/g`.
Brad Gilbert
+2  A: 

I'm also seeing tr/// get relatively slower as the line lengths increase, although the effect isn't as dramatic. These results are from ActivePerl 5.10.1 (32-bit) on Windows 7 x64. I also got "too few iterations for a reliable count" warnings at 100 iterations, so I bumped the count up to 500.

        VL: 4501.06    288
        LO: 749.25     29
        SH: 69.38      6
        VA: 104.66     55
            Rate VL-$count     VL-$.     VL-tr      VL-s     VL-wc
VL-$count 2.82/s        --       -0%      -52%      -56%      -99%
VL-$.     2.83/s        0%        --      -51%      -56%      -99%
VL-tr     5.83/s      107%      106%        --      -10%      -99%
VL-s      6.45/s      129%      128%       11%        --      -99%
VL-wc      501/s    17655%    17602%     8490%     7656%        --
            Rate LO-$count     LO-$.      LO-s     LO-tr     LO-wc
LO-$count 16.5/s        --       -1%      -50%      -51%      -97%
LO-$.     16.8/s        1%        --      -50%      -51%      -97%
LO-s      33.2/s      101%       98%        --       -3%      -94%
LO-tr     34.1/s      106%      103%        3%        --      -94%
LO-wc      583/s     3424%     3374%     1655%     1609%        --
            Rate SH-$count     SH-$.      SH-s     SH-tr     SH-wc
SH-$count  120/s        --       -7%      -31%      -67%      -81%
SH-$.      129/s        7%        --      -26%      -65%      -80%
SH-s       174/s       45%       35%        --      -52%      -73%
SH-tr      364/s      202%      182%      109%        --      -43%
SH-wc      642/s      433%      397%      269%       76%        --
            Rate VA-$count     VA-$.      VA-s     VA-tr     VA-wc
VA-$count 92.6/s        --       -5%      -36%      -63%      -79%
VA-$.     97.4/s        5%        --      -33%      -61%      -78%
VA-s       146/s       57%       50%        --      -42%      -67%
VA-tr      252/s      172%      159%       73%        --      -43%
VA-wc      439/s      374%      351%      201%       74%        --

Edit: I did a revised benchmark to compare the rates for different line lengths. It clearly shows that tr/// starts out with a big advantage for short lines that rapidly disappears as the lines grow longer. As for why this happens, I can only speculate that tr/// is optimized for short strings.

[Chart: line count rate comparison]
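One way to probe that speculation would be to take the file reading out of it and time tr/\n// alone on same-sized buffers that differ only in how many newlines they contain -- roughly what the sysread loop hands it. A sketch (mine, not part of the benchmark above; the line lengths are arbitrary):

use strict;
use warnings;
use Benchmark qw(cmpthese);

# 4096-byte buffers with different newline densities:
# 64 newlines, 8 newlines, and 1 newline respectively.
my %buffers = map {
    my $len = $_;
    ( "len-$len" => ( 'a' x ( $len - 1 ) . "\n" ) x int( 4096 / $len ) );
    } 64, 512, 4096;

cmpthese( -2, {
    map {
        my $buf = $buffers{$_};
        ( $_ => sub { my $n = ( $buf =~ tr/\n// ) } );
        } keys %buffers
    } );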

Michael Carman
Remember that tr/// should be operating on same-sized strings, since it's getting its input from sysread in 4096-byte chunks; it's just that the lines within that string are different lengths.
brian d foy