ansaurus

Question

When accessing individual characters in a string in Perl, is substr or splitting to an array faster?

Answer 1

+4 A:

Here is what I would do instead of first trying to choose between substr and split:

#!/usr/bin/perl

use strict; use warnings;

my %dist;
while ( my $s = <> ) {
    while ( $s =~ /(.)/g ) {
        ++ $dist{ pos($s) }{ $1 };
    }
}

Update:

My curiosity got the best of me. Here is a benchmark:

#!/usr/bin/perl

use strict; use warnings;
use Benchmark qw( cmpthese );

my @chars = qw(A C G T);
my @to_split = my @to_substr = my @to_match = map {
    join '', map $chars[rand @chars], 1 .. 100
} 1 .. 1_000;

cmpthese -1, {
    'split'  => \&bench_split,
    'substr' => \&bench_substr,
    'match'  => \&bench_match,
};

sub bench_split {
    my %dist;
    for my $s ( @to_split ) {
        my @s = split //, $s;
        for my $i ( 0 .. $#s ) {
            ++ $dist{ $i }{ $s[$i] };
        }
    }
}

sub bench_substr {
    my %dist;
    for my $s ( @to_substr ) {
        my $u = length($s) - 1;
        for my $i (0 .. $u) {
            ++ $dist{ $i }{ substr($s, $i, 1) };
        }
    }
}

sub bench_match {
    my %dist;
    for my $s ( @to_match ) {
        while ( $s =~ /(.)/g ) {
            ++ $dist{ pos($s) }{ $1 };
        }
    }
}

Output:

         Rate  split  match substr
split  4.93/s     --   -31%   -65%
match  7.11/s    44%     --   -49%
substr 14.0/s   184%    97%     --

Sinan Ünür 2010-10-21 00:24:07

I feel like a nerd for viewing this answer as a love letter for Perl (TMTOWTDI). :)

Ether 2010-10-21 00:28:21

Answer 2

+8 A:

It really depends on exactly what you're doing with your data -- but hey, you're headed the right way with your last question! Don't guess, benchmark.

Perl provides the Benchmark module for exactly this kind of thing, and using it is really pretty straightforward. Here's a little sample code to get started with:

#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $dna;
$dna .= [qw(G A T C)]->[rand 4] for 1 .. 100;

sub frequency_substr {
  my $length = length $dna;
  my %hist;

  for my $pos (0 .. $length) {
    $hist{$pos}{substr $dna, $pos, 1} ++;
  }

  \%hist;
}

sub frequency_split {
  my %hist;
  my $pos = 0;
  for my $char (split //, $dna) {
    $hist{$pos ++}{$char} ++;
  }

  \%hist;
}

sub frequency_regmatch {
  my %hist;

  while ($dna =~ /(.)/g) {
    $hist{pos($dna)}{$1} ++;
  }

  \%hist;
}


cmpthese(-5, # Run each for at least 5 seconds
  { 
    substr => \&frequency_substr,
    split => \&frequency_split,
    regex => \&frequency_regmatch
  }
);

And a sample result:

         Rate  regex  split substr
regex  6254/s     --   -26%   -32%
split  8421/s    35%     --    -9%
substr 9240/s    48%    10%     --

Turns out substr is surprisingly fast. :)

hobbs 2010-10-21 00:30:00

Yes, my benchmark also shows substr being the winner after 1 million runs of each method. As a bonus, I got some household chores done while it was running.

Ryan Thompson 2010-10-21 00:39:50

Dont't forget `vec`: `$hist{$_}{vec $dna, $_, 8}++ for 0 .. $length` => -3% of `substr` in my benchmark.

Pedro Silva 2010-10-21 06:50:14

Interestingly, `unpack 'C*'` is actually slower than `split` (-15% than `substr`)---I wonder why.

Pedro Silva 2010-10-21 07:02:39

Answer 3

+3 A:

I have an example in Mastering Perl dealing with this problem. Do you want to create a bunch of individual scalars, each of which carries around the memory overhead of a Perl scalar, or store everything in a single string to reduce memory but maybe do more work. You say that you have a lot of these, so leaving them as single strings might work out much better for you if you are worried about memory.

Mastering Perl also has a couple chapters dealing with benchmarking and profiling, if you're curious about those.

Ether says to get it working first and worry about the rest later. Part of that is hiding the operations behind a task-oriented interface. A nice object-oriented module can do that for you. If you don't like the implmentation, you change it. However, the programs at the higher level don't have to change because the interface stays the same.

brian d foy 2010-10-21 02:10:22

The benchmark scripts more or less got the right idea about how I need to use the characters: as indices in a hash. I wrote my own benchmark script, but everyone here beat me to it. I got the same results. Memory is not an issue because I process the strings one at a time, and each is only 100 chars.

Ryan Thompson 2010-10-21 05:43:00

ansaurus

tags:

views:

answers:

When accessing individual characters in a string in Perl, is substr or splitting to an array faster?

Update:

related questions