views:

84

answers:

3

I'm writing a Perl script in which I need to loop over each character of a string. There's a lot of strings, and each is 100 characters long (they're short DNA sequences, in case you're wondering).

So, is it faster to use substr to extract each character one at a time, or is it faster to split the string into an array and then iterate over the array?

While I'm waiting for an answer, I suppose I'll go read up on how to benchmark things in Perl.

+4  A: 

Here is what I would do instead of first trying to choose between substr and split:

#!/usr/bin/perl

use strict; use warnings;

my %dist;
while ( my $s = <> ) {
    while ( $s =~ /(.)/g ) {
        ++ $dist{ pos($s) }{ $1 };
    }
}

Update:

My curiosity got the best of me. Here is a benchmark:

#!/usr/bin/perl

use strict; use warnings;
use Benchmark qw( cmpthese );

my @chars = qw(A C G T);
my @to_split = my @to_substr = my @to_match = map {
    join '', map $chars[rand @chars], 1 .. 100
} 1 .. 1_000;

cmpthese -1, {
    'split'  => \&bench_split,
    'substr' => \&bench_substr,
    'match'  => \&bench_match,
};

sub bench_split {
    my %dist;
    for my $s ( @to_split ) {
        my @s = split //, $s;
        for my $i ( 0 .. $#s ) {
            ++ $dist{ $i }{ $s[$i] };
        }
    }
}

sub bench_substr {
    my %dist;
    for my $s ( @to_substr ) {
        my $u = length($s) - 1;
        for my $i (0 .. $u) {
            ++ $dist{ $i }{ substr($s, $i, 1) };
        }
    }
}

sub bench_match {
    my %dist;
    for my $s ( @to_match ) {
        while ( $s =~ /(.)/g ) {
            ++ $dist{ pos($s) }{ $1 };
        }
    }
}

Output:

         Rate  split  match substr
split  4.93/s     --   -31%   -65%
match  7.11/s    44%     --   -49%
substr 14.0/s   184%    97%     --
Sinan Ünür
I feel like a nerd for viewing this answer as a love letter for Perl (TMTOWTDI). :)
Ether
+8  A: 

It really depends on exactly what you're doing with your data -- but hey, you're headed the right way with your last question! Don't guess, benchmark.

Perl provides the Benchmark module for exactly this kind of thing, and using it is really pretty straightforward. Here's a little sample code to get started with:

#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $dna;
$dna .= [qw(G A T C)]->[rand 4] for 1 .. 100;

sub frequency_substr {
  my $length = length $dna;
  my %hist;

  for my $pos (0 .. $length) {
    $hist{$pos}{substr $dna, $pos, 1} ++;
  }

  \%hist;
}

sub frequency_split {
  my %hist;
  my $pos = 0;
  for my $char (split //, $dna) {
    $hist{$pos ++}{$char} ++;
  }

  \%hist;
}

sub frequency_regmatch {
  my %hist;

  while ($dna =~ /(.)/g) {
    $hist{pos($dna)}{$1} ++;
  }

  \%hist;
}


cmpthese(-5, # Run each for at least 5 seconds
  { 
    substr => \&frequency_substr,
    split => \&frequency_split,
    regex => \&frequency_regmatch
  }
);

And a sample result:

         Rate  regex  split substr
regex  6254/s     --   -26%   -32%
split  8421/s    35%     --    -9%
substr 9240/s    48%    10%     --

Turns out substr is surprisingly fast. :)

hobbs
Yes, my benchmark also shows substr being the winner after 1 million runs of each method. As a bonus, I got some household chores done while it was running.
Ryan Thompson
Dont't forget `vec`: `$hist{$_}{vec $dna, $_, 8}++ for 0 .. $length` => -3% of `substr` in my benchmark.
Pedro Silva
Interestingly, `unpack 'C*'` is actually slower than `split` (-15% than `substr`)---I wonder why.
Pedro Silva
+3  A: 

I have an example in Mastering Perl dealing with this problem. Do you want to create a bunch of individual scalars, each of which carries around the memory overhead of a Perl scalar, or store everything in a single string to reduce memory but maybe do more work. You say that you have a lot of these, so leaving them as single strings might work out much better for you if you are worried about memory.

Mastering Perl also has a couple chapters dealing with benchmarking and profiling, if you're curious about those.

Ether says to get it working first and worry about the rest later. Part of that is hiding the operations behind a task-oriented interface. A nice object-oriented module can do that for you. If you don't like the implmentation, you change it. However, the programs at the higher level don't have to change because the interface stays the same.

brian d foy
The benchmark scripts more or less got the right idea about how I need to use the characters: as indices in a hash. I wrote my own benchmark script, but everyone here beat me to it. I got the same results. Memory is not an issue because I process the strings one at a time, and each is only 100 chars.
Ryan Thompson