tags:

views:

127

answers:

4

I am trying to perform some composition-based filtering on a large collection of strings (protein sequences).
I wrote a group of three subroutines in order to take care of it, but I'm running into trouble in two ways - one minor, one major. The minor trouble is that when I use List::MoreUtils 'pairwise' I get warnings about using $a and $b only once and them being uninitialized. But I believe I'm calling this method properly (based on CPAN's entry for it and some examples from the web).
The major trouble is an error "Can't use string ("17/32") as HASH ref while "strict refs" in use..."

It seems like this can only happen if the foreach loop in &comp is giving the hash values as a string instead of evaluating the division operation. I'm sure I've made a rookie mistake, but can't find the answer on the web. The first time I even looked at perl code was last Wednesday...

use List::Util;
use List::MoreUtils;
my @alphabet = (
 'A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I',
 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V'
);
my $gapchr = '-';
# Takes a sequence and returns letter => occurrence count pairs as hash.
sub getcounts {
 my %counts = ();
 foreach my $chr (@alphabet) {
  $counts{$chr} = ( $_[0] =~ tr/$chr/$chr/ );
 }
 $counts{'gap'} = ( $_[0] =~ tr/$gapchr/$gapchr/ );
 return %counts;
}

# Takes a sequence and returns letter => fractional composition pairs as a hash.
sub comp {
 my %comp = getcounts( $_[0] );
 foreach my $chr (@alphabet) {
  $comp{$chr} = $comp{$chr} / ( length( $_[0] ) - $comp{'gap'} );
 }
 return %comp;
}

# Takes two sequences and returns a measure of the composition difference between them, as a scalar.
# Originally all on one line but it was unreadable.

sub dcomp {
 my @dcomp = pairwise { $a - $b } @{ values( %{ comp( $_[0] ) } ) }, @{ values( %{ comp( $_[1] ) } ) };
 @dcomp = apply { $_ ** 2 } @dcomp;
 my $dcomp = sqrt( sum( 0, @dcomp ) ) / 20;
 return $dcomp;
}

Much appreciation for any answers or advice!

+1  A: 

re: Minor problem

This is fine and a common problem with (some) of the List::Util and List::MoreUtils modules.

One way to remove warnings is simply to declare those special variables in advance like so:

our ($a, $b);

Another would be to precede the pairwise with:

no warnings 'once';

See perlvar for more info about $a and $b

/I3az/

draegtun
Would you make that declaration in the sub block or in the, uh, global namespace (I'm not so good on the correct terms)?Seems to me like the former if I use the method once and the latter if I use it in many places?
reve_etrange
draegtun
+2  A: 

%{ $foo } will treat $foo as a hash reference and dereference it; similarly, @{} will dereference array references. Since comp returns a hash as a list (hashes becomes lists when passed to and from functions) and not a hash reference, the %{} is wrong. You could potentially leave off the %{}, but values is a special form and needs a hash, not a hash passed as a list. To pass the result of comp to values, comp needs to return a hash ref that then gets dereferenced.

There's another problem with your dcomp, namely that the order of values (as the documentation says) "are returned in an apparently random order", so the values passed to the pairwise block aren't necessarily for the same character. Instead of values, you can use hash slices. We're now back to comp returning a hash (as a list).

sub dcomp {
    my %ahisto = comp($_[0]);
    my %bhisto = comp($_[1]);
    my @keys = uniq keys %ahisto, keys %bhisto;
    my @dcomp = pairwise { $a - $b } , @ahisto{@keys}, @bhisto{@keys};
    @dcomp = apply { $_ ** 2 } @dcomp;
    my $dcomp = sqrt( sum( 0, @dcomp ) ) / 20;
    return $dcomp;
}

This doesn't address what happens if a character appears in only one of $_[0] and $_[1].

uniq left as an exercise for the reader.

outis
I think the hash slices have to be enclosed by @{ [ ] }?Now I know about the danger of values(), thanks.
reve_etrange
+4  A: 

There are a few bugs in your code. First, note from perldoc perlop:

Because the transliteration table is built at compile time, neither the SEARCHLIST nor the REPLACEMENTLIST are subjected to double quote interpolation.

So your counting method is incorrect. I also believe you are misusing pairwise. It is hard to evaluate what constitutes correct usage because you do not give examples of what output you should get with some simple inputs.

In any case, I would re-write this script as (there are some debugging statements sprinkled in):

#!/usr/bin/perl

use List::AllUtils qw( sum );
use YAML;

our ($a, $b);
my @alphabet = ('A' .. 'Z');
my $gap = '-';

my $seq1 = 'ABCD-EFGH--MNOP';
my $seq2 = 'EFGH-ZZZH-KLMN';

print composition_difference($seq1, $seq2);

sub getcounts {
    my ($seq) = @_;
    my %counts;
    my $pattern = join '|', @alphabet, $gap;
    $counts{$1} ++ while $seq =~ /($pattern)/g;
    warn Dump \%counts;
    return \%counts;
}

sub fractional_composition_pairs {
    my ($seq) = @_;
    my $comp = getcounts( $seq );
    my $denom = length $seq - $comp->{$gap};
    $comp->{$_} /= $denom for @alphabet;
    warn Dump $comp;
    return $comp;
}

sub composition_difference {
    # I think your use of pairwise in the original script
    # is very buggy unless every sequence always contains
    # all the letters in the alphabet and the gap character.
    # Is the gap character supposed to factor in the computations here?

    my ($comp1, $comp2) = map { fractional_composition_pairs($_) } @_;
    my %union;
    ++ $union{$_} for (keys %$comp1, keys %$comp2);

    my $dcomp;
    {
        no warnings 'uninitialized';
        $dcomp = sum map {
            ($comp1->{$_} - $comp2->{$_}) ** 2
        } keys %union;
    }

    return sqrt( $dcomp ) / 20; # where did 20 come from?
}
Sinan Ünür
Without going into the biology (doesn't seem like the place!), the sequence gaps don't constitute differences in the "composition" of the sequences, because they aren't a letter in the alphabet (of the 20 amino acids). The 20 is to (arbitrarily) normalize $dcomp from 0 to 1. I get map() now...thanks!
reve_etrange
+2  A: 

Just going through the code that you provided, this is how I would have written it. I don't know if this will work the way you wanted it to work though.

use strict;
use warnings;
our( $a, $b );

use List::Util;
use List::MoreUtils;

my @alphabet = split '', 'ARNDCQEGHILKMFPSTWYV';
my $gapchr = '-';
# Takes a sequence and returns letter => occurrence count pairs as hash.
sub getcounts {
  my( $sequence ) = @_;
  my %counts;
  for my $chr (@alphabet) {
    $counts{$chr} = () = $sequence =~ /($chr)/g;
    # () = forces list context
  }
  $counts{'gap'} = () = $sequence =~ /($gapchr)/g;
  return %counts if wantarray; # list context
  return \%counts; # scalar context
  # which is what happens inside of %{  }
}

# Takes a sequence and returns letter => fractional composition pairs as a hash
sub comp {
  my( $sequence ) = @_;
  my %counts = getcounts( $sequence );
  my %comp;
  for my $chr (@alphabet) {
    $comp{$chr} = $comp{$chr} / ( length( $sequence ) - $counts{'gap'} );
  }
  return %comp if wantarray; # list context
  return \%comp; # scalar context
}

# Takes two sequences and returns a measure of the composition difference
#   between them, as a scalar.
sub dcomp {
  my( $seq1, $seq2 ) = @_;
  my @dcomp = pairwise { $a - $b }
    @{[ values( %{ comp( $seq1 ) } ) ]},
    @{[ values( %{ comp( $seq2 ) } ) ]};
  # @{[ ]} makes a list into an array reference, then dereferences it.
  # values always returns a list
  # a list, or array in scalar context, returns the number of elements
  # ${  } @{  } and %{  } forces their contents into scalar context

  @dcomp = apply { $_ ** 2 } @dcomp;
  my $dcomp = sqrt( sum( 0, @dcomp ) ) / 20;
  return $dcomp;
}

One of the most important things you need to know are the differences between scalar, list, and void contexts. This is because everything behaves different, in the different contexts.

Brad Gilbert
Ah, yes I see about the contexts. 'if wantarray' is a neat trick.
reve_etrange