tags:

views:

576

answers:

9

I have a millions of pairs of string of same length which I want to compare and find the position where it has mismatches.

For example for each $str1 and $str2 we want to find mismatch position with $str_source:

$str_source = "ATTCCGGG";

$str1       = "ATTGCGGG"; # 1 mismatch with Str1 at position 3 (0-based)
$str2       = "ATACCGGC"; # 2 mismatches with source at position  2 and 7

Is there a fast way to do it. Currently I have the C style method which I loop every position in both strings using 'substr' function. But this approach is horribly slow.

my @mism_pos;
for $i (0 .. length($str_source)) {
   $source_base = substr($str_source,$i,1);
   $str_base    = substr($str2,$i,$1);

  if ($source_base ne $str_base) {
     push @mism_pos,$i;
  }

}
+4  A: 

It sounds like this might be a performance critical part of your application. In this case, you may want to consider writing a C extension method to do the comparison.

Perl provides the XS extension mechanism which makes this reasonably straightforward.

Greg Hewgill
Inline is easier to use for small stuff like this.
JonB
+3  A: 

You're making 2 calls to substr for each character comparison which is probably what's slowing you down.

A few optimizations I would make

@source = split //,$str_source  #split first rather than substr
@base = split //, $str_base

for $i (0 .. length($str_source)) {
   $mism_pos{$1} = 1 if ($source[$i] ne $base); #hashing is faster than array push
}

return keys $mism_pos
Charles Ma
Although you are correct about the extra call to substr, you would need to profile this to prove that it's faster.
Kinopiko
I suspect this will be somewhat slower. Although it looks like a "heavy" function call, substr() is very fast -- internally it's just an array lookup. OTOH building arrays of one-character strings requires memory allocation and deallocation, and the memory overhead is significant. But as Kinopiko says, profile it ;)
j_random_hacker
+1  A: 

I was going to say, "write it in C" too.

Once there you can use optimization like comparing 4 characters at once (as 32-bit integers).

Or change your representation (4-letter, right?) to use 2-bit to represent a base (?), so that you can compare 16 characters at once.

pascal
+3  A: 

The fastest way to compare the strings to find differences would be to XOR each byte of them together then test for zero. If I had to do this I would just write a program in C to do the difference job rather than writing a C extension to Perl, then I would run my C program as a subprocess of Perl. The exact algorithm would depend on the length of the strings and the amount of data. However this would not take more than 100 lines of C. In fact, if you want to maximize speed, a program to XOR bytes of fixed-length strings and test for zero could be written in assembly language.

Kinopiko
If you're going to go all the way, be sure to compare 32-bit or 64-bit words instead of just bytes. :)
Greg Hewgill
I asked for the length of the strings but haven't got a reply yet.
Kinopiko
Do you mean using a clever bitwise hack to test *any* byte in a word for zero in a single op? If so, please explain the trick a bit. If not, I daresay XORing bytes then checking bytes for zero is no faster than comparing the bytes directly... :)
j_random_hacker
Assuming the alphabet is four letters, a string of eight letters would fit in a 32 bit word. Reduce each string to a word, XOR it and then if this XOR is not zero, AND the result with eight bit masks to find which have changed. This could be applied to longer sequences too. "Comparing the bytes" means subtracting them, in which case this would not work due to overflows.
Kinopiko
Gotcha, nice, +1. I was confused, thinking we were looking for any zero byte rather than any *nonzero* byte. (BTW check out Method 4 at http://www.stdlib.net/~colmmacc/2009/03/01/optimising-strlen/ for a neat way to handle *that* problem.)
j_random_hacker
+5  A: 

Those look like gene sequences. If the strings are all 8-characters, and the domain of possible codes is ( A, C, G, T ) you might consider transforming the data somehow before processing it. That would give you only 65536 possible strings, so you can specialise your implementation.

For example, you write a method that takes an 8-character string and maps it to an integer. Memoize that so that the operation will be quick. Next, write a comparison function, that given two integers, tells you how they differ. You would call this in a suitable looping construct with a numeric equality test like unless ( $a != $b ) before calling the comparison - a short circuit for identical codes if you will.

martin clayton
+1, but on reflection the amount of work you need to do to build the memoisation "key" probably dwarfs the computation cost! You would need to either do a straight hash lookup of a string (which internally Perl will surely do by looping over its characters), or compute an integer key by bit-shifting each nucleotide in 2 bits at a time -- i.e. solve the problem to get the key you need to solve the problem! :)
j_random_hacker
+14  A: 

Inline::C


The computation is easy, do it with Inline::C (read perldoc Inline::C-Cookbook and perldoc Inline::C for documentation):

use Inline C => << '...';                                                       
  void find_diffs(char* x, char* y) {                                           
    int i;                                                                      
    Inline_Stack_Vars;                                                          
    Inline_Stack_Reset;                                                         
    for(i=0; x[i] && y[i]; ++i) {                                               
      if(x[i] != y[i]) {                                                        
        Inline_Stack_Push(sv_2mortal(newSViv(i)));                              
      }                                                                         
    }                                                                           
    Inline_Stack_Done;                                                          
  }                                                                             
...                                                                             

@diffs= find_diffs("ATTCCGGG","ATTGCGGG");  print "@diffs\n";                   
@diffs= find_diffs("ATTCCGGG","ATACCGGC");  print "@diffs\n";

Here is the output of this script:

> script.pl 
3
2 7

PDL

If you want to process a lot of data fast in Perl, learn PDL (Documentation):

use PDL; 
use PDL::Char;                                                                  
$PDL::SHARE=$PDL::SHARE; # keep stray warning quiet 

my $source=PDL::Char->new("ATTCCGGG");                                          
for my $str ( "ATTGCGGG", "ATACCGGC") {                                         
  my $match =PDL::Char->new($str);                                              
  my @diff=which($match!=$source)->list;                                        
  print "@diff\n";                                                              
}

(Same output as first script.)

Notes: I used PDL very happily in genomic data processing. Together with memory mapped access to data stored on the disk, huge amounts of data can be processed quickly: all processing is done in highly optimized C loops. Also, you can easily access the same data through Inline::C for any features missing in PDL.

Note however, that the creation of one PDL vector is quite slow (constant time, it's acceptable for large data structures). So, you rather want to create one large PDL object with all your input data in one go than looping over individual data elements.

Yaakov Belch
+4  A: 

Here is a benchmarking script to figure out if the differences in speed of various approaches. Just keep in mind that there will be a lag the first time a script using Inline::C is invoked as the C compiler is invoked etc. So, run the script once, and then benchmark.

#!/usr/bin/perl

use strict;
use warnings;

use Benchmark qw( cmpthese );

my ($copies) = @ARGV;
$copies ||= 1;

my $x = 'ATTCCGGG' x $copies;
my $y = 'ATTGCGGG' x $copies;
my $z = 'ATACCGGC' x $copies;

sub wrapper { 
    my ($func, @args) = @_;
    for my $s (@args) {
        my $differences = $func->($x, $s);
        # just trying to ensure results are not discarded
        if ( @$differences == 0 ) { 
            print "There is no difference\n";
        }
    }
    return;
}

cmpthese -5, {
    explode  => sub { wrapper(\&where_do_they_differ, $y, $z) },
    mism_pos => sub { wrapper(\&mism_pos, $y, $z) },
    inline_c => sub {
        wrapper(\&i_dont_know_how_to_do_stuff_with_inline_c, $y, $z) },
};

sub where_do_they_differ {
    my ($str1, $str2) = @_;
    my @str1 = split //, $str1;
    my @str2 = split //, $str2;
    [ map {$str1[$_] eq $str2[$_] ? () : $_} 0 .. length($str1) - 1 ];
}

sub mism_pos {
    my ($str1, $str2) = @_;
    my @mism_pos;

    for my $i (0 .. length($str1) - 1) {
        if (substr($str1, $i, 1) ne substr($str2, $i, 1) ) {
            push @mism_pos, $i;
        }
    }
    return \@mism_pos;
}

sub i_dont_know_how_to_do_stuff_with_inline_c {
    [ find_diffs(@_) ];
}

use Inline C => << 'EOC';
    void find_diffs(char* x, char* y) {
        int i;
        Inline_Stack_Vars;
        Inline_Stack_Reset;
        for(i=0; x[i] && y[i]; ++i) {
            if(x[i] != y[i]) {
                Inline_Stack_Push(sv_2mortal(newSViv(i)));
            }
        }
        Inline_Stack_Done;
    }
EOC

Results (using VC++ 9 on Windows XP with AS Perl 5.10.1) with $copies = 1:

            Rate  explode mism_pos inline_c
explode  15475/s       --     -64%     -84%
mism_pos 43196/s     179%       --     -56%
inline_c 98378/s     536%     128%       --

Results with $copies = 100:

            Rate  explode mism_pos inline_c
explode    160/s       --     -86%     -99%
mism_pos  1106/s     593%       --     -90%
inline_c 10808/s    6667%     877%       --
Sinan Ünür
+2  A: 

Some classic string compare optimizations:

optimal mismatch - start comparing at the END of the search string. e.g. search for ABC in ABDABEABF if you compare at the beginning you will move along the pattern one char at a time. If you search from the end you will be able to jump along three chars

bad character heuristic - select the least commonly occurring char and search on that first. e.g. in english a 'z' character is rare and good string search functions will take a search for 'maze' and start comparing on the 3rd char

james
+1  A: 

I don't know how efficient it is, but you could always xor the two strings you are matching, and find the index of the first mismatch.

#! /usr/bin/env perl
use strict;
use warnings;
use 5.10.1;

my $str_source = "ATTCCGGG";

my $str1       = "ATTGCGGG";
my $str2       = "ATACCGGC";
my $str3       = "GTTCCGGG";

# this returns the index of all of the mismatches (zero based)
# it returns an empty list if the two strings match.
sub diff_index{
  my($a,$b) = @_;
  my $cmp = $a^$b;

  my @cmp;
  while( $cmp =~ /[^\0]/g ){ # match non-zero byte
    push @cmp, pos($cmp) - 1;
  }
  return @cmp;
}

for my $str ( $str_source, $str1, $str2, $str3 ){
  say '# "', $str, '"';
  my @ret = diff_index $str_source, $str;
  if( @ret ){
    say '[ ', join( ', ', @ret), ' ]';
  }else{
    say '#   match';
  }
}
# "ATTCCGGG"
#   match
# "ATTGCGGG"
[ 3 ]
# "ATACCGGC"
[ 2, 7 ]
# "GTTCCGGG"
[ 0 ]


Running it through B::Concise shows that the CPU expensive operations, happen as single opcodes. Which means that those operations are run in C.

perl -MO=Concise,-exec,-compact,-src,diff_index test.pl |
perl -pE's/^[^#].*? \K([^\s]+)$/# $1/' # To fix highlighting bugs
main::diff_index:
# 15:   my($a,$b) = @_;
1  <;> nextstate(main 53 test.pl:15) # v:%,*,&,$
2  <0> pushmark # s
3  <$> gv(*_) # s
4  <1> rv2av[t3] # lK/3
5  <0> pushmark # sRM*/128
6  <0> padsv[$a:53,58] # lRM*/LVINTRO
7  <0> padsv[$b:53,58] # lRM*/LVINTRO
8  <2> aassign[t4] # vKS
# 16:   my $cmp = $a^$b;
9  <;> nextstate(main 54 test.pl:16) # v:%,*,&,$
a  <0> padsv[$a:53,58] # s
b  <0> padsv[$b:53,58] # s
c  <2> bit_xor[t6] # sK                     <-----  Single OP -----
d  <0> padsv[$cmp:54,58] # sRM*/LVINTRO
e  <2> sassign # vKS/2
# 18:   my @cmp;
f  <;> nextstate(main 55 test.pl:18) # v:%,*,&,{,$
g  <0> padav[@cmp:55,58] # vM/LVINTRO
# 20:   while( $cmp =~ /[^\0]/g ){ # match non-zero byte
h  <;> nextstate(main 57 test.pl:20) # v:%,*,&,{,$
i  <{> enterloop(next->r last->v redo->j) # v
s  <0> padsv[$cmp:54,58] # s
t  </> match(/"[^\\0]"/) # sKS/RTIME        <-----  Single OP -----
u  <|> and(other->j) # vK/1
# 21:     push @cmp, pos($cmp) - 1;
j      <;> nextstate(main 56 test.pl:21) # v:%,*,&,$
k      <0> pushmark # s
l      <0> padav[@cmp:55,58] # lRM
m      <0> padsv[$cmp:54,58] # sRM
n      <1> pos[t8] # sK/1
o      <$> const(IV 1) # s
p      <2> subtract[t9] # sK/2
q      <@> push[t10] # vK/2
r      <0> unstack # v
           goto # s
v  <2> leaveloop # vK/2
# 24:   return @cmp;
w  <;> nextstate(main 58 test.pl:24) # v:%,*,&,{,$
x  <0> pushmark # s
y  <0> padav[@cmp:55,58] 
z  <@> return # K
10 <1> leavesub[1 ref] # K/REFC,1
Brad Gilbert