views:

80

answers:

2

I have a program output with one tandem repeat in different variants. Is it possible to search (in a string) for the motif and to tell the program to find all variants with maximum "3" mismatches/insertions/deletions?

+4  A: 

I will take a crack at this with the very limited information supplied.

First, a short friendly editorial:

<editorial>

Please learn how to ask a good question and how to be precise.

At a minimum, please:

  • Refrain from domain specific jargon such as "motif" and "tandem repeat" and "base pairs" without providing links or precise definitions;
  • Say what the goal is and what you have done so far;
  • Important: Provide clear examples of input and desired output.

It is not helpful to potential helpers on SO have to have to play 20 questions in comments to try and understand your question! I spent more time trying to figure out what you were asking than answering it.

</editorial>

The following program generates a string of 2 character pairs 5,428 pairs long in an array of 1,000 elements long. I realize it is more likely that you will be reading these from a file, but this is just an example. Obviously you would replace the random strings with your actual data from whatever source.

I do not know if 'AT','CG','TC','CA','TG','GC','GG' that I used are legitimate base pair combinations or not. (I slept through biology...) Just edit the map block pairs to legitimate pairs and change the 7 to the number of pairs if you want to generate legitimate random strings for testing.

If the substring at the offset point is 3 differences or less, the array element (a scalar value) is stored in an anonymous array in the value part of a hash. The key part of the hash is the substring that is a near match. Rather than array elements, the values could be file names, Perl data references or other relevant references you want to associate with your motif.

While I have just looked at character by character differences between the strings, you can put any specific logic that you need to look at by replacing the line foreach my $j (0..$#a1) { $diffs++ unless ($a1[$j] eq $a2[$j]); } with the comparison logic that works for your problem. I do not know how mismatches/insertions/deletions are represented in your string, so I leave that as an exercise to the reader. Perhaps Algorithm::Diff or String::Diff from CPAN?

It is easy to modify this program to have keyboard input for $target and $offset or have the string searched beginning to end rather than several strings at a fixed offset. Once again: it was not really clear what your goal is...

use strict; use warnings;

my @bps;
push(@bps,join('',map { ('AT','CG','TC','CA','TG','GC','GG')[rand 7] } 
       0..5428)) for(1..1_000);

my $len=length($bps[0]);
my $s_count= scalar @bps;

print "$s_count random strings generated $len characters long\n" ;

my $target="CGTCGCACAG";
my $offset=832;
my $nlen=length $target;
my %HoA;
my $diffs=0;
my @a2=split(//, $target);
substr($bps[-1], $offset, $nlen)=$target; #guarantee 1 match
substr($bps[-2], $offset, $nlen)="CATGGCACGG"; #anja example

foreach my $i (0..$#bps) {
    my $cand=substr($bps[$i], $offset, $nlen);
    my @a1=split(//, $cand);
    $diffs=0;
    foreach my $j (0..$#a1) { $diffs++ unless ($a1[$j] eq $a2[$j]); }
    next if $diffs > 3;
    push (@{$HoA{$cand}}, $i); 
}

foreach my $hit (keys %HoA) {
    my @a1=split(//, $hit);
    $diffs=0;
    my $ds="";
    foreach my $j (0..$#a1) { 
        if($a1[$j] eq $a2[$j]) {
            $ds.=" ";
        } else {
            $diffs++;
            $ds.=$a1[$j];
        }
    }   
    print "Target:       $target\n",
          "Candidate:    $hit\n",
          "Differences:  $ds       $diffs differences\n",
          "Array element: ";
    foreach (@{$HoA{$hit}}) {
         print "$_ " ;
     }
     print "\n\n";
}

Output:

1000 random strings generated 10858 characters long
Target:       CGTCGCACAG
Candidate:    CGTCGCACAG
Differences:                   0 differences
Array element: 999 

Target:       CGTCGCACAG
Candidate:    CGTCGCCGCG
Differences:        CGC        3 differences
Array element: 696 

Target:       CGTCGCACAG
Candidate:    CGTCGCCGAT
Differences:        CG T       3 differences
Array element: 851 

Target:       CGTCGCACAG
Candidate:    CGTCGCATGG
Differences:         TG        2 differences
Array element: 986 

Target:       CGTCGCACAG
Candidate:    CATGGCACGG
Differences:   A G    G        3 differences
Array element: 998 

..several cut out.. 

Target:       CGTCGCACAG
Candidate:    CGTCGCTCCA
Differences:        T CA       3 differences
Array element: 568 926 
drewk
The variant between 823 and 832 was just an example, between this one cant lay something else. The other variants are located somewhere else in the string... So I can't tell the program to look for this motif in different variants?
Anja
Sure, it can be indexed to any offset you want. The only thing I don't understand is "deletions" Does this mean that "CGTCGCACAG" might become shorter? From a Perl perspective, this is easy if the string is fixed length or delimited in some way. If you need to go through the string and interpret base pairs and infer what happened, that is harder. See the difference?
drewk
okay, I think I can ignore deletion as I don't have gaps in my sequence. So the only important thing is mismatch (e.g. from G to A). The only limitation is given by the motif (it has always the same length) as I give it through the keyboard. Then the program should search for the different variants (with maximum 3 mismatches)
Anja
I've been thinking that StackOverflow needs its own "How do I ask a good question page". This would be a good start.
brian d foy
@brian d foy: SO does need it own "How to ask a good question page!" Thanks for saying that is a good start. With Eric Raymond, I sit on the shoulder of a giant...
drewk
+1  A: 

I believe that there are routines for this sort of thing in BioPerl.

In any case, you might get better answers if you asked this over at BioStar, the bioinformatics stack exchange.

chrisamiller