views:

554

answers:

4

I have a code below that try to identify the position of start and end codon of the given DNA sequences. We define start codon as a ATG sequence and end codon as TGA,TAA,TAG sequences.

The problem I have is that the code below works only for first two sequences (DM208659 and AF038953) but not the rest.

What's wrong with my approach below?

This code can be copy-pasted from here.

      #!/usr/bin/perl -w


while (<DATA>) {
    chomp;
    print "$_\n";
    my ($id,$rna_sq) = split(/\s+/,$_);

    local $_ = $rna_sq;
    while (/atg/g) {
        my $start = pos() - 2;

        if (/tga|taa|tag/g) {

            my $stop    = pos();
            my $gene    = substr( $_, $start - 1, $stop - $start + 1 ),$/;
            my $genelen = length($gene);
            my $ct      = "$id $start $stop $gene $genelen";
            print "\t$ct\n";

        }

    }

}

__DATA__
DM208659    gtgggcctcaaatgtggagcactattctgatgtccaagtggaaagtgctgcgacatttgagcgtcac
AF038953    gatcccagacctcggcttgcagtagtgttagactgaagataaagtaagtgctgtttgggctaacaggatctcctcttgcagtctgcagcccaggacgctgattccagcagcgccttaccgcgcagcccgaagattcactatggtgaaaatcgccttcaatacccctaccgccgtgcaaaaggaggaggcgcggcaagacgtggaggccctcctgagccgcacggtcagaactcagatactgaccggcaaggagctccgagttgccacccaggaaaaagagggctcctctgggagatgtatgcttactctcttaggcctttcattcatcttggcaggacttattgttggtggagcctgcatttacaagtacttcatgcccaagagcaccatttaccgtggagagatgtgcttttttgattctgaggatcctgcaaattcccttcgtggaggagagcctaacttcctgcctgtgactgaggaggctgacattcgtgaggatgacaacattgcaatcattgatgtgcctgtccccagtttctctgatagtgaccctgcagcaattattcatgactttgaaaagggaatgactgcttacctggacttgttgctggggaactgctatctgatgcccctcaatacttctattgttatgcctccaaaaaatctggtagagctctttggcaaactggcgagtggcagatatctgcctcaaacttatgtggttcgagaagacctagttgctgtggaggaaattcgtgatgttagtaaccttggcatctttatttaccaactttgcaataacagaaagtccttccgccttcgtcgcagagacctcttgctgggtttcaacaaacgtgccattgataaatgctggaagattagacacttccccaacgaatttattgttgagaccaagatctgtcaagagtaagaggcaacagatagagtgtccttggtaataagaagtcagagatttacaatatgactttaacattaaggtttatgggatactcaagatatttactcatgcatttactctattgcttatgccgtaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
BC021011    ggggagtccggggcggcgcctggaggcggagccgcccgctgggctaaatggggcagaggccgggaggggtgggggttccccgcgccgcagccatggagcagcttcgcgccgccgcccgtctgcagattgttctg
DM208660    gggatactcaaaatgggggcgctttcctttttgtctgtactgggaagtgcttcgattttggggtgtccc
AF038954    ggacccaagggggccttcgaggtgccttaggccgcttgccttgctctcagaatcgctgccgccatggctagtcagtctcaggggattcagcagctgctgcaggccgagaagcgggcagccgagaaggtgtccgaggcccgcaaaagaaagaaccggaggctgaagcaggccaaagaagaagctcaggctgaaattgaacagtaccgcctgcagagggagaaagaattcaaggccaaggaagctgcggcattgggatcccgtggcagttgcagcactgaagtggagaaggagacccaggagaagatgaccatcctccagacatacttccggcagaacagggatgaagtcttggacaacctcttggcttttgtctgtgacattcggccagaaatccatgaaaactaccgcataaatggatagaagagagaagcacctgtgctgtggagtggcattttagatgccctcacgaatatggaagcttagcacagctctagttacattcttaggagatggccattaaattatttccatatattataagagaggtccttccactttttggagagtagccaatctagctttttggtaacagacttagaaattagcaaagatgtccagctttttaccacagattcctgagggattttagatgggtaaatagagtcagactttgaccaggttttgggcaaagcacatgtatatcagtgtggacttttcctttcttagatctagtttaaaaaaaaaaaccccttaccattctttgaagaaaggaggggattaaataattttttcccctaacactttcttgaaggtcaggggctttatctatgaaaagttagtaaatagttctttgtaacctgtgtgaagcagcagccagccttaaagtagtccattcttgctaatggttagaacagtgaatactagtggaattgtttgggctgcttttagtttctcttaatcaaaattactagatgatagaattcaagaacttgttacatgtattacttggtgtatcgataatcatttaaaagtaaagactctgtcatgcaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+1  A: 

It's never breaking out of your inner loop when the if (/tga|taa|tag/g) fails to find an end codon. It keeps matching /atg/g repeatedly, never advancing any further. You could forcibly eject it from the inner loop:

if (/tga|taa|tag/g) {
    ...
}
else {
    last;
}
John Kugelman
+4  A: 

I removed the use of $_ (I especially shuddered when you localized it -- you did so correctly, but why force yourself to worry if some other function is going to clobber $_, rather than use $rna_sq which is already available?

Additionally I corrected $start and $stop to be 0-based indexes into the string (which made the rest of the math more straight-forward), and calculated $genelen early so it could be used directly in the substr operation. (Alternatively, you could localize $[ to 1 to use 1-based array indexes, see perldoc perlvar.)

use strict;
use warnings;
while (my $line = <DATA>) {
    chomp $line;
    print "processing $line\n";
    my ($id, $rna_sq) = split(/\s+/, $line);

    while ($rna_sq =~ /atg/g) {
        # $start and $stop are 0-based indexes
        my $start = pos($rna_sq) - 3; # back up to include the start sequence

        # discard remnant if no stop sequence can be found
        last unless $rna_sq =~ /tga|taa|tag/g;

        my $stop    = pos($rna_sq);
        my $genelen = $stop - $start;

        my $gene    = substr($rna_sq, $start, $genelen);
        print "\t" . join(' ', $id, $start+1, $stop, $gene, $genelen) . "\n";
    }
}
Ether
+1  A: 

It all depends on whether you want to generate sequences which could overlap. For example, sequence AF038954 contains atgaccatcctccagacatacttccggcagaacagggatga, the end of which overlaps with atgaagtcttggacaacctcttggcttttgtctgtga. Do you want to report them both?

If you don't want to report sequences which overlap, this is a very simple problem, which you can solve with a single regexp:

while (<DATA>) {
    chomp;
    print "processing $_\n";
    my ($id, $rna_sq) = split;

    while ($rna_sq =~ /(atg.*?(?:tga|taa|tag))/g) {
      printf "\t%8s %4i %4i %s %i\n",
             $id,
             pos($rna_sq) - length($1) + 1,
             pos($rna_sq),
             $1,
             length($1);
      }
}

The regexp (atg.*?(?:tga|taa|tag)) matches your required start, then as little as possible of what comes next (that's the ? to stop the .* being "greedy") then your required end. Iterating over it with the while loop restarts after this match, which meets the requirement of not looking for overlaps.

If you do want overlapping sequences reported, you do need a two-stage process: find the start, find the end, and then find another start, picking up where you left off looking for the start the last time. But you can still do a simpler job using a second regexp:

while (<DATA>) {
    chomp;
    print "processing $_\n";
    my ($id, $rna_sq) = split;

    while ($rna_sq =~ /atg/g) {
      if ($' =~ /(.*?(?:tga|taa|tag))/) {
        my $match = "atg$1";
        printf "\t%8s %4i %4i %s %i\n",
               $id,
               pos($rna_sq) - 2,
               pos($rna_sq) - 3 + length($match),
               $match,
               length($match);
      }
    }
}

Here we use the (generally non-recommended) $' special variable, which contains the content after the match. We look in this to find the end of the sequence and output the details. Because our main global match against $rna_seq doesn't include the sequence (as it does above) we restart the search for a start where the previous search left off, that is just after the start we found. This way we do include overlapping sequences.

Tim
A: 

Hello! Perhaps you can help me-

How can I extract DNA sequence from genome browser (UCSC) if I have coordinates on chromosomes (there are very much coordinates)?

Thank you!

fed