ansaurus

Question

How can I extract start and end codon from DNA sequences in Perl?

Answer 1

+1 A:

It's never breaking out of your inner loop when the if (/tga|taa|tag/g) fails to find an end codon. It keeps matching /atg/g repeatedly, never advancing any further. You could forcibly eject it from the inner loop:

if (/tga|taa|tag/g) {
    ...
}
else {
    last;
}

John Kugelman 2009-10-13 04:14:56

Answer 2

+4 A:

I removed the use of $_ (I especially shuddered when you localized it -- you did so correctly, but why force yourself to worry if some other function is going to clobber $_, rather than use $rna_sq which is already available?

Additionally I corrected $start and $stop to be 0-based indexes into the string (which made the rest of the math more straight-forward), and calculated $genelen early so it could be used directly in the substr operation. (Alternatively, you could localize $[ to 1 to use 1-based array indexes, see perldoc perlvar.)

use strict;
use warnings;
while (my $line = <DATA>) {
    chomp $line;
    print "processing $line\n";
    my ($id, $rna_sq) = split(/\s+/, $line);

    while ($rna_sq =~ /atg/g) {
        # $start and $stop are 0-based indexes
        my $start = pos($rna_sq) - 3; # back up to include the start sequence

        # discard remnant if no stop sequence can be found
        last unless $rna_sq =~ /tga|taa|tag/g;

        my $stop    = pos($rna_sq);
        my $genelen = $stop - $start;

        my $gene    = substr($rna_sq, $start, $genelen);
        print "\t" . join(' ', $id, $start+1, $stop, $gene, $genelen) . "\n";
    }
}

Ether 2009-10-13 04:54:27

Answer 3

+1 A:

It all depends on whether you want to generate sequences which could overlap. For example, sequence AF038954 contains atgaccatcctccagacatacttccggcagaacagggatga, the end of which overlaps with atgaagtcttggacaacctcttggcttttgtctgtga. Do you want to report them both?

If you don't want to report sequences which overlap, this is a very simple problem, which you can solve with a single regexp:

while (<DATA>) {
    chomp;
    print "processing $_\n";
    my ($id, $rna_sq) = split;

    while ($rna_sq =~ /(atg.*?(?:tga|taa|tag))/g) {
      printf "\t%8s %4i %4i %s %i\n",
             $id,
             pos($rna_sq) - length($1) + 1,
             pos($rna_sq),
             $1,
             length($1);
      }
}

The regexp (atg.*?(?:tga|taa|tag)) matches your required start, then as little as possible of what comes next (that's the ? to stop the .* being "greedy") then your required end. Iterating over it with the while loop restarts after this match, which meets the requirement of not looking for overlaps.

If you do want overlapping sequences reported, you do need a two-stage process: find the start, find the end, and then find another start, picking up where you left off looking for the start the last time. But you can still do a simpler job using a second regexp:

while (<DATA>) {
    chomp;
    print "processing $_\n";
    my ($id, $rna_sq) = split;

    while ($rna_sq =~ /atg/g) {
      if ($' =~ /(.*?(?:tga|taa|tag))/) {
        my $match = "atg$1";
        printf "\t%8s %4i %4i %s %i\n",
               $id,
               pos($rna_sq) - 2,
               pos($rna_sq) - 3 + length($match),
               $match,
               length($match);
      }
    }
}

Here we use the (generally non-recommended) $' special variable, which contains the content after the match. We look in this to find the end of the sequence and output the details. Because our main global match against $rna_seq doesn't include the sequence (as it does above) we restart the search for a start where the previous search left off, that is just after the start we found. This way we do include overlapping sequences.

Tim 2009-10-13 08:57:32

Answer 4

A:

Hello! Perhaps you can help me-

How can I extract DNA sequence from genome browser (UCSC) if I have coordinates on chromosomes (there are very much coordinates)?

Thank you!

fed 2010-04-29 08:17:25

ansaurus

tags:

views:

answers:

How can I extract start and end codon from DNA sequences in Perl?

related questions