views:

302

answers:

5

Given these inputs:

my $init_seq = "AAAAAAAAAA" #length 10 bp 
my $sub_rate = 0.003;
my $nof_tags = 1000;
my @dna = qw( A C G T );

I want to generate:

  1. One thousand length-10 tags

  2. Substitution rate for each position in a tag is 0.003

Yielding output like:

AAAAAAAAAA
AATAACAAAA
.....
AAGGAAAAGA # 1000th tags

Is there a compact way to do it in Perl?

I am stuck with the logic of this script as core:

#!/usr/bin/perl

my $init_seq = "AAAAAAAAAA" #length 10 bp 
my $sub_rate = 0.003;
my $nof_tags = 1000;
my @dna = qw( A C G T );

    $i = 0;
    while ($i < length($init_seq)) {
        $roll = int(rand 4) + 1;       # $roll is now an integer between 1 and 4

        if ($roll == 1) {$base = A;}
        elsif ($roll == 2) {$base = T;}
        elsif ($roll == 3) {$base = C;}
        elsif ($roll == 4) {$base = G;};

        print $base;
    }
    continue {
        $i++;
    }
+3  A: 

EDIT: Assuming substitution rate is in the range 0.001 to 1.000:

As well as $roll, generate another (pseudo)random number in the range [1..1000], if it is less than or equal to (1000 * $sub_rate) then perform the substitution, otherwise do nothing (i.e. output 'A').

Be aware that you may introduce subtle bias unless the properties of your random number generator are known.

Mitch Wheat
rand() returns a number in the range [0,1), so can be directly compared to $sub_rate without any 1000 *.
ysth
+4  A: 

As a small optimisation, replace:

    $roll = int(rand 4) + 1;       # $roll is now an integer between 1 and 4

    if ($roll == 1) {$base = A;}
    elsif ($roll == 2) {$base = T;}
    elsif ($roll == 3) {$base = C;}
    elsif ($roll == 4) {$base = G;};

with

    $base = $dna[int(rand 4)];
Penfold
+0. It's a nice optimisation, but it allows a "mutation" from a G to a G.
j_random_hacker
+2  A: 

Not exactly what you are looking for, but I suggest you take a look at BioPerl's Bio::SeqEvolution::DNAPoint module. It does not take mutation rate as a parameter though. Rather, it asks what the lower bound of sequence identity with the original you want.

use strict;
use warnings;
use Bio::Seq;
use Bio::SeqEvolution::Factory;

my $seq = Bio::Seq->new(-seq => 'AAAAAAAAAA', -alphabet => 'dna');

my $evolve = Bio::SeqEvolution::Factory->new (
   -rate     => 2,      # transition/transversion rate
   -seq      => $seq
   -identity => 50      # At least 50% identity with the original
);


my @mutated;
for (1..1000) { push @mutated, $evolve->next_seq }

All 1000 mutated sequences will be stored in the @mutated array, their sequences can be accessed via the seq method.

brunov
+1  A: 

In the event of a substitution, you want to exclude the current base from the possibilities:

my @other_bases = grep { $_ ne substr($init_seq, $i, 1) } @dna;
$base = @other_bases[int(rand 3)];

Also please see Mitch Wheat's answer for how to implement the substitution rate.

j_random_hacker
+1  A: 

I don't know if I understand correctly but I'd do something like this (pseudocode):

digits = 'ATCG'
base = 'AAAAAAAAAA'
MAX = 1000
for i = 1 to len(base)
  # check if we have to mutate
  mutate = 1+rand(MAX) <= rate*MAX
  if mutate then
    # find current A:0 T:1 C:2 G:3
    current = digits.find(base[i])
    # get a new position 
    # but ensure that it is not current
    new = (j+1+rand(3)) mod 4        
    base[i] = digits[new]
  end if
end for
Ismael