views:

276

answers:

5

Hello,

I'm trying to make a glob-like expansion of a set of DNA strings that have multiple possible bases.

The base of my DNA strings contains the letters A, C, G, and T. However, I can have special characters like M which could be an A or a C.

For example, say I have the string:

ATMM

I would like to take this string as input and output the four possible matching strings:

ATAA ATAC ATCA ATCC

Rather than brute force a solution, I feel like there must be some elegant Python/Perl/Regular Expression trick to do this.

Thank you for any advice.

Edit, thanks cortex for the product operator. This is my solution:

Still a Python newbie, so I bet there's a better way to handle each dictionary key than another for loop. Any suggestions would be great.

import sys
from itertools import product

baseDict = dict(M=['A','C'],R=['A','G'],W=['A','T'],S=['C','G'],
                  Y=['C','T'],K=['G','T'],V=['A','C','G'],
                  H=['A','C','T'],D=['A','G','T'],B=['C','G','T'])
def glob(str):
    strings = [str]

    ## this loop visits very possible base in the dictionary
    ## probably a cleaner way to do it
    for base in baseDict:
        oldstrings = strings
        strings = []
        for string in oldstrings:
            strings += map("".join,product(*[baseDict[base] if x == base 
                                 else [x] for x in string]))
    return strings

for line in sys.stdin.readlines():
    line = line.rstrip('\n')
    permutations = glob(line)
    for x in permutations:
        print x
A: 

This isn't really an "expansion" problem and it's almost certainly not doable with any sensible regular expression.

I believe what you're looking for is "how to generate permutations".

Alnitak
I was thinking of expansion in terms of you would say a glob expands the matches of a character set. But, you're right it does feel more like a permutation.
Rich
A: 

You could for example do this recursively. Pseudo-code:

printSequences(sequence s)
  switch "first special character in sequence"
    case ...
    case M:
      s1 = s, but first M replaced with A
      printSequences(s1)
      s2 = s, but first M replaced with C
      printSequences(s2)
    case none:
      print s;
schnaader
+1  A: 

You probably could do something like this in python using the yield operator

def glob(str):
      if str=='':           
          yield ''
          return      

      if str[0]!='M':
          for tail in glob(str[1:]): 
              yield str[0] + tail                  
      else:
         for c in ['A','G','C','T']:
             for tail in glob(str[1:]):
                 yield c + tail                 
      return

EDIT: As correctly pointed out I was making a few mistakes. Here is a version which I tried out and works.

Il-Bhima
I'm a Python newbie so I don't know too much about generators. I get the idea of this, but I don't understand the lines that concatenate a character to the generator. Is that possible?
Rich
A: 

Regexps match strings, they're not intended to be turned into every string they might match.

Also, you're looking at a lot of strings being output from this - for instance:

MMMMMMMMMMMMMMMM (16 M's)

produces 65,536 16 character strings - and I'm guessing that DNA sequences are usually longer than that.

Arguably any solution to this is pretty much 'brute force' from a computer science perspective, because your algorithm is O(2^n) on the original string length. There's actually quite a lot of work to be done.

Why do you want to produce all the combinations? What are you going to do with them? (If you're thinking to produce every string possibility and then look for it in a large DNA sequence, then there are much better ways of doing that.)

ijw
Fortunately I'm looking through about 80 small snippets of DNA, none of which have more than 8 of these generating bases. Once the entire list is expanded it should easily fit on disk.You're right, this is part of a project that compares these strings against a larger DNA sequence, but I've already done that by directly using the compressed format. For political reasons I now need to generate the entire list of sequences.
Rich
+2  A: 

Agree with other posters that it seems like a strange thing to want to do. Of course, if you really want to, there is (as always) an elegant way to do it in Python (2.6+):

from itertools import product
map("".join, product(*[['A', 'C'] if x == "M" else [x] for x in "GMTTMCA"]))

Full solution with input handling:

import sys
from itertools import product

base_globs = {"M":['A','C'], "R":['A','G'], "W":['A','T'],
              "S":['C','G'], "Y":['C','T'], "K":['G','T'],

              "V":['A','C','G'], "H":['A','C','T'],
              "D":['A','G','T'], "B":['C','G','T'],
              }

def base_glob(glob_sequence):
    production_sequence = [base_globs.get(base, [base]) for base in glob_sequence]
    return map("".join, product(*production_sequence))

for line in sys.stdin.readlines():
    productions = base_glob(line.strip())
    print "\n".join(productions)
Joakim Lundborg
This looks interesting, I'll see if I can get it do what I want.
Rich
I think I got something working. If you don't mind, could you look at the solution and let me know if there's a cleaner way to handle the outer for loop?
Rich
Sure thing, I added my version of the full solution, with some more pythonicisms thrown in for good measure. BTW: You are missing quite a few permutiations in your glob grammar:from itertools import permutationslist(permutations("ACGT", 2))list(permutations("ACGT", 2))
Joakim Lundborg
instead of "base_globs[base] if base in base_globs else [base]" you could use "base_globs.get(x, [x])", it's a little bit more concise...
fortran