ansaurus

Question

Looking for elegant glob-like DNA string expansion

Answer 1

A:

This isn't really an "expansion" problem and it's almost certainly not doable with any sensible regular expression.

I believe what you're looking for is "how to generate permutations".

Alnitak 2009-07-08 14:31:13

I was thinking of expansion in terms of you would say a glob expands the matches of a character set. But, you're right it does feel more like a permutation.

Rich 2009-07-08 17:52:26

Answer 2

A:

You could for example do this recursively. Pseudo-code:

printSequences(sequence s)
  switch "first special character in sequence"
    case ...
    case M:
      s1 = s, but first M replaced with A
      printSequences(s1)
      s2 = s, but first M replaced with C
      printSequences(s2)
    case none:
      print s;

schnaader 2009-07-08 14:35:40

Answer 3

+1 A:

You probably could do something like this in python using the yield operator

def glob(str):
      if str=='':           
          yield ''
          return      

      if str[0]!='M':
          for tail in glob(str[1:]): 
              yield str[0] + tail                  
      else:
         for c in ['A','G','C','T']:
             for tail in glob(str[1:]):
                 yield c + tail                 
      return

EDIT: As correctly pointed out I was making a few mistakes. Here is a version which I tried out and works.

Il-Bhima 2009-07-08 14:37:12

I'm a Python newbie so I don't know too much about generators. I get the idea of this, but I don't understand the lines that concatenate a character to the generator. Is that possible?

Rich 2009-07-08 18:06:11

Answer 4

A:

Regexps match strings, they're not intended to be turned into every string they might match.

Also, you're looking at a lot of strings being output from this - for instance:

MMMMMMMMMMMMMMMM (16 M's)

produces 65,536 16 character strings - and I'm guessing that DNA sequences are usually longer than that.

Arguably any solution to this is pretty much 'brute force' from a computer science perspective, because your algorithm is O(2^n) on the original string length. There's actually quite a lot of work to be done.

Why do you want to produce all the combinations? What are you going to do with them? (If you're thinking to produce every string possibility and then look for it in a large DNA sequence, then there are much better ways of doing that.)

ijw 2009-07-08 14:45:23

Fortunately I'm looking through about 80 small snippets of DNA, none of which have more than 8 of these generating bases. Once the entire list is expanded it should easily fit on disk.You're right, this is part of a project that compares these strings against a larger DNA sequence, but I've already done that by directly using the compressed format. For political reasons I now need to generate the entire list of sequences.

Rich 2009-07-08 17:49:27

Answer 5

+2 A:

Agree with other posters that it seems like a strange thing to want to do. Of course, if you really want to, there is (as always) an elegant way to do it in Python (2.6+):

from itertools import product
map("".join, product(*[['A', 'C'] if x == "M" else [x] for x in "GMTTMCA"]))

Full solution with input handling:

import sys
from itertools import product

base_globs = {"M":['A','C'], "R":['A','G'], "W":['A','T'],
              "S":['C','G'], "Y":['C','T'], "K":['G','T'],

              "V":['A','C','G'], "H":['A','C','T'],
              "D":['A','G','T'], "B":['C','G','T'],
              }

def base_glob(glob_sequence):
    production_sequence = [base_globs.get(base, [base]) for base in glob_sequence]
    return map("".join, product(*production_sequence))

for line in sys.stdin.readlines():
    productions = base_glob(line.strip())
    print "\n".join(productions)

Joakim Lundborg 2009-07-08 14:54:21

This looks interesting, I'll see if I can get it do what I want.

Rich 2009-07-08 17:50:12

I think I got something working. If you don't mind, could you look at the solution and let me know if there's a cleaner way to handle the outer for loop?

Rich 2009-07-08 18:59:38

Sure thing, I added my version of the full solution, with some more pythonicisms thrown in for good measure. BTW: You are missing quite a few permutiations in your glob grammar:from itertools import permutationslist(permutations("ACGT", 2))list(permutations("ACGT", 2))

Joakim Lundborg 2009-07-09 08:23:06

instead of "base_globs[base] if base in base_globs else [base]" you could use "base_globs.get(x, [x])", it's a little bit more concise...

fortran 2009-07-23 07:17:34

ansaurus

tags:

views:

answers:

Looking for elegant glob-like DNA string expansion

related questions