views:

773

answers:

6

For example, The function could be something like def RandABCD(n, .25, .34, .25, .25):

Where n is the length of the string to be generated and the following numbers are the desired probabilities of A, B, C, D.

I would imagine this is quite simple, however i am having trouble creating a working program. Any help would be greatly appreciated.

+2  A: 

The random class is quite powerful in python. You can generate a list with the characters desired at the appropriate weights and then use random.choice to obtain a selection.

First, make sure you do an import random.

For example, let's say you wanted a truly random string from A,B,C, or D. 1. Generate a list with the characters li = ['A','B','C','D']

  1. Then obtain values from it using random.choice output = "".join([random.choice(li) for i in range(0, n)])

You could easily make that a function with n as a parameter.

In the above case, you have an equal chance of getting A,B,C, or D.

You can use duplicate entries in the list to give characters higher probabilities. So, for example, let's say you wanted a 50% chance of A and 25% chances of B and C respectively. You could have an array like this:

li = ['A','A','B','C']

And so on.

It would not be hard to parameterize the characters coming in with desired weights, to model that I'd use a dictionary.

characterbasis = {'A':25, 'B':25, 'C':25, 'D':25}

Make that the first parameter, and the second being the length of the string and use the above code to generate your string.

David in Dakota
+2  A: 

For four letters, here's something quick off the top of my head:

from random import random

def randABCD(n, pA, pB, pC, pD):
    # assumes pA + pB + pC + pD == 1
    cA = pA
    cB = cA + pB
    cC = cB + pC
    def choose():
        r = random()
        if r < cA:
           return 'A'
        elif r < cB:
           return 'B'
        elif r < cC:
           return 'C'
        else:
           return 'D'
    return ''.join([choose() for i in xrange(n)])

I have no doubt that this can be made much cleaner/shorter, I'm just in a bit of a hurry right now.

The reason I wouldn't be content with David in Dakota's answer of using a list of duplicate characters is that depending on your probabilities, it may not be possible to create a list with duplicates in the right numbers to simulate the probabilities you want. (Well, I guess it might always be possible, but you might wind up needing a huge list - what if your probabilities were 0.11235442079, 0.4072777384, 0.2297927874, 0.25057505341?)

EDIT: here's a much cleaner generic version that works with any number of letters with any weights:

from bisect import bisect
from random import uniform

def rand_string(n, content):
    ''' Creates a string of letters (or substrings) chosen independently
        with specified probabilities. content is a dictionary mapping
        a substring to its "weight" which is proportional to its probability,
        and n is the desired number of elements in the string.

        This does not assume the sum of the weights is 1.'''
    l, cdf = zip(*[(l, w) for l, w in content.iteritems()])
    cdf = list(cdf)
    for i in xrange(1, len(cdf)):
        cdf[i] += cdf[i - 1]
    return ''.join([l[bisect(cdf, uniform(0, cdf[-1]))] for i in xrange(n)])
David Zaslavsky
A: 

Here is a rough idea of what might suit you

import random as r

def distributed_choice(probs):
    r= r.random()
    cum = 0.0

    for pair in probs:
     if (r < cum + pair[1]):
      return pair[0]      
     cum += pair[1]

The parameter probs takes a list of pairs of the form (object, probability). It is assumed that the sum of probabilities is 1 (otherwise, its trivial to normalize).

To use it just execute:

''.join([distributed_choice(probs)]*4)
Il-Bhima
+4  A: 

Here's the code to select a single weighted value. You should be able to take it from here. It uses bisect and random to accomplish the work.

from bisect import bisect
from random import random

def WeightedABCD(*weights):
  chars = 'ABCD'
  breakpoints = [sum(weights[:x+1]) for x in range(4)]
  return chars[bisect(breakpoints, random())]

Call it like this: WeightedABCD(.25, .34, .25, .25).

EDIT: Here is a version that works even if the weights don't add up to 1.0:

from bisect import bisect_left
from random import uniform

def WeightedABCD(*weights):
  chars = 'ABCD'
  breakpoints = [sum(weights[:x+1]) for x in range(4)]
  return chars[bisect_left(breakpoints, uniform(0.0,breakpoints[-1]))]
Pesto
That'll only work correctly if sum(weights)==1.0, no?
Anthony Towns
Yes, but I'm assuming that the weights /should/ equal 1.0. I can't perceive of a reason to do otherwise, but you could always normalize them to a 0.0 - 1.0 scale first if you must.
Pesto
Alternatively, you could change the last line to `return chars[bisect_left(breakpoints, uniform(0.0, breakpoints[-1]))]` and not worry about normalizing. Just remember to change the import statements.
Pesto
Your example, 0.25,0.34,0.25,0.25, didn't add up to 1.0 :)
Anthony Towns
I just copied over the original example from the question and didn't pay any attention to it. Still, the second version will handle non-normalized weights.
Pesto
A: 

Hmm, something like:

import random
class RandomDistribution:
    def __init__(self, kv):
        self.entries = kv.keys()
        self.where = []
        cnt = 0
        for x in self.entries:
            self.where.append(cnt)
            cnt += kv[x]
        self.where.append(cnt)   

    def find(self, key):
        l, r = 0, len(self.where)-1
        while l+1 < r:
           m = (l+r)/2
           if self.where[m] <= key:
               l=m
           else:
               r=m
        return self.entries[l]

    def randomselect(self):
        return self.find(random.random()*self.where[-1])

rd = RandomDistribution( {"foo": 5.5, "bar": 3.14, "baz": 2.8 } )
for x in range(1000):
    print rd.randomselect()

should get you most of the way...

Anthony Towns
A: 

Thank you all for your help!!! I was able to figure something out, mostly with this info. For my particular need, i did somthing like this:

import random
#Create a funtion to randomize a given string
def makerandom(seq):
    return ''.join(random.sample(seq, len(seq)))
def randomDNA(n, probA=0.25, probC=0.25, probG=0.25, probT=0.25):
    notrandom=''
    A=int(n*probA)
    C=int(n*probC)
    T=int(n*probT)
    G=int(n*probG)

#The remainder part here is used to make sure all n are used, as one cannot
#have half an A for example.
    remainder=''
    for i in range(0, n-(A+G+C+T)):
        ramainder+=random.choice("ATGC")
    notrandom=notrandom+ 'A'*A+ 'C'*C+ 'G'*G+ 'T'*T + remainder
    return makerandom(notrandom)
BigTree
I don't like this question but I like the idea of generating random DNA... what was your use for this?Thanks!
Manuel