views:

295

answers:

10

I would like to generate a random string (or a series of random strings, repetitions allowed) of length between 1 and n characters from some (finite) alphabet. Each string should be equally likely (in other words, the strings should be uniformly distributed).

The uniformity requirement means that an algorithm like this doesn't work:

alphabet = "abcdefghijklmnopqrstuvwxyz"
len = rand(1, n)
s = ""
for(i = 0; i < len; ++i)
    s = s + alphabet[rand(0, 25)]

(pseudo code, rand(a, b) returns a integer between a and b, inclusively, each integer equally likely)

This algorithm generates strings with uniformly distributed lengths, but the actual distribution should be weighted toward longer strings (there are 26 times as many strings with length 2 as there are with length 1, and so on.) How can I achieve this?

A: 

May be it makes sense to use boost::uuid?

Dmitriy
Did you read the comments, or the question ?
Romain Hippeau
+4  A: 

Instead of picking a length with uniform distribution, weight it according to how many strings are a given length. If your alphabet is size m, there are mx strings of size x, and (1-mn+1)/(1-m) strings of length n or less. The probability of choosing a string of length x should be mx*(1-m)/(1-mn+1).

Edit:

Regarding overflow - using floating point instead of integers will expand the range, so for a 26-character alphabet and single-precision floats, direct weight calculation shouldn't overflow for n<26.

A more robust approach is to deal with it iteratively. This should also minimize the effects of underflow:

int randomLength() {
  for(int i = n; i > 0; i--) {
    double d = Math.random();
    if(d > (m - 1) / (m - Math.pow(m, -i))) {
      return i;
    }
  }
  return 0;
}

To make this more efficient by calculating fewer random numbers, we can reuse them by splitting intervals in more than one place:

int randomLength() {
  for(int i = n; i > 0; i -= 5) {
    double d = Math.random();
    double c = (m - 1) / (m - Math.pow(m, -i))
    for(int j = 0; j < 5; j++) {
      if(d > c) {
        return i - j;
      }
      c /= m;
    }
  }
  for(int i = n % 0; i > 0; i--) {
    double d = Math.random();
    if(d > (m - 1) / (m - Math.pow(m, -i))) {
      return i;
    }
  }
  return 0;
}
Adam Crume
And how do you propose to do that without overflows?
Moron
Better, but imprecise. Overflows can be dealt with multiple random number calls and should be exact. Of course, using floats/doubles reduces the number of calls and could be useful if precision is not an issue. Besides, the random number generators themselves are imprecise, so cannot find fault with this.
Moron
Another way to deal with the exponentials would be to decompose it into multiplication of logarithms and then exping the final result to get the answer.
arnsholt
This is basically a probability density function. The problem is that with longer max lengths, the probability of short strings probably (heh) becomes so low that it rounds to 0 in floating point math, meaning it will never generate those strings at all.
Nick Johnson
+9  A: 

What you need to do is generate your length and then your string as two distinct steps. You will need to first chose the length using a weighted approach. You can calculate the number of strings of a given length l for an alphabet of k symbols as k^l. Sum those up and then you have the total number of strings of any length, your first step is to generate a random number between 1 and that value and then bin it accordingly. Modulo off by one errors you would break at 26, 26^2, 26^3, 26^4 and so on. The logarithm based on the number of symbols would be useful for this task.

Once you have you length then you can generate the string as you have above.

Ukko
+4  A: 

Okay, there are 26 possibilities for a 1-character string, 262 for a 2-character string, and so on up to 2626 possibilities for a 26-character string.

That means there are 26 times as many possibilities for an (N)-character string than there are for an (N-1)-character string. You can use that fact to select your length:

def getlen(maxlen):
    sz = maxlen
    while sz != 1:
        if rnd(27) != 1:
            return sz
        sz--;
    return 1

I use 27 in the above code since the total sample space for selecting strings from "ab" is the 26 1-character possibilities and the 262 2-character possibilities. In other words, the ratio is 1:26 so 1-character has a probability of 1/27 (rather than 1/26 as I first answered).

This solution isn't perfect since you're calling rnd multiple times and it would be better to call it once with an possible range of 26N+26N-1+261 and select the length based on where your returned number falls within there but it may be difficult to find a random number generator that'll work on numbers that large (10 characters gives you a possible range of 2610+...+261 which, unless I've done the math wrong, is 146,813,779,479,510).

If you can limit the maximum size so that your rnd function will work in the range, something like this should be workable:

def getlen(chars,maxlen):
    assert maxlen >= 1
    range = chars
    sampspace = 0
    for i in 1 .. maxlen:
        sampspace = sampspace + range
        range = range * chars
    range = range / chars
    val = rnd(sampspace)
    sz = maxlen
    while val < sampspace - range:
        sampspace = sampspace - range
        range = range / chars
        sz = sz - 1
    return sz

Once you have the length, I would then use your current algorithm to choose the actual characters to populate the string.


Explaining it further:

Let's say our alphabet only consists of "ab". The possible sets up to length 3 are [ab] (2), [ab][ab] (4) and [ab][ab][ab] (8). So there is a 8/14 chance of getting a length of 3, 4/14 of length 2 and 2/14 of length 1.

The 14 is the magic figure: it's the sum of all 2n for n = 1 to the maximum length. So, testing that pseudo-code above with chars = 2 and maxlen = 3:

    assert maxlen >= 1 [okay]
    range = chars [2]
    sampspace = 0
    for i in 1 .. 3:
        i = 1:
            sampspace = sampspace + range [0 + 2 = 2]
            range = range * chars [2 * 2 = 4]
        i = 2:
            sampspace = sampspace + range [2 + 4 = 6]
            range = range * chars [4 * 2 = 8]
        i = 3:
            sampspace = sampspace + range [6 + 8 = 14]
            range = range * chars [8 * 2 = 16]
    range = range / chars [16 / 2 = 8]
    val = rnd(sampspace) [number from 0 to 13 inclusive]
    sz = maxlen [3]
    while val < sampspace - range: [see below]
        sampspace = sampspace - range
        range = range / chars
        sz = sz - 1
    return sz

So, from that code, the first iteration of the final loop will exit with sz = 3 if val is greater than or equal to sampspace - range [14 - 8 = 6]. In other words, for the values 6 through 13 inclusive, 8 of the 14 possibilities.

Otherwise, sampspace becomes sampspace - range [14 - 8 = 6] and range becomes range / chars [8 / 2 = 4].

Then the second iteration of the final loop will exit with sz = 2 if val is greater than or equal to sampspace - range [6 - 4 = 2]. In other words, for the values 2 through 5 inclusive, 4 of the 14 possibilities.

Otherwise, sampspace becomes sampspace - range [6 - 4 = 2] and range becomes range / chars [4 / 2 = 2].

Then the third iteration of the final loop will exit with sz = 1 if val is greater than or equal to sampspace - range [2 - 2 = 0]. In other words, for the values 0 through 1 inclusive, 2 of the 14 possibilities (this iteration will always exit since the value must be greater than or equal to zero.


In retrospect, that second solution is a bit of a nightmare. In my personal opinion, I'd go for the first solution for its simplicity and to avoid the possibility of rather large numbers.

paxdiablo
@paxdiablo: Did you mean 26 to power of maxlen? This does not look right. Chances of getting length = n = maxlen are (26^n - 26^(n-1))/26^n with this, which is incorrect. Shouldn't it be 26^n/(26 + 26^2 + ... + 26^n)?
Moron
Yeah, like I'm going to listen to someone called @Moron :-) No, seriously, you're right, I did mean 26 rather than 2. Obviously I've been in the IT industry too long. Thanks for pointing that out, fixed now.
paxdiablo
@paxdiablo: :-) What I also meant was that even with 26 there it seems wrong, as the probability for getting maxlen returned seems to be different from what we need. Of course I might have miscalculated.
Moron
I think it's right, @Moron. In the first iteration, you have a 25/26 chance of getting a size of maxlen (you only move to the next iteration on the 1/26 chance). Although having said that, I think it should be 27, same as the first solution). I'll try to explain better in an edit.
paxdiablo
And, in trying to explain it better, I find you're actually right and I'm wrong. "Who's the moron now?", I hear my wife chortle from the other side of the room :-) Anyway, it's fixed now but I'd still use the first option since it's much simpler.
paxdiablo
+5  A: 

Building on my comment posted as a reply to the OP:

I'd consider it an exercise in base conversion. You're simply generating a "random number" in "base 26", where a=0 and z=25. For a random string of length n, generate a number between 1 and 26^n. Convert from base 10 to base 26, using symbols from your chosen alphabet.

Here's a PHP implementation. I won't guaranty that there isn't an off-by-one error or two in here, but any such error should be minor:

<?php
$n = 5;

var_dump(randstr($n));

function randstr($maxlen) {
        $dict = 'abcdefghijklmnopqrstuvwxyz';
        $rand = rand(0, pow(strlen($dict), $maxlen));
        $str = base_convert($rand, 10, 26);
        //base convert returns base 26 using 0-9 and 15 letters a-p(?)
        //we must convert those to our own set of symbols
        return strtr($str, '1234567890abcdefghijklmnopqrstuvwxyz', $dict);
}
Frank Farmer
For a 32-bit signed integer and a 36-character alphabet, that'll overflow for n>5.
Adam Crume
Actually, it works up to values of 13 for `$n` for me. Beyond that it just produces `'j'` as output. :D
deceze
Large integers can be dealt with multiple random number calls. This approach will allow us to use O(logn) random number calls instead of Omega(n).
Moron
This is easily the best approach, IMHO. Calculating distributions and probability densities is massively overcomplicating the situation. Generate a random number between 0..26^n, convert to base-26. Anything else is overkill in my opinion.
Cowan
A: 
// Note space as an available char
alphabet = "abcdefghijklmnopqrstuvwxyz "

result_string = ""

for( ;; )
{
    s = ""

    for( i = 0; i < n; i++ )
        s += alphabet[rand(0, 26)]

    first_space = n;

    for( i = 0; i < n; i++ )
        if( s[ i ] == ' ' )
        {
            first_space = i;
            break;
        }

    ok = true;

    // Reject "duplicate" shorter strings
    for( i = first_space + 1; i < n; i++ )
        if( s[ i ] != ' ' )
        {
            ok = false;
            break;
        }

    if( !ok )
        continue;

    // Extract the short version of the string
    for( i = 0; i < first_space; i++ )
        result_string += s[ i ];

    break;
}

Edit: I forgot to disallow 0-length strings, that will take a bit more code which I don't have time to add now.

Edit: After considering how my answer doesn't scale to large n (takes too long to get lucky and find an accepted string), I like paxdiablo's answer much better. Less code too.

Conrad Albrecht
+1  A: 

Personally I'd do it like this:

Let's say your alphabet has Z characters. Then the number of possible strings for each length L is:

L | Z
--------------------------
1 | 26
2 | 676 (= 26 * 26)
3 | 17576 (= 26 * 26 * 26)

...and so on.

Now let's say your maximum desired length is N. Then the total number of possible strings from length 1 to N that your function could generate would be the sum of a geometric sequence:

(1 - (Z ^ (N + 1))) / (1 - Z) 

Let's call this value S. Then the probability of generating a string of any length L should be:

(Z ^ L) / S

OK, fine. This is all well and good; but how do we generate a random number given a non-uniform probability distribution?

The short answer is: you don't. Get a library to do that for you. I develop mainly in .NET, so one I might turn to would be Math.NET.

That said, it's really not so hard to come up with a rudimentary approach to doing this on your own.

Here's one way: take a generator that gives you a random value within a known uniform distribution, and assign ranges within that distribution of sizes dependent on your desired distribution. Then interpret the random value provided by the generator by determining which range it falls into.

Here's an example in C# of one way you could implement this idea (scroll to the bottom for example output):

RandomStringGenerator class

public class RandomStringGenerator
{
    private readonly Random _random;
    private readonly char[] _alphabet;

    public RandomStringGenerator(string alphabet)
    {
        if (string.IsNullOrEmpty(alphabet))
            throw new ArgumentException("alphabet");

        _random = new Random();
        _alphabet = alphabet.Distinct().ToArray();
    }

    public string NextString(int maxLength)
    {
        // Get a value randomly distributed between 0.0 and 1.0 --
        // this is approximately what the System.Random class provides.
        double value = _random.NextDouble();

        // This is where the magic happens: we "translate" the above number
        // to a length based on our computed probability distribution for the given
        // alphabet and the desired maximum string length.
        int length = GetLengthFromRandomValue(value, _alphabet.Length, maxLength);

        // The rest is easy: allocate a char array of the length determined above...
        char[] chars = new char[length];

        // ...populate it with a bunch of random values from the alphabet...
        for (int i = 0; i < length; ++i)
        {
            chars[i] = _alphabet[_random.Next(0, _alphabet.Length)];
        }

        // ...and return a newly constructed string.
        return new string(chars);
    }

    static int GetLengthFromRandomValue(double value, int alphabetSize, int maxLength)
    {
        // Looping really might not be the smartest way to do this,
        // but it's the most obvious way that immediately springs to my mind.
        for (int length = 1; length <= maxLength; ++length)
        {
            Range r = GetRangeForLength(length, alphabetSize, maxLength);
            if (r.Contains(value))
                return length;
        }

        return maxLength;
    }

    static Range GetRangeForLength(int length, int alphabetSize, int maxLength)
    {
        int L = length;
        int Z = alphabetSize;
        int N = maxLength;

        double possibleStrings = (1 - (Math.Pow(Z, N + 1)) / (1 - Z));
        double stringsOfGivenLength = Math.Pow(Z, L);
        double possibleSmallerStrings = (1 - Math.Pow(Z, L)) / (1 - Z);

        double probabilityOfGivenLength = ((double)stringsOfGivenLength / possibleStrings);
        double probabilityOfShorterLength = ((double)possibleSmallerStrings / possibleStrings);

        double startPoint = probabilityOfShorterLength;
        double endPoint = probabilityOfShorterLength + probabilityOfGivenLength;

        return new Range(startPoint, endPoint);
    }
}

Range struct

public struct Range
{
    public readonly double StartPoint;
    public readonly double EndPoint;

    public Range(double startPoint, double endPoint)
        : this()
    {
        this.StartPoint = startPoint;
        this.EndPoint = endPoint;
    }

    public bool Contains(double value)
    {
        return this.StartPoint <= value && value <= this.EndPoint;
    }
}

Test

static void Main(string[] args)
{
    const int N = 5;
    const string alphabet = "acegikmoqstvwy";
    int Z = alphabet.Length;

    var rand = new RandomStringGenerator(alphabet);

    var strings = new List<string>();
    for (int i = 0; i < 100000; ++i)
    {
        strings.Add(rand.NextString(N));
    }

    Console.WriteLine("First 10 results:");
    for (int i = 0; i < 10; ++i)
    {
        Console.WriteLine(strings[i]);
    }

    // sanity check
    double sumOfProbabilities = 0.0;

    for (int i = 1; i <= N; ++i)
    {
        double probability = Math.Pow(Z, i) / ((1 - (Math.Pow(Z, N + 1))) / (1 - Z));
        int numStrings = strings.Count(str => str.Length == i);

        Console.WriteLine("# strings of length {0}: {1} (probability = {2:0.00%})", i, numStrings, probability);

        sumOfProbabilities += probability;
    }

    Console.WriteLine("Probabilities sum to {0:0.00%}.", sumOfProbabilities);

    Console.ReadLine();
}

Output:

First 10 results:
wmkyw
qqowc
ackai
tokmo
eeiyw
cakgg
vceec
qwqyq
aiomt
qkyav
# strings of length 1: 1 (probability = 0.00%)
# strings of length 2: 38 (probability = 0.03%)
# strings of length 3: 475 (probability = 0.47%)
# strings of length 4: 6633 (probability = 6.63%)
# strings of length 5: 92853 (probability = 92.86%)
Probabilities sum to 100.00%.
Dan Tao
A: 

My idea regarding this is like:

you have 1-n length string.there 26 possible 1 length string,26*26 2 length string and so on. you can find out the percentage of each length string of the total possible strings.for example percentage of single length string is like

((26/(TOTAL_POSSIBLE_STRINGS_OF_ALL_LENGTH))*100).

similarly you can find out the percentage of other length strings. Mark them on a number line between 1 to 100.ie suppose percentage of single length string is 3 and double length string is 6 then number line single length string lies between 0-3 while double length string lies between 3-9 and so on. Now take a random number between 1 to 100.find out the range in which this number lies.I mean suppose for examplethe number you have randomly chosen is 2.Now this number lies between 0-3 so go 1 length string or if the random number chosen is 7 then go for double length string.

In this fashion you can see that length of each string choosen will be proportional to the percentage of the total number of that length string contribute to the all possible strings.

Hope I am clear. Disclaimer: I have not gone through above solution except one or two.So if it matches with some one solution it will be purely a chance. Also,I will welcome all the advice and positive criticism and correct me if I am wrong.

Thanks and regard Mawia

mawia
+2  A: 

Edit: This answer isn't quite right. See the bottom for a disproof. I'll leave it up for now in the hope someone can come up with a variant that fixes it.

It's possible to do this without calculating the length separately - which, as others have pointed out, requires raising a number to a large power, and generally seems like a messy solution to me.

Proving that this is correct is a little tough, and I'm not sure I trust my expository powers to make it clear, but bear with me. For the purposes of the explanation, we're generating strings of length at most n from an alphabet a of |a| characters.

First, imagine you have a maximum length of n, and you've already decided you're generating a string of at least length n-1. It should be obvious that there are |a|+1 equally likely possibilities: we can generate any of the |a| characters from the alphabet, or we can choose to terminate with n-1 characters. To decide, we simply pick a random number x between 0 and |a| (inclusive); if x is |a|, we terminate at n-1 characters; otherwise, we append the xth character of a to the string. Here's a simple implementation of this procedure in Python:

def pick_character(alphabet):
  x = random.randrange(len(alphabet) + 1)
  if x == len(alphabet):
    return ''
  else:
    return alphabet[x]

Now, we can apply this recursively. To generate the kth character of the string, we first attempt to generate the characters after k. If our recursive invocation returns anything, then we know the string should be at least length k, and we generate a character of our own from the alphabet and return it. If, however, the recursive invocation returns nothing, we know the string is no longer than k, and we use the above routine to select either the final character or no character. Here's an implementation of this in Python:

def uniform_random_string(alphabet, max_len):
  if max_len == 1:
    return pick_character(alphabet)
  suffix = uniform_random_string(alphabet, max_len - 1)
  if suffix:
    # String contains characters after ours
    return random.choice(alphabet) + suffix
  else:
    # String contains no characters after our own
    return pick_character(alphabet)

If you doubt the uniformity of this function, you can attempt to disprove it: suggest a string for which there are two distinct ways to generate it, or none. If there are no such strings - and alas, I do not have a robust proof of this fact, though I'm fairly certain it's true - and given that the individual selections are uniform, then the result must also select any string with uniform probability.

As promised, and unlike every other solution posted thus far, no raising of numbers to large powers is required; no arbitrary length integers or floating point numbers are needed to store the result, and the validity, at least to my eyes, is fairly easy to demonstrate. It's also shorter than any fully-specified solution thus far. ;)

If anyone wants to chip in with a robust proof of the function's uniformity, I'd be extremely grateful.

Edit: Disproof, provided by a friend:

dato: so imagine alphabet = 'abc' and n = 2
dato: you have 9 strings of length 2, 3 of length 1, 1 of length 0
dato: that's 13 in total
dato: so probability of getting a length 2 string should be 9/13
dato: and probability of getting a length 1 or a length 0 should be 4/13
dato: now if you call uniform_random_string('abc', 2)
dato: that transforms itself into a call to uniform_random_string('abc', 1)
dato: which is an uniform distribution over ['a', 'b', 'c', '']
dato: the first three of those yield all the 2 length strings
dato: and the latter produce all the 1 length strings and the empty strings
dato: but 0.75 > 9/13
dato: and 0.25 < 4/13
Nick Johnson
A: 

Matthieu: Your idea doesn't work because strings with blanks are still more likely to be generated. In your case, with n=4, you could have the string 'ab' generated as 'a' + 'b' + '' + '' or '' + 'a' + 'b' + '', or other combinations. Thus not all the strings have the same chance of appearing.

Isaac
Exact. Thanks for pointing this out.
Matthieu M.