views:

387

answers:

5

How can I randomly generate letters according to their frequency of use in common speech?

Any pseudo-code appreciated, but an implementation in Java would be fantastic. Otherwise just a poke in the right direction would be helpful.

Note: I don't need to generate the frequencies of usage - I'm sure I can look that up easily enough.

+7  A: 

One quick way to do it would be to generate a list of letters, where each letter appeared in the list in accordance with its frequency. Say, if "e" was used 25.6% of the time, and your list had length 1000, it would have 256 "e"s.

Then you could just randomly pick spots from the list by using (int) (Math.random() * 1000) to generate random numbers between 0 and 999.

danben
+1 for the same solution I was going to suggest.
Michael Krauklis
The best way to match the letter frequency of a particular text :-)
phkahler
+1 It's a good suggestion, but not ideal if you have characters that occur with very small frequencies (e.g. 0.00001 or less). I guess it depends on what you need.
Mark Byers
This has obvious precision limits, but might be preferable because it is so simple to implement and understand.
Schamp
+8  A: 

I am assuming that you store the frequencies as floating point numbers between 0 and 1 that total to make 1.

First you should prepare a table of cumulative frequencies, i.e. the sum of the frequency of that letter and all letters before it.

To simplify, if you start with this frequency distribution:

A  0.1
B  0.3
C  0.4
D  0.2

Your cumulative frequency table would be:

A  0.1
B  0.4 (= 0.1 + 0.3)
C  0.8 (= 0.1 + 0.3 + 0.4)
D  1.0 (= 0.1 + 0.3 + 0.4 + 0.2)

Now generate a random number between 0 and 1 and see where in this list that number lies. Choose the letter that has the smallest cumulative frequency larger than your random number. Some examples:

Say you randomly pick 0.612. This lies between 0.4 and 0.8, i.e. between B and C, so you'd choose C.

If your random number was 0.039, that comes before 0.1, i.e. before A, so choose A.

I hope that makes sense, otherwise feel free to ask for clarifications!

Mark Byers
+2  A: 

Not even a pseudo-code, but a possible approach is as follows:

Let p1, p2, ..., pk be the frequencies that you want to match.

  1. Calculate the cumulative frequencies: p1, p1+p2, p1+p2+p3, ... , 1
  2. Generate a random uniform (0,1) number x
  3. Check which interval of the cumulative frequencies x belongs to: if it is between, say, p1+..+pi and p1+...+pi+p(i+1), then output the (i+1)st letter

Depending on how you implement the interval-finding, the procedure is usually more efficient if the p1,p2,... are sorted in decreasing order, because you will usually find the interval containing x sooner.

Aniko
+2  A: 

What I would do is scale the relative frequencies as floating point numbers such that their sum is 1.0. Then I would create an array of the cumulative totals per letter, i.e. the number that must be topped to get that letter and all those "below" it. Say the frequency of A is 10%, b is 2% and z is 1%; then your table would look something like this:

0.000 A ; from 0% to 10% gets you an A
0.100 B ; above 10% is at least a B
0.120 C ; 12% for C...
...
0.990 Z ; if your number is >= 99% then you get a Z

Then you generate yourself a random number between 0.0 and 1.0 and do a binary search in the array for the first number smaller than your random number. Then pick the letter at that position. Done.

Carl Smotricz
Curse you, Mark Byers! Better explained answer, 3 minutes faster than me! ;)
Carl Smotricz
+1 anyway for mentioning binary search. :)
Mark Byers
A: 

Using a binary tree gives you a nice, clean way to find the right entry. Here, you start with a frequency map, where the keys are the symbols (English letters), and the values are the frequency of their occurrence. This gets inverted, and a NavigableMap is created where the keys are cumulative probability, and the values are symbols. That makes the lookup easy.

  private final Random generator = new Random();

  private final NavigableMap<Float, Integer> table = 
    new TreeMap<Float, Integer>();

  private final float max;

  public Frequency(Map<Integer, Float> frequency)
  {
    float total = 0;
    for (Map.Entry<Integer, Float> e : frequency.entrySet()) {
      total += e.getValue();
      table.put(total, e.getKey());
    }
    max = total;
  }

  /** 
   * Choose a random symbol. The choices are weighted by frequency.
   */ 
  public int roll()
  {
    Float key = generator.nextFloat() * max;
    return table.higherEntry(key).getValue();
  }
erickson