+6  A: 

This sounds like roulette wheel selection, mainly used for the selection process in genetic/evolutionary algorithms.

Look at http://stackoverflow.com/questions/177271/roulette-selection-in-genetic-algorithms

seb
Yeah, this is exactly what algorithm is required. You're not going to get quicker than O(n) complexity for sure.
Noldorin
Ok. They're just using iterative search, which requires O(n log m) to select each, and a total work of O(n log m + pn log m), just like my algorithm 2. Thanks!
rampion
A: 

You could create the target array, then loop through the words determining the probability that it should be picked, and replace the words in the array according to a random number.

For the first word the probability would be f0/m0 (where mn=f0+..+fn), i.e. 100%, so all positions in the target array would be filled with w0.

For the following words the probability falls, and when you reach the last word the target array is filled with randomly picked words accoding to the frequency.

Example code in C#:

public class WordFrequency {

 public string Word { get; private set; }
 public int Frequency { get; private set; }

 public WordFrequency(string word, int frequency) {
  Word = word;
  Frequency = frequency;
 }

}

WordFrequency[] words = new WordFrequency[] {
 new WordFrequency("Hero", 80),
 new WordFrequency("Monkey", 4),
 new WordFrequency("Shoe", 13),
 new WordFrequency("Highway", 3),
};

int p = 7;
string[] result = new string[p];
int sum = 0;
Random rnd = new Random();
foreach (WordFrequency wf in words) {
 sum += wf.Frequency;
 for (int i = 0; i < p; i++) {
  if (rnd.Next(sum) < wf.Frequency) {
   result[i] = wf.Word;
  }
 }
}
Guffa
Right. This is exactly algorithm 2.
rampion
Is that what you meant? I was thrown off by the O() calculation. The frequency values are irrelevant for how much work there is, so the m has no business in the O() value. It should simply be O(np).
Guffa
No, the frequency values matter - it takes O(log m) bits to store a frequency, and O(log m) work to add two frequencies or compare two. Usually this is just swallowed by a constant term when log m < 64 (you store it in a 64 bit int), but for larger numbers, it can matter.
rampion
If you want that kind of complexity, then you have to consider the data size for every operation... Looping through the pairs is not an O(n) operation, but an O(n log n) operation... Creating an array is not an O(p) operation, but an O(p log p) operation...
Guffa
good point. I'll adjust my complexity descriptions accordingly.
rampion
A: 
rampion
Better implementation at http://gist.github.com/112858
rampion