views:

164

answers:

4

Hello, I want to create a table, with each row containing some sort of weight. Then I want to select random values with the probability equal to (weight of that row)/(weight of all rows). For example, having 5 rows with weights 1,2,3,4,5 out of 1000 I'd get approximately 1/15*1000=67 times first row and so on.

The table is to be filled manually. Then I'll take a random value from it. But I want to have an ability to change the probabilities on the filling stage.

+1  A: 

The easiest (and maybe best/safest?) way to do this is to add those rows to the table as many times as you want the weight to be - say I want "Tree" to be found 2x more often then "Dog" - I insert it 2 times into the table and I insert "Dog" once and just select elements at random one by one.

If the rows are complex/big then it would be best to create a separate table (weighted_Elements or something) in which you'll just have foreign keys to the real rows inserted as many times as the weights dictate.

RnR
+1 I also prefer to count in unary.
jleedev
It tends to lead to the most efficient approach too doesn't it? ;) (at least considering the current options) Especially when selections are more frequent then insertions/re-weights (which is probably a typical scenario). It's a single query instead of a linear loop through all the elements each time and it will help avoid a nasty complexity problem when trying to get X random elements with weights Y times a second... :)
RnR
Actually, if you want to get X random elements, you just have to sort by writing a clever comparison function (although I don't know how easy that is in SQL). http://code.google.com/p/quodlibet/source/browse/quodlibet/quodlibet/browsers/search.py#117
jleedev
But this one is definitely more natural in SQL and more efficient for this problem.
jleedev
+2  A: 

I found this nice little algorithm in Quod Libet. You could probably translate it to some procedural SQL.

function WeightedShuffle(list of items with weights):
  max_score ← the sum of every item’s weight
  choice ← random number in the range [0, max_score)
  current ← 0
  for each item (i, weight) in items:  
    current ← current + weight  
    if current ≥ choice or i is the last item:  
      return item i
jleedev
+1 for where you found it, I just love Quod Libet.
AttishOculus
+1  A: 

The best possible scenario (if i understand your question properly) is to setup your table as you normally would and then add two columns both INT's.

  • Column 1: Weight - This column would hold your weight value going from -X to +X, X being the highest value you want to have as a weight (IE: X=100, -100 to 100). This value is populated to give the row an actual weight and increase or decrease the probability of it coming up.

  • Column 2: *Count** - This column would hold the count of how many times this row has come up, this column is needed only if you want to use fair weighting. Fair weighting prevents one row from always showing up. (IE: if you have one row weighted at 100 and another at 2 the row with 100 will always show up, this column will allow weight 2 to be more 'valueable' as you get more weight 100 results). This column should be incremented by 1 each time a row result is pulled but you can make the logic more advanced later so it adds the weight etc.

  • Logic: - Its really simple now, your query simply has to request all rows as you normally would then make an extra select that (you can change the logic here to whatever you want) takes the weights and subtracts the count and order by that column.

The end result should be a table where you will get your weights appearing more often until a certain point where the system will evenly distribute itself out (leave out column 2) and you will have a system that will always return the same weighted order unless you offset the base of the query (IE: LIMIT [RANDOM NUMBER], [NUMBER OF ROWS TO RETURN])

Zyris Development Team
+1  A: 

I'm not an expert in probability theory, but assuming you have a column called WEIGHT, how about

select FIELD_1, ... FIELD_N, (rand() * WEIGHT) as SCORE
  from YOURTABLE
 order by SCORE
 limit 0, 10

This would give you 10 records, but you can change the limit clause, of course.

Tom Bartel
Downvote. It is not difficult to see that it fails to generate the right results with weights (1,2,2,2,2,2) (in particular, the chance of selecting the row with weight 1 is 1/32 if it happens to roll rand()=1, which is pretty unlikely anyway).
tc.