I have a dataset of rows, each with an 'odds' value between 1 and 100, and I want to select a random row weighted by those odds, in the most efficient way possible. The odds do not necessarily add up to 100.

I have had a few ideas.

a) Select the whole dataset, add all the odds up, and generate a random number between 1 and that total. Then loop through the dataset, deducting each row's odds from the number until it reaches 0 or below; that row is the pick.

I was hoping to minimize the impact on the database so I considered if I could only select the rows I needed.

b)

SELECT * FROM table WHERE (100*RAND()) < odds

I considered adding LIMIT 0,1, but then if several items have the same probability only one of them will ever be returned.

Alternatively, take the whole matching set and pick a random row from it... but then the odds are affected, as it becomes a weighted random followed by an unweighted random, so the results become tilted in favour of the higher odds (even more so).

I guess I could ORDER BY odds ASC, take the whole dataset, and then with PHP pick a random row from among those with the same odds as the first record (the lowest).

That seems like a clumsy solution.

Does anyone have a superior solution, if not which one of the above is best?

A: 

I didn't try it, but maybe something like this (with ? a random number from 0 to SUM(odds) - 1)?

SET @prob := 0;

SELECT
  T.*,
  (@prob := @prob + T.odds) AS prob
FROM table T
HAVING prob > ?   -- MySQL won't accept the alias in WHERE, so use HAVING
LIMIT 1;

This is basically the same as your idea a), but entirely within one SQL command (well, technically two if you count the variable set-up).
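The running-sum logic can be sketched outside SQL as well; here is a hedged Python equivalent of approach a), with the row keys and weights invented purely for illustration:

```python
import random

def weighted_pick(rows):
    """Pick one (key, odds) pair with probability proportional to its odds."""
    total = sum(odds for _, odds in rows)
    r = random.randrange(total)     # the ? in the SQL: 0 .. SUM(odds) - 1
    cumulative = 0
    for key, odds in rows:
        cumulative += odds          # mirrors @prob := @prob + T.odds
        if cumulative > r:          # mirrors "prob > ?"
            return key

rows = [("X", 1), ("Y", 5)]
counts = {"X": 0, "Y": 0}
for _ in range(60000):
    counts[weighted_pick(rows)] += 1
# counts["Y"] should come out roughly five times counts["X"]
```

Row order does not matter here, only the cumulative totals, which is the same property Amadan points out in the comments below.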

Amadan
This would be weighted towards the higher values. For example if you have values 1 and 5, only rand 0 would equal 1, but rand 2 through 4 would be 5.
Marcus Adams
@Marcus: I do not understand you. If you have values of 1 and 5, `prob` would be 1 and 6, respectively; in case of random number being 0 (1 chance in 6), it would select the first row, and in case it is 1-5 (5 chances in 6), it would select the second. Conversely, if you have 5 and then 1, `prob` would be 5 and 6; thus random number 0-4 would select the first row (5 chances in 6) and 5 would select the second (1 chance in 6), just as desired. Where is the bias?
Amadan
@Amadan, I don't see where the OP says that they want weighted random selection. If one of the two values has 5:1 odds, it would be biased. You don't think so?
Marcus Adams
Ah, I misunderstood your complaint. Yes, obviously it is weighted - that's what I understood the question to mean. Specifically, try rereading the title ("random row but with odds") and his "a)" attempt at the solution (which is exactly what I've written, but in SQL).
Amadan
@Amadan, I could be wrong, but I think the OP was complaining that the solution wasn't valid because it was weighted.
Marcus Adams
+2  A: 

Do some up-front work, add some columns to your table that help the selection. For example suppose you have these rows

 X  2  
 Y  3
 Z  1

We add some cumulative values

 Key Odds Start  End 
 X    2     0     1      // range 0->1, 2 values == odds
 Y    3     2     4      // range 2->4, 3 values == odds
 Z    1     5     5      // range 5->5, 1 value == odds

Start and End are chosen as follows. The first row has a start of zero. Subsequent rows have a start one more than previous end. End is the (Start + Odds - 1).

Now pick a random number R in the range 0 to Max(End)

Select * from T where R >= T.Start and R <= T.End

If the database is sufficiently clever we may well be able to use

 Select * from T where R >= T.Start and R <= (T.Start + T.Odds - 1)

I'm speculating that having an End column with an index may give better performance. Also the Max(End) could perhaps be stashed somewhere and updated by a trigger when necessary.

Clearly there's some hassle in updating the Start/End. This may not be too bad if either

  • The table contents are stable
  • or insertions are in some way naturally ordered, so that each new row simply continues from the previous highest End.
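As a sanity check, the Start/End bookkeeping above can be sketched in Python; the binary search on End stands in for an indexed lookup, and the function names are mine:

```python
import bisect
import random

def build_ranges(rows):
    """Assign Start/End so each key owns exactly `odds` consecutive values."""
    start, out = 0, []
    for key, odds in rows:
        out.append((key, odds, start, start + odds - 1))
        start += odds   # next Start is one more than this End
    return out

def pick(ranges):
    """Pick R in 0..Max(End), then find the row whose [Start, End] contains R."""
    r = random.randint(0, ranges[-1][3])
    ends = [end for _, _, _, end in ranges]
    return ranges[bisect.bisect_left(ends, r)][0]

ranges = build_ranges([("X", 2), ("Y", 3), ("Z", 1)])
# ranges: X -> 0..1, Y -> 2..4, Z -> 5..5, matching the table above
```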
djna
This is an interesting solution. It could be improved by using 'between' rather than '>= and <='.
Pablo
This also requires updating multiple records if one of the rows is deleted. I don't think "inserts" are an issue, as the order doesn't matter. This yields weighted results. Is that what the OP wanted?
Marcus Adams
Yes, weighted results is what I am looking for.
Pablo
A: 

If you have an index on the odds column, and a primary key, this would be very efficient:

SELECT id, odds FROM table WHERE odds > 0

The database wouldn't even have to read from the table, it would get everything it needed from the odds index.

Then, you'll select a random value between 1 and the number of rows returned.

Then select that row from the array of rows returned.

Then, finally, select the whole target row:

SELECT * FROM table WHERE id = ?

This ensures an even distribution among all rows with a positive odds value.
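A minimal end-to-end sketch of those three steps, using an in-memory SQLite table as a stand-in (the schema and column names are assumptions, not the OP's actual table):

```python
import random
import sqlite3

# Stand-in for the real table; in MySQL step 1 could be index-only on `odds`
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, odds INTEGER, payload TEXT)")
con.executemany("INSERT INTO items VALUES (?, ?, ?)",
                [(1, 10, "a"), (2, 0, "b"), (3, 7, "c")])

# Step 1: fetch candidate ids only
ids = [row[0] for row in con.execute("SELECT id FROM items WHERE odds > 0")]

# Step 2: uniform pick among the candidates, in application code
chosen = ids[random.randrange(len(ids))]

# Step 3: fetch the full target row by primary key
row = con.execute("SELECT * FROM items WHERE id = ?", (chosen,)).fetchone()
```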


Alternatively, put the odds in a different table, with an autoincrement primary key.

Odds
ID     odds
1      4
2      9
3      56
4      12

Store the ID foreign key in the main table instead of the odds value, and index it.

First, get the max value. This never touches the table data; it is satisfied entirely from the primary key index:

SELECT MAX(ID) FROM Odds

Get a random value between 1 and the max.

Then select the record.

SELECT * FROM table
JOIN Odds ON Odds.ID = table.ID
WHERE Odds.ID >= ?
ORDER BY Odds.ID
LIMIT 1

This will require some maintenance, if you tend to delete Odds rows or roll back inserts, to keep the distribution even.

There is a whole chapter on random selection in the book SQL Antipatterns.

Marcus Adams
A: 

What if you took your code, and added an ORDER BY RAND() and LIMIT 1?

SELECT * FROM table WHERE (100*RAND()) < odds ORDER BY RAND() LIMIT 1

This way, even if you have multiples of the same probability, it will always come back randomly ordered, then you just take the first entry.

Austin Hyde
As the number of rows grow, the cost of this query will grow.
Marcus Adams
This is also random weighted + random, which alters the weights. If the WHERE RAND() came out low, say 0.001, the item with the lowest weight would be returned along with all the other records, and it would then have a 1 in (total records) chance of being picked. Whereas if the WHERE RAND() came out at 0.9 it would return only a few records, so the weights become tilted even more towards the higher weights.
Pablo
@Pablo, I agree. The WHERE clause should be dropped.
Marcus Adams
No, I am looking for weighted, but this is weighted + random. Which skews the weights.
Pablo
A: 
select * from table 
where id between 1 and 100 and ((id % 2) <> 0) 
order by NewId() 
Muhammad Kashif Nadeem
A: 

Hmm. Not entirely clear what result you want, so bear with me if this is a bit crazy. That being said, how about:

Make a new table. The table is a fixed data table, and looks like this:

Odds
====
   1
   2
   2
   3
   3
   3
   4
   4
   4
   4
etc, 
etc.

Then join from your dataset to that table on the odds column. You'll get as many rows back for each row in your table as the given odds of that row.

Then just pick one of that set at random.
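The effect of that join can be emulated in Python: each key is repeated `odds` times, as the join would produce, and then one copy is chosen uniformly (names are illustrative):

```python
import random

def expand_and_pick(rows):
    """Repeat each key `odds` times (the join's output), then pick uniformly."""
    expanded = [key for key, odds in rows for _ in range(odds)]
    return random.choice(expanded)

# [("X", 2), ("Y", 3)] expands to ["X", "X", "Y", "Y", "Y"]
```

Note the trade-off: the expanded set grows with the sum of the odds, so this buys simplicity at the cost of result-set size.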

Matt Gibson
A: 

A general solution, suitable for O(log(n)) updates, is something like this:

  • Store objects as leaves of a (balanced) tree.
  • At each branch node, store the weights of all objects under it.
  • When adding, removing, or modifying nodes, update weights of their parents.

Then pick a number between 0 and (total weight - 1) and navigate down the tree until you find the right object.

Since you don't care about the order of things in the tree, you can store them as an array of N pointers and N-1 numbers.
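One concrete realization of this idea (my choice of structure, not necessarily what the answer had in mind) is a Fenwick tree, which stores cumulative weights implicitly and gives O(log n) updates and weighted picks:

```python
import random

class WeightedSampler:
    """Fenwick (binary indexed) tree over weights: O(log n) update and pick."""

    def __init__(self, n):
        self.n = n
        self.tree = [0] * (n + 1)   # 1-based internal array

    def update(self, i, delta):
        """Add delta to the weight of object i (0-based)."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def total(self):
        i, s = self.n, 0
        while i > 0:
            s += self.tree[i]
            i -= i & -i
        return s

    def pick(self):
        """Pick r in 0..total-1, then descend to the object whose
        cumulative-weight range contains r."""
        r = random.randrange(self.total())
        pos, bit = 0, 1
        while bit * 2 <= self.n:
            bit *= 2
        while bit:
            nxt = pos + bit
            if nxt <= self.n and self.tree[nxt] <= r:
                r -= self.tree[nxt]
                pos = nxt
            bit //= 2
        return pos   # 0-based index of the chosen object

s = WeightedSampler(4)
for i, w in enumerate([1, 0, 5, 2]):
    s.update(i, w)
```

An object with weight 0 is never returned, and changing any weight is a single `update` call, which is what makes this suitable when rows are added, removed, or reweighted frequently.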

tc.