I have a dataset of rows, each with an 'odds' value between 1 and 100, and I want to select a random row weighted by those odds, in the most efficient way possible. The odds do not necessarily add up to 100.

I have had a few ideas.

a) Select the whole dataset, add all the odds up, and generate a random number between 1 and that total. Then loop through the dataset, deducting each row's odds from the number until it reaches 0 or below; that row is the pick.

I was hoping to minimize the impact on the database so I considered if I could only select the rows I needed.

b)

SELECT * FROM table WHERE (100*RAND()) < odds

I considered adding LIMIT 0,1, but then if several items have the same probability only one of them will ever be returned.

Alternatively, take the whole matching set and pick a random row from it... but then the odds are affected, as it becomes a weighted random followed by an unweighted random, so the results become tilted in favour of the higher odds (even more so).

I guess I could ORDER BY odds ASC, take the whole dataset, and then with PHP pick a random row from among those with the same odds as the first record (the lowest).

That seems like a clumsy solution.

Does anyone have a superior solution, if not which one of the above is best?

A: 

I didn't try it, but maybe something like this (with ? a random number from 0 to SUM(odds) - 1)?

SET @prob := 0;

SELECT
  T.*,
  (@prob := @prob + T.odds) AS prob
FROM table T
HAVING prob > ?   -- MySQL won't accept the alias in WHERE, so use HAVING
LIMIT 1;

This is basically the same as your idea a), but entirely within one SQL command (well, technically two if you count the variable set-up).
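The running-sum logic can be sketched outside SQL as well; here is a hedged Python equivalent of approach a), with the row keys and weights invented purely for illustration:

```python
import random

def weighted_pick(rows):
    """Pick one (key, odds) pair with probability proportional to its odds."""
    total = sum(odds for _, odds in rows)
    r = random.randrange(total)     # the ? in the SQL: 0 .. SUM(odds) - 1
    cumulative = 0
    for key, odds in rows:
        cumulative += odds          # mirrors @prob := @prob + T.odds
        if cumulative > r:          # mirrors "prob > ?"
            return key

rows = [("X", 1), ("Y", 5)]
counts = {"X": 0, "Y": 0}
for _ in range(60000):
    counts[weighted_pick(rows)] += 1
# counts["Y"] should come out roughly five times counts["X"]
```

Row order does not matter here, only the cumulative totals, which is the same property Amadan points out in the comments below.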

Amadan
This would be weighted towards the higher values. For example if you have values 1 and 5, only rand 0 would equal 1, but rand 2 through 4 would be 5.
Marcus Adams
@Marcus: I do not understand you. If you have values of 1 and 5, `prob` would be 1 and 6, respectively; in case of random number being 0 (1 chance in 6), it would select the first row, and in case it is 1-5 (5 chances in 6), it would select the second. Conversely, if you have 5 and then 1, `prob` would be 5 and 6; thus random number 0-4 would select the first row (5 chances in 6) and 5 would select the second (1 chance in 6), just as desired. Where is the bias?
Amadan
@Amadan, I don't see where the OP says that they want weighted random selection. If one of the two values has 5:1 odds, it would be biased. You don't think so?
Marcus Adams
Ah, I misunderstood your complaint. Yes, obviously it is weighted - that's what I understood the question to mean. Specifically, try rereading the title ("random row but with odds") and his "a)" attempt at the solution (which is exactly what I've written, but in SQL).
Amadan
@Amadan, I could be wrong, but I think the OP was complaining that the solution wasn't valid because it was weighted.
Marcus Adams
+2  A: 

Do some up-front work, add some columns to your table that help the selection. For example suppose you have these rows

 X  2  
 Y  3
 Z  1

We add some cumulative values

 Key Odds Start  End 
 X    2     0     1      // range 0->1, 2 values == odds
 Y    3     2     4      // range 2->4, 3 values == odds
 Z    1     5     5      // range 5->5, 1 value == odds

Start and End are chosen as follows. The first row has a start of zero. Subsequent rows have a start one more than previous end. End is the (Start + Odds - 1).

Now pick a random number R in the range 0 to Max(End)

Select * from T where R >= T.Start and R <= T.End

If the database is sufficiently clever we may well be able to use

 Select * from T where R >= T.Start and R <= (T.Start + T.Odds - 1)

I'm speculating that having an End column with an index may give better performance. Also the Max(End) could perhaps be stashed somewhere and updated by a trigger when necessary.

Clearly there's some hassle in updating the Start/End. This may not be too bad if either

  • The table contents are stable
  • or insertions are in some way naturally ordered, so that each new row simply continues from the previous highest End.
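As a sanity check, the Start/End bookkeeping above can be sketched in Python; the binary search on End stands in for an indexed lookup, and the function names are mine:

```python
import bisect
import random

def build_ranges(rows):
    """Assign Start/End so each key owns exactly `odds` consecutive values."""
    start, out = 0, []
    for key, odds in rows:
        out.append((key, odds, start, start + odds - 1))
        start += odds   # next Start is one more than this End
    return out

def pick(ranges):
    """Pick R in 0..Max(End), then find the row whose [Start, End] contains R."""
    r = random.randint(0, ranges[-1][3])
    ends = [end for _, _, _, end in ranges]
    return ranges[bisect.bisect_left(ends, r)][0]

ranges = build_ranges([("X", 2), ("Y", 3), ("Z", 1)])
# ranges: X -> 0..1, Y -> 2..4, Z -> 5..5, matching the table above
```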
djna
This is an interesting solution. It could be improved by using 'between' rather than '>= and <='.
Pablo
This also requires updating multiple records if one of the rows is deleted. I don't think "inserts" are an issue, as the order doesn't matter. This yields weighted results. Is that what the OP wanted?
Marcus Adams
Yes, weighted results is what I am looking for.
Pablo
A: 

If you have an index on the odds column, and a primary key, this would be very efficient:

SELECT id, odds FROM table WHERE odds > 0

The database wouldn't even have to read from the table, it would get everything it needed from the odds index.

Then, you'll select a random value between 1 and the number of rows returned.

Then select that row from the array of rows returned.

Then, finally, select the whole target row:

SELECT * FROM table WHERE id = ?

This ensures an even distribution among all rows with a positive odds value.
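A minimal end-to-end sketch of those three steps, using an in-memory SQLite table as a stand-in (the schema and column names are assumptions, not the OP's actual table):

```python
import random
import sqlite3

# Stand-in for the real table; in MySQL step 1 could be index-only on `odds`
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, odds INTEGER, payload TEXT)")
con.executemany("INSERT INTO items VALUES (?, ?, ?)",
                [(1, 10, "a"), (2, 0, "b"), (3, 7, "c")])

# Step 1: fetch candidate ids only
ids = [row[0] for row in con.execute("SELECT id FROM items WHERE odds > 0")]

# Step 2: uniform pick among the candidates, in application code
chosen = ids[random.randrange(len(ids))]

# Step 3: fetch the full target row by primary key
row = con.execute("SELECT * FROM items WHERE id = ?", (chosen,)).fetchone()
```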


Alternatively, put the odds in a different table, with an autoincrement primary key.

Odds
ID     odds
1      4
2      9
3      56
4      12

Store the ID foreign key in the main table instead of the odds value, and index it.

First, get the max value. This never touches the table data; it is satisfied entirely from the primary key index:

SELECT MAX(ID) FROM Odds

Get a random value between 1 and the max.

Then select the record.

SELECT * FROM table
JOIN Odds ON Odds.ID = table.ID
WHERE Odds.ID >= ?
ORDER BY Odds.ID
LIMIT 1

This will require some maintenance, if you tend to delete Odds rows or roll back inserts, to keep the distribution even.

There is a whole chapter on random selection in the book SQL Antipatterns.

Marcus Adams
A: 

What if you took your code, and added an ORDER BY RAND() and LIMIT 1?

SELECT * FROM table WHERE (100*RAND()) < odds ORDER BY RAND() LIMIT 1

This way, even if you have multiples of the same probability, it will always come back randomly ordered, then you just take the first entry.

Austin Hyde
As the number of rows grow, the cost of this query will grow.
Marcus Adams
This is also random weighted + random, which alters the weights. If the WHERE RAND() came out low, say 0.001, the item with the lowest weight would be returned along with all the other records, and it would then have a 1 in (total records) chance of being picked. Whereas if the WHERE RAND() came out at 0.9 it would return only a few records, so the weights become tilted even more towards the higher weights.
Pablo
@Pablo, I agree. The WHERE clause should be dropped.
Marcus Adams
No, I am looking for weighted, but this is weighted + random. Which skews the weights.
Pablo
A: 
select * from table 
where id between 1 and 100 and ((id % 2) <> 0) 
order by NewId() 
Muhammad Kashif Nadeem
A: 

Hmm. Not entirely clear what result you want, so bear with me if this is a bit crazy. That being said, how about:

Make a new table. The table is a fixed data table, and looks like this:

Odds
====
   1
   2
   2
   3
   3
   3
   4
   4
   4
   4
etc, 
etc.

Then join from your dataset to that table on the odds column. You'll get as many rows back for each row in your table as the given odds of that row.

Then just pick one of that set at random.
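The effect of that join can be emulated in Python: each key is repeated `odds` times, as the join would produce, and then one copy is chosen uniformly (names are illustrative):

```python
import random

def expand_and_pick(rows):
    """Repeat each key `odds` times (the join's output), then pick uniformly."""
    expanded = [key for key, odds in rows for _ in range(odds)]
    return random.choice(expanded)

# [("X", 2), ("Y", 3)] expands to ["X", "X", "Y", "Y", "Y"]
```

Note the trade-off: the expanded set grows with the sum of the odds, so this buys simplicity at the cost of result-set size.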

Matt Gibson
A: 

A general solution, suitable for O(log(n)) updates, is something like this:

  • Store objects as leaves of a (balanced) tree.
  • At each branch node, store the weights of all objects under it.
  • When adding, removing, or modifying nodes, update weights of their parents.

Then pick a number between 0 and (total weight - 1) and navigate down the tree until you find the right object.

Since you don't care about the order of things in the tree, you can store them as an array of N pointers and N-1 numbers.
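One concrete realization of this idea (my choice of structure, not necessarily what the answer had in mind) is a Fenwick tree, which stores cumulative weights implicitly and gives O(log n) updates and weighted picks:

```python
import random

class WeightedSampler:
    """Fenwick (binary indexed) tree over weights: O(log n) update and pick."""

    def __init__(self, n):
        self.n = n
        self.tree = [0] * (n + 1)   # 1-based internal array

    def update(self, i, delta):
        """Add delta to the weight of object i (0-based)."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def total(self):
        i, s = self.n, 0
        while i > 0:
            s += self.tree[i]
            i -= i & -i
        return s

    def pick(self):
        """Pick r in 0..total-1, then descend to the object whose
        cumulative-weight range contains r."""
        r = random.randrange(self.total())
        pos, bit = 0, 1
        while bit * 2 <= self.n:
            bit *= 2
        while bit:
            nxt = pos + bit
            if nxt <= self.n and self.tree[nxt] <= r:
                r -= self.tree[nxt]
                pos = nxt
            bit //= 2
        return pos   # 0-based index of the chosen object

s = WeightedSampler(4)
for i, w in enumerate([1, 0, 5, 2]):
    s.update(i, w)
```

An object with weight 0 is never returned, and changing any weight is a single `update` call, which is what makes this suitable when rows are added, removed, or reweighted frequently.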

tc.