Let's say I had a table full of records that I wanted to pull random records from. However, I want certain rows in that table to appear more often than others (and which ones vary by user). What's the best way to go about this, using SQL?

The only way I can think of is to create a temporary table, fill it with the rows I want to be more common, and then pad it with other randomly selected rows from the table. Is there a better way?

+4  A: 

One way I can think of is to create another column in the table which is a rolling sum of your weights, then pull your records by generating a random number between 0 and the total of all your weights, and pull the row with the lowest rolling sum value greater than the random number.

For example, if you had four rows with the following weights:

+---+--------+------------+
|row| weight | rollingsum |
+---+--------+------------+
| a |      3 |          3 |
| b |      3 |          6 |
| c |      4 |         10 |
| d |      1 |         11 |  
+---+--------+------------+

Then, choose a random number n with 0<=n<11 (0 inclusive, 11 exclusive), and return row a if 0<=n<3, b if 3<=n<6, c if 6<=n<10, and d if 10<=n<11.
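
As a rough sketch of the lookup in the example above, here is the same four-row table in an in-memory SQLite database driven from Python. The table name `items`, its columns, and the helper names `pick` and `pick_random` are made up for illustration, and the rolling sum is computed on the fly with a correlated subquery rather than stored in a column. Treat it as one possible way to do it, not a tuned implementation.

```python
import random
import sqlite3

# Hypothetical table and column names; the weights mirror the example table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, weight INTEGER)"
)
conn.executemany(
    "INSERT INTO items (name, weight) VALUES (?, ?)",
    [("a", 3), ("b", 3), ("c", 4), ("d", 1)],
)

# Compute the rolling sum per row, then take the first row whose
# rolling sum is strictly greater than n.
PICK_SQL = """
SELECT name FROM (
    SELECT i.name,
           (SELECT SUM(j.weight) FROM items j WHERE j.id <= i.id) AS rollingsum
    FROM items i
)
WHERE rollingsum > ?
ORDER BY rollingsum
LIMIT 1
"""


def pick(conn, n):
    """Return the name of the row whose weight interval contains n."""
    return conn.execute(PICK_SQL, (n,)).fetchone()[0]


def pick_random(conn):
    """Draw an integer n with 0 <= n < total weight and look up its row."""
    total = conn.execute("SELECT SUM(weight) FROM items").fetchone()[0]
    return pick(conn, random.randrange(total))
```

With these weights, `pick(conn, 2)` lands in row a's interval and `pick(conn, 10)` in row d's. For per-user weights, the same query could read from a per-user weight table instead of a `weight` column.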

Here are some links on generating rolling sums:

http://dev.mysql.com/tech-resources/articles/rolling_sums_in_mysql.html

http://dev.mysql.com/tech-resources/articles/rolling_sums_in_mysql_followup.html

Marquis Wang
Woah, that is awesome! Thank you, it seems like this is the best approach. I'll have to combine it with views, as Shiraz suggested, since I want the weights to differ for each user (would that be a problem?), but otherwise that should work great.
Paul
A: 

I don't know that it can be done very easily with SQL alone. With T-SQL or similar, you could write a loop to duplicate rows, or you can use the SQL to generate the instructions for doing the row duplication instead.

I don't know your probability model, but you could use an approach like this to achieve the latter. Given these table definitions:

RowSource
---------
RowId

UserRowProbability
------------------
UserId
RowId
FrequencyMultiplier

You could write a query like this (SQL Server specific):

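SELECT TOP 100 rs.RowId, ISNULL(urp.FrequencyMultiplier, 1) AS FrequencyMultiplier
FROM RowSource rs
  LEFT JOIN UserRowProbability urp
    ON rs.RowId = urp.RowId
   AND urp.UserId = @UserId  -- @UserId is an assumed parameter for the current user
ORDER BY ISNULL(urp.FrequencyMultiplier, 1) DESC, NEWID()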

This would take care of selecting a random set of rows as well as how many should be repeated. Then, in your application logic, you could do the row duplication and shuffle the results.
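
The application-side step could look something like the sketch below. It assumes the query results arrive as (RowId, FrequencyMultiplier) pairs and that a missing multiplier means "appear once". The function name `expand_and_shuffle` is invented for this example.

```python
import random


def expand_and_shuffle(rows, rng=random):
    """Repeat each row id by its multiplier, then shuffle the result.

    rows: iterable of (row_id, multiplier) pairs; a multiplier of None
    is treated as 1 (an assumption, matching the ISNULL default).
    rng: the random module or a random.Random instance.
    """
    expanded = []
    for row_id, multiplier in rows:
        copies = 1 if multiplier is None else int(multiplier)
        expanded.extend([row_id] * copies)
    rng.shuffle(expanded)
    return expanded


# Example input: row 1 should appear three times, row 3 twice.
rows = [(1, 3), (2, None), (3, 2)]
result = expand_and_shuffle(rows)
```

Passing a seeded `random.Random` instance as `rng` makes the shuffle reproducible, which can help when testing.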

Jacob
A: 

Start with three tables: users, data, and user-data. The user-data table records which rows should be preferred for each user.

Then create one view based on the data rows that are preferred by the user.

Create a second view that has the non-preferred data.

Create a third view which is a union of the first two. The union should select more rows from the preferred data.

Then finally select random rows from the third view.
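
A loose sketch of the same idea, run against an in-memory SQLite database from Python. It swaps the three views for inline subqueries, since a plain view generally cannot take the user id as a parameter. The table layout, the `sample_for_user` helper, and the 2-to-1 split between preferred and other rows are all assumptions made for the example.

```python
import sqlite3

# Hypothetical schema: user 1 prefers data rows 1, 2 and 3.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data (id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE user_data (user_id INTEGER, data_id INTEGER);
INSERT INTO data (id, label)
VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e'), (6, 'f');
INSERT INTO user_data (user_id, data_id) VALUES (1, 1), (1, 2), (1, 3);
""")

# Random preferred rows, plus fewer random non-preferred rows,
# combined and shuffled by the outer ORDER BY RANDOM().
MIX_SQL = """
SELECT id, label FROM (
    SELECT id, label FROM (
        SELECT d.id, d.label
        FROM data d
        JOIN user_data ud ON ud.data_id = d.id
        WHERE ud.user_id = :user_id
        ORDER BY RANDOM() LIMIT :n_preferred
    )
    UNION ALL
    SELECT id, label FROM (
        SELECT d.id, d.label
        FROM data d
        WHERE d.id NOT IN (
            SELECT data_id FROM user_data WHERE user_id = :user_id
        )
        ORDER BY RANDOM() LIMIT :n_other
    )
)
ORDER BY RANDOM()
"""


def sample_for_user(conn, user_id, n_preferred=2, n_other=1):
    """Return a shuffled mix that favors the user's preferred rows."""
    params = {
        "user_id": user_id,
        "n_preferred": n_preferred,
        "n_other": n_other,
    }
    return conn.execute(MIX_SQL, params).fetchall()
```

Calling `sample_for_user(conn, 1)` returns three rows, two drawn from the preferred set and one from the rest. Adjusting `n_preferred` and `n_other` changes how strongly the preferred rows are favored.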

Shiraz Bhaiji