views:

72

answers:

2

I have some entries in my database, in my case Videos with a rating and popularity and other factors. Of all these factors I calculate a likelihood factor or more to say a boost factor.

So I essentially have the fields ID and BOOST.The boost is calculated in a way that it turns out as an integer that represents the percentage of how often this entry should be hit in in comparison.

ID  Boost
1   1
2   2
3   7

So if I run my random function indefinitely I should end up with X hits on ID 1, twice as much on ID 2 and 7 times as much on ID 3.

So every hit should be random but with a probability of (boost / sum of boosts). So the probability for ID 3 in this example should be 0.7 (because the sum is 10. I choose those values for simplicity).

I thought about something like the following query:

SELECT id FROM table WHERE CEIL(RAND() * MAX(boost)) >= boost ORDER BY rand();

Unfortunately that doesn't work, after considering the following entries in the table:

ID  Boost
1   1
2   2

It will, with a 50/50 chance, have only the 2nd or both elements to choose from randomly.

So 0.5 hit goes to the second element And 0.5 hit goes to the (second and first) element which is chosen from randomly so so 0.25 each. So we end up with a 0.25/0.75 ratio, but it should be 0.33/0.66

I need some modification or new a method to do this with good performance.

I also thought about storing the boost field cumulatively so I just do a range query from (0-sum()), but then I would have to re-index everything coming after one item if I change it or develop some swapping algorithm or something... but that's really not elegant and stuff.

Both inserting/updating and selecting should be fast!

Do you have any solutions to this problem?

The best use case to think of is probably advertisement delivery. "Please choose a random ad with given probability"... however i need it for another purpose but just to give you a last picture what it should do.

edit:

Thanks to kens answer i thought about the following approach:

  1. calculate a random value from 0-sum(distinct boost)

    SET @randval = (select ceil(rand() * sum(DISTINCT boost)) from test);

  2. select the boost factor from all distinct boost factors which added up surpasses the random value

then we have in our 1st example 1 with a 0.1, 2 with a 0.2 and 7 with a 0.7 probability.

  1. now select one random entry from all entries having this boost factor

PROBLEM: because the count of entries having one boost is always different. For example if there is only 1-boosted entry i get it in 1 of 10 calls, but if there are 1 million with 7, each of them is hardly ever returned... so this doesnt work out :( trying to refine it.

I have to somehow include the count of entries with this boost factor ... but i am somehow stuck on that...

+2  A: 

I dare to suggest straightforward solution with two queries, using cumulative boost calculation.

First, select sum of boosts, and generate some number between 0 and boost sum:

select ceil(rand() * sum(boost)) from table;

This value should be stored as a variable, let's call it {random_number}

Then, select table rows, calculating cumulative sum of boosts, and find the first row, which has cumulative boost greater than {random number}:

SET @cumulative_boost=0;
SELECT
  id,
  @cumulative_boost:=(@cumulative_boost + boost) AS cumulative_boost,
FROM
  table
WHERE
  cumulative_boost >= {random_number}
ORDER BY id
LIMIT 1;
Kel
i like the approach but this whole adding up is exactly the thing i wanted to avoid. but +1 because you may have pointed me into the right direction. i am just working on a slution where i just add up the boosts commulative and then know exactly what boost value to choose randomly from. havent thought it through completely yet but i think it will do.
Joe Hopfgartner
It would be good if you publish your solution here when you work it out, since this problem looks rather interesting.
Kel
didnt work out unfortunately but i am trying to refine it. there must be a way this is such a common problem!
Joe Hopfgartner
+3  A: 

You need to generate a random number per row and weight it.

In this case, RAND(CHECKSUM(NEWID())) gets around the "per query" evaluation of RAND. Then simply multiply it by boost and ORDER BY the result DESC. The SUM..OVER gives you the total boost

DECLARE @sample TABLE (id int, boost int)

INSERT @sample VALUES (1, 1), (2, 2), (3, 7)

SELECT
    RAND(CHECKSUM(NEWID())) * boost  AS weighted,
    SUM(boost) OVER () AS boostcount,
    id
FROM
    @sample
GROUP BY
    id, boost
ORDER BY
    weighted DESC

If you have wildly different boost values (which I think you mentioned), I'd also consider using LOG (which is base e) to smooth the distribution.

Finally, ORDER BY NEWID() is a randomness that would take no account of boost. It's useful to seed RAND but not by itself.

This sample was put together on SQL Server 2008, BTW

gbn
+1 looks promising, gotta port it to mysql and think it through once more b4 accepting :)
Joe Hopfgartner