views:

62

answers:

2

There are a few hundred of book records in the database and each record has a publish time. In the homepage of the website, I am required to write some codes to randomly pick 10 books and put them there. The requirement is that newer books need to have higher chances of getting displayed.

Since the time is an integer, I am thinking like this to calculate the probability for each book:

Probability of a book to be drawn = (current time - publish time of the book) / ((current time - publish time of the book1) +  (current time - publish time of the book1) + ... (current time - publish time of the bookn))

After a book is drawn, the next round of the loop will minus the (current time - publish time of the book) from the denominator and recalculate the probability for each of the remaining books, the loop continues until 10 books have been drawn.

Is this algorithm a correct one?

By the way, the website is written in PHP.

Feel free to suggest some PHP codes if you have a better algorithm in your mind.

Many thanks to you all.

+1  A: 

First off I think your formula will guarantee that earlier books get picked. Try to set your initial probabilities based on:

Age - days since publication

Max(Age) - oldest book in the sample

Book Age(i) - age of book i

... Prob (i) = [Max (age) + e - Book Age (i)] / sum over all i [ Max (age) + e - Book age(i) ]

The value e ensures that the oldest book has some probability of being selected. Now that that is done, you can always recalc the prob of any sample.

Now you have to find an UNBIASED way of picking books. Probably the best way would be to calculate the cumulative distribution using the above then pick a uniform (0,1) r.v. Find where that r.v. is in the cumulative distribution and pick the book nearest to it.

Can't help you on the coding. Make sense?

Grembo
+1  A: 

Here's a very similar question that may help: http://stackoverflow.com/questions/56692/random-weighted-choice The solution is in C# but the code is very readable and close to PHP syntax so it should be easy to adapt.

For example, here's how one could do this in MySQL:

First calculate the total age of all books and store it in a MySQL user variable:

SELECT SUM(TO_DAYS(CURDATE())-TO_DAYS(publish_date)) FROM books INTO @total;

Then choose books randomly, weighted by their age:

SELECT book_id FROM (
  SELECT book_id, TO_DAYS(CURDATE())-TO_DAYS(publish_date) AS age FROM books
) b
WHERE book_id NOT IN (...list of book_ids chosen so far...)
  AND RAND()*@total < b.age AND (@total:[email protected])
ORDER BY b.publish_date DESC
LIMIT 10;

Note that the @total decreases only if a book has passed the random-selection test, because of short-circuiting of AND expressions.

This is not guaranteed to choose 10 books in one pass -- it's not even guaranteed to choose any books on a given pass. So you have to re-run the second step until you've found 10 books. The @total variable retains its decreased value so you don't have to recalculate it.

Bill Karwin