
I am trying to find a way to get a random selection from a large dataset.

We expect the set to grow to ~500K records, so it is important to find a way that keeps performing well as the set grows.

I tried a technique from: http://forums.mysql.com/read.php?24,163940,262235#msg-262235
But it's not exactly random, and it doesn't play well with a LIMIT clause: you don't always get the number of records you want.

So I thought: since the PK is auto_increment, I could just generate a list of random IDs and use an IN clause to select the rows I want. The problem with that approach is that sometimes I need a random set of records with a specific status, a status found in at most 5% of the total set. To make that work, I would first need to find out which IDs have that specific status, so that's not going to work either.
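To illustrate, the idea was something along these lines (the table name and ids are made up for the example):

    -- pick e.g. 10 random numbers between 1 and MAX(id) in application code,
    -- then fetch the matching rows in one round trip:
    SELECT * FROM MyTable WHERE id IN (7, 184, 6013, 225481);
    -- with holes in the id sequence (or a status filter on top) this
    -- returns fewer rows than requested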

I am using MySQL 5.1.46 with the MyISAM storage engine.
It might be important to know that the query to select the random rows is going to be run very often, and the table it selects from is appended to frequently.

Any help would be greatly appreciated!

+1  A: 

Check out this article by Jan Kneschke... It does a great job of explaining the pros and cons of the different approaches to this problem...

ircmaxell
One of the first sentences in the article is: "For the first examples we assume the ID is starting at 1 and we have no holes between 1 and the maximum value of the ID." Which is exactly the problem I am facing: I **do** have holes.
Dennis Haarbrink
Read further down. There's a whole section on `adding holes to the numbers` and `Maintaining the holes Table with Triggers`...
ircmaxell
I see... I judged too quickly! He suggests the same approach as Paul Sasik, so I guess we're good here!
Dennis Haarbrink
Yup. It's an excellent article (IMHO), which is why I posted it ;-)...
ircmaxell
A: 

You can do this efficiently, but you have to do it in two queries.

First, get a random offset scaled by the number of rows that match your conditions:

SELECT FLOOR(RAND() * (SELECT COUNT(*) FROM MyTable WHERE ...conditions...))

This returns an integer between 0 and the matching row count minus one. Next, use the integer as an offset in a LIMIT clause:

SELECT * FROM MyTable WHERE ...conditions... LIMIT 1 OFFSET ?
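Putting the two steps together in MySQL itself could look roughly like this (a sketch only: the `status = 'B'` filter stands in for your real conditions, and the OFFSET placeholder requires a prepared statement):

    -- step 1: random offset in [0, count-1] for the filtered set
    SELECT FLOOR(RAND() * COUNT(*)) INTO @skip
      FROM MyTable WHERE status = 'B';
    -- step 2: bind the offset; MySQL allows ? for LIMIT/OFFSET only
    -- inside prepared statements
    PREPARE pick FROM
      'SELECT * FROM MyTable WHERE status = ''B'' LIMIT 1 OFFSET ?';
    EXECUTE pick USING @skip;
    DEALLOCATE PREPARE pick;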

Not every problem must be solved in a single SQL query.

Bill Karwin
So what you're saying is to execute the query twice, and then seek past potentially several hundred thousand rows? I wouldn't necessarily call that efficient... A better way (using your two-query approach) would be adding `WHERE id >= ? LIMIT 1` to the second query. That way it doesn't need to load the first n rows to find the row you want (it can short-circuit that step and apply significant optimizations)...
ircmaxell
The problem with this approach is that you can only retrieve one record at a time. When you increase the `LIMIT` it's not a random set anymore; only the offset is random.
Dennis Haarbrink
@ircmaxell: The offset is not an id value. Using `WHERE id >= ?` is non-random: rows that come after gaps are picked with greater frequency. For example, with ids {1, 2, 10} and a random value between 1 and 10, id 10 gets picked 80% of the time.
Bill Karwin
@Dennis Haarbrink: Right, this technique is useful for picking one random row at a time.
Bill Karwin
+1  A: 

You could solve this with some denormalization:

  • Build a secondary table that contains the same pkeys and statuses as your data table
  • Add and populate a status group column, a kind of sub-pkey that you number yourself (a 1-based auto-increment relative to a single status):
Pkey    Status    StatusPkey
1       A         1
2       A         2
3       B         1
4       B         2
5       C         1
...     C         ...
n       C         m (where m = # of C statuses)

When you don't need to filter, you can generate random numbers against the pkey as you mentioned above. When you do need to filter, generate random numbers against the StatusPkeys of the particular status you're interested in.

There are several ways to build this table. You could have a procedure that runs on an interval, or you could do it live. The latter would be a performance hit though, since calculating the StatusPkey could get expensive.
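A rough sketch of what that could look like, with made-up names (none of these are prescribed by the answer, and the per-status numbering uses the common MySQL user-variable idiom, whose evaluation order isn't formally guaranteed):

    -- hypothetical secondary table; assumes a data table MyTable(id, status)
    CREATE TABLE status_index (
        pkey        INT UNSIGNED NOT NULL,
        status      CHAR(1)      NOT NULL,
        status_pkey INT UNSIGNED NOT NULL,  -- dense 1..m within each status
        PRIMARY KEY (status, status_pkey)
    ) ENGINE=MyISAM;

    -- (re)populate: number the rows per status
    TRUNCATE status_index;
    SET @n := 0, @prev := NULL;
    INSERT INTO status_index (pkey, status, status_pkey)
    SELECT id, status, rn
      FROM (SELECT id, status,
                   @n    := IF(status <=> @prev, @n + 1, 1) AS rn,
                   @prev := status AS prev_status
              FROM MyTable
             ORDER BY status, id) AS numbered;

    -- pick one random 'B' row: because status_pkey is dense, a uniform
    -- integer in [1, m] is a uniform pick with no retries
    SELECT t.*
      FROM status_index si
      JOIN MyTable t ON t.id = si.pkey
     WHERE si.status = 'B'
       AND si.status_pkey =
           (SELECT FLOOR(1 + RAND() * MAX(status_pkey))
              FROM status_index
             WHERE status = 'B');

For a set of N random rows, generate N distinct numbers in [1, m] in application code and use them in an IN clause against StatusPkey.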

Paul Sasik
This approach also came to mind, but I thought of it as too expensive. I didn't realize then that it would occupy only a maximum of 5% of the total (I thought it would be greater). So I think this is the approach I will take. I will create a background process for it though; I don't want that performance hit in my frontend.
Dennis Haarbrink
How about populating (or recreating) the table on an interval? Once a day or once an hour? Or does it need to be near-live?
Paul Sasik
@Paul: That's what I intend to do: I'm going to set up a cron job for this purpose. I think I can keep the update frequency fairly high (say, every five minutes), given that I can register when a status changes. So all I have to do is check the 'dirty' flag and repopulate the table! (I might even be able to pull off doing just updates if I keep track of the exact status changes).
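A purely hypothetical shape for that check (it assumes the status changes set a `dirty` flag, which is not spelled out above):

    -- cron job: rebuild only when something changed since the last run
    SELECT COUNT(*) INTO @dirty FROM MyTable WHERE dirty = 1;
    -- if @dirty > 0: rerun the repopulation from the answer above,
    -- then clear the flags
    UPDATE MyTable SET dirty = 0 WHERE dirty = 1;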
Dennis Haarbrink