views:

1312

answers:

5

Hi all,

I'm trying to write a GQL query that returns N random records of a specific kind. My current implementation works but requires N calls to the datastore. I'd like to make it 1 call to the datastore if possible.

I currently assign a random number to every kind that I put into the datastore. When I query for a random record I generate another random number and query for records > rand ORDER BY asc LIMIT 1.

This works, however, it only returns 1 record so I need to do N queries. Any ideas on how to make this one query? Thanks.

+2  A: 

A single call to datastore can only return a set of consecutive rows from some index. This is why some GQL queries, including any use of !=, expand to multiple datastore calls.

N independent uniform random selections are not (in general) consecutive in any index.

QED.

You could probably use memcache to store the entities, and reduce the cost of grabbing N of them. Or if you don't mind the "random" selections being close together in the index, select a randomly-chosen block of (say) 100 in one query, then pick N at random from those. Since you have a field that's already randomised, it won't be immediately obvious to an outsider that the N items are related. At least, not until they look at a lot of samples and notice that items A and Z never appear in the same group, because they're more than 100 apart in the randomised index. And if performance permits, you can re-randomise your entities from time to time.

Steve Jessop
Thanks - I do really need randomized results so I guess I will have to use multiple datastore calls. I'll try to minimize N as much as I can I guess.
aloo
A: 

Looks like the only method is by storing the random integer value in each entity's special property and querying on that. This can be done quite automatically if you just add an automatically initialized property.

Unfortunately this will require processing of all entities once if your datastore is already filled in.

It's weird, I know.

Slava Tutushkin
+2  A: 

What kind of tradeoffs are you looking for? If you are willing to put up with a small performance hit on inserting these entities, you can create a solution to get N of them very quickly.

Here's what you need to do:

When you insert your Entities, specify the key. You want to give keys to your entities in order, starting with 1 and going up from there. (This will require some effort, as app engine doesn't have autoincrement() so you'll need to keep track of the last id you used in some other entity, let's call it an IdGenerator)

Now when you need N random entities, generate N random numbers between 1 and whatever the last id you generated was (your IdGenerator will know this). You can then do a batch get by key using the N keys, which will only require one trip to the datastore, and will be faster than a query as well, since key gets are generally faster than queries, AFAIK.

This method does require dealing with a few annoying details:

  1. Your IdGenerator might become a bottleneck if you are inserting lots of these items on the fly (more than a few a second), which would require some kind of sharded IdGenerator implementation. If all this data is preloaded, or is not high volume, you have it easy.
  2. You might find that some Id doesn't actually have an entity associated with it anymore, because you deleted it or because a put() failed somewhere. If this happened you'd have to grab another random entity. (If you wanted to get fancy and reduce the odds of this you could make this Id available to the IdGenerator to reuse to "fill in the holes")

So the question comes down to how fast you need these N items vs how often you will be adding and deleting them, and whether a little extra complexity is worth that performance boost.

Peter Recore
You can more or less implement this using App Engine's built in numbering for IDs - if you know the maximum ID, you can pick some uniformly at random. Some won't exist, so retry them with new random ids, and so forth. If your ID space is dense, this will work fine.
Nick Johnson
sweet. I didn't know we could rely on the built in numbering to start at 1 and go up 1 by 1 from there.
Peter Recore
You can't - but it will allocate in blocks, and as long as the blocks get _mostly_ used, your retries should be small enough to be manageable.
Nick Johnson
A: 

I just had the same problem. I decided not to assign IDs to my already existing entries in datastore and did this, as I already had the totalcount from a sharded counter.

This selects "count" entries from "totalcount" entries, sorted by key.

    # select $count from the complete set
    numberlist = random.sample(range(0,totalcount),count)
    numberlist.sort()

    pagesize=1000

    #initbuckets
    buckets = [ [] for i in xrange(int(max(numberlist)/pagesize)+1) ]
    for k in numberlist:
        thisb = int(k/pagesize)
        buckets[thisb].append(k-(thisb*pagesize))
    logging.debug("Numbers: %s. Buckets %s",numberlist,buckets)

    #page through results.

    result = []
    baseq =  db.Query(MyEntries,keys_only=True).order("__key__")
    for b,l in enumerate(buckets):
        if len(l) > 0: 
            result += [ wq.fetch(limit=1,offset=e)[0] for e in l ]

        if b < len(buckets)-1: # not the last bucket
            lastkey  = wq.fetch(1,pagesize-1)[0]
            wq = baseq.filter("__key__ >",lastkey)

Beware that this to me is somewhat complex, and I'm still not conviced that I dont have off-by-one or off-by-x errors.

And beware that if count is close to totalcount this can be very expensive. And beware that on millions of rows it might not be possible to do within appengine time boundaries.

svrist
+1  A: 

I agree to the answer from Steve, there is no such way to retrieve N random rows in one query.

However, even the method of retrieving one single entity does not usually work such that the prbability of the returned results is evenly distributed. The probability of returning a given entity depends on the gap of it's randomly assigned number and the next higher random number. E.g. if random numbers 1,2, and 10 have been assigned (and none of the numbers 3-9), the algorithm will return "2" 8 times more often than "1".

I have fixed this in a slightly more expensice way. If someone is interested, I am happy to share

Dirk