views:

124

answers:

3

I have a few hundred keys, all of the same Model, which I have pre-computed:

candidate_keys = [db.Key(...), db.Key(...), db.Key(...), ...]

Some of these keys refer to actual entities in the datastore, and some do not. I wish to determine which keys do correspond to entities.

It is not necessary to know the data within the entities, just whether they exist.

One solution would be to use db.get():

keys_with_entities = set()
for entity in db.get(candidate_keys):
  if entity:
    keys_with_entities.add(entity.key())

However this procedure would fetch all entity data from the store which is unnecessary and costly.

A second idea is to use a Query with an IN filter on key_name, manually fetching in chunks of 30 to fit the requirements of the IN pseudo-filter. However keys-only queries are not allowed with the IN filter.

Is there a better way?

A: 

At this point, the only solution I have is to manually query by key with keys_only=True, once per key.

for key in candidate_keys:
  if MyModel.all(keys_only=True).filter('__key__ =', key).count():
    keys_with_entities.add(key)

This may in fact be slower then just loading the entities in batch and discarding them, although the batch load also hammers the Data Received from API quota.

jhs
A: 

How not to do it (update based on Nick Johnson's answer):

I am also considering adding a parameter specifically for the purpose of being able to scan for it with an IN filter.

class MyModel(db.Model):
  """Some model"""
  # ... all the old stuff
  the_key = db.StringProperty(required=True) # just a duplicate of the key_name

#... meanwhile back in the example

for key_batch in batches_of_30(candidate_keys):
  key_names = [x.name() for x in key_batch]
  found_keys = MyModel.all(keys_only=True).filter('the_key IN', key_names)
  keys_with_entities.update(found_keys)

The reason this should be avoided is that the IN filter on a property sequentially performs an index scan, plus lookup once per item in your IN set. Each lookup takes 160-200ms so that very quickly becomes a very slow operation.

jhs
+2  A: 

IN filters are not supported directly by the App Engine datastore; they're a convenience that's implemented in the client library. An IN query with 30 values is translated into 30 equality queries on one value each, resulting in 30 regular queries!

Due to round-trip times and the expense of even keys-only queries, I suspect you'll find that simply attempting to fetch all the entities in one batch fetch is the most efficient. If your entities are large, however, you can make a further optimization: For every entity you insert, insert an empty 'presence' entity as a child of that entity, and use that in queries. For example:

foo = AnEntity(...)
foo.put()
presence = PresenceEntity(key_name='x', parent=foo)
presence.put()
...
def exists(keys):
  test_keys = [db.Key.from_path('PresenceEntity', 'x', parent=x) for x in keys)
  return [x is not None for x in db.get(test_keys)]
Nick Johnson
Thanks. I would point others with this issue to <a href="http://stackoverflow.com/questions/1003247/app-engine-datastore-in-operator-how-to-use">your answer to another IN operator question</a> where you explain that db.get() takes 20-40ms whereas a fetch with an IN filter requires 160-200ms **times the number of attributes in your query**.
jhs
Link to question: http://stackoverflow.com/questions/1003247/app-engine-datastore-in-operator-how-to-use
jhs
Correct me if I'm wrong but I think you might want to `return [x.parent() for x in db.get(test_keys) if x is not None]` so that you actually get the keys pointing to the actual entities. Also of note is that this solution works for keys of any Model.
jhs
Correction to my correction: `[x.key().parent() for x in db.get(test_keys) if x is not None]`
jhs
Well, in the example, exists() returns an array of booleans indicating if the corresponding value in the input array exists in the DB, while your change returns the list of keys that do exist. Which is more useful depends on what you need it for. :)
Nick Johnson