views:

45

answers:

1

In my App Engine datastore I've got an entity type which may hold a large number of entities, each of which will have the property 'customer_id'. For example, lets say a given customer_id has 10,000 entities, and there are 50,000 customer_ids.

I'm trying to filter this effectively, so that a user could get information for at least 2000 customer_ids at one time. That is, read them out to the UI within the 30 second time out limit (further filtering will be done at the front end, so the user isn't bombarded with all results at once).

Below I've listed a view of my current datastore models. 'Reports' refer to sets of customer_ids, so continuing the above example, I could get my 2000 customer_ids from ReportCids.

class Users(db.Model):
  user = db.StringProperty()
  report_keys_list = db.ListProperty(db.Key)

class Reports(db.Model):
  #report_key
  report_name = db.StringProperty()

class ReportCids(db.Model):
  report_key_reference = db.ReferenceProperty(Reports, collection_name="report_cid_set")
  customer_id = db.IntegerProperty()
  start_timestamp = db.IntegerProperty()
  end_timestamp = db.IntegerProperty()

class CustomerEvent(db.Model):
  customer_id = db.IntegerProperty()
  timestamp = db.IntegerProperty()
  event_type =  db.IntegerProperty()

The options I considered:
-Perform a separate query for each customer_in in my set of 2000
-Use lists of keys indicating customer events, but this is limited to 5000 entries in a list (so I've read)
-Get all entries, and filter in my code

I'd really appreciate if anyone had some advice on how to do this in the most efficient way, or if I'm approaching the problem in completely the wrong way. I'm a novice in terms of using the datastore effectively.

Of course happy to provide any clarification or information if it helps.
Thanks very much!

A: 

Hi Guys,

Thanks for getting back to me. It looks like I had an issue with the account used when I posted so I need to respond in the comments here.

Having thought about it and from what you've said, fetching that many results is not going to work.

Here's what I'm trying to do: I'm trying to make a report that shows for multiple Customer IDs the events that happened for that group of customers. So lets say I have a report to view information for 2000 customers. I want to be able to get all events (CustomerEvent) and then filter this by event_type. I'm likely asking a lot here, but what I was hoping to do was get all of these events for 2000 customers, and then do the event_type filtering at the front end, so that a user could dynamically adjust the event_type they want to look for and get some information on successful actions for that event type.

So my main problem is getting the right entities out of CustomerEvent effectively.

Right now, I'm grabbing a list of Customer IDs like this:

cid_list = []  
this_report = models.Reports.get(report_key)  
if this_report.report_cid_set:  
  for entity in this_report.report_cid_set:  
    cid_list.append(entity.customer_id)  

My estimation of 10,000 CustomerEvent entities was quite high, but theoretically it could happen. Perhaps when I go to get the report results, I could filter straight away by the event_type specified by the user. This means that I have to go back to the datastore each time they choose a new option which isn't ideal, but perhaps it's my only option given this set up.

Thanks very much for taking the time to look at this!

oli
How are you going to present 20 million records to the user in a comprehensible fashion? Presumably this will involve some sort of aggregation - in which case you need to do the aggregation at write-time, and retrieve the pre-aggregated data.
Nick Johnson