I'm trying to store the list of Amazon EC2 images in the Google datastore. I want to do this with a cron job inside GAE (Google App Engine).

import boto.ec2

from google.appengine.ext import db
from google.appengine.ext import webapp


class AmazonEC2uswest(db.Model):
    ami = db.StringProperty(required=True)
    mani = db.StringProperty()
    typ = db.StringProperty()
    arch = db.StringProperty()
    state = db.StringProperty()
    owner = db.StringProperty()

class CronAMIsAmazonUS_WEST(webapp.RequestHandler):
    def get(self):
      aws_access_key_id_admin = "<secret>"
      aws_secret_access_key_admin = "<secret>"

      conn_us_west = boto.ec2.connect_to_region('us-west-1',
                                                aws_access_key_id=aws_access_key_id_admin,
                                                aws_secret_access_key=aws_secret_access_key_admin,
                                                is_secure=False)

      # Fetching the image list is fast (a few seconds).
      liste_images_us_west = conn_us_west.get_all_images()

      laenge_liste_images_us_west = len(liste_images_us_west)

      # One datastore put() per image: this is the slow part.
      for i in range(laenge_liste_images_us_west):
              datastore_uswest_AMIs = AmazonEC2uswest(ami=liste_images_us_west[i].id,
                                                      mani=str(liste_images_us_west[i].location),
                                                      typ=liste_images_us_west[i].type,
                                                      arch=liste_images_us_west[i].architecture,
                                                      state=liste_images_us_west[i].state,
                                                      owner=liste_images_us_west[i].ownerId)
              datastore_uswest_AMIs.put()

The problem: fetching the list with get_all_images() takes only a few seconds, but writing the data to the Google datastore needs far too much CPU time.

On my IBM T42p (2 GHz P4M), that piece of code takes approximately one minute!

Is it possible to optimize my code so that it needs less CPU time?

+3  A: 

First possible optimisation: create all the entities in your loop, and then call db.put() with a list of all of them after you're finished. Something like:

entities = []
for i in range(laenge_liste_images_us_west):
    datastore_uswest_AMIs = AmazonEC2uswest(...)
    entities.append(datastore_uswest_AMIs)
db.put(entities)

or:

db.put([AmazonEC2uswest(...) for image in liste_images_us_west])

If that's still too slow, the right thing to do is probably:

  1. Get the list of images.
  2. Divide these up into small batches, each of which can complete comfortably in under 30 seconds. In your example, which currently takes a minute, that means at least 4 batches, maybe more; the number of batches should depend on how many images you get.
  3. For each batch, add a task to a task queue, specifying which images to add to the DB (see the sketch below). You might do this by passing all the data, or just a range of images to handle. Which you choose depends on being able to store the data temporarily: there's a limit to how much you can store in a task, and if you go past it you could use memcache, or store only the image IDs rather than all the fields. Or you could create more tasks, so that the data for each batch stays under the limit.
  4. In the task handler, process just that batch. If you already have all the data, great; otherwise fetch it again with get_all_images. Then generate and store just the entities that belong to this batch.

You don't have to use tasks; cron alone could handle it if you can remember how far you got the last time the job ran and continue from there next time. But tasks seem appropriate to me.
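
For illustration, here is one minimal way the task-based approach could look. This is only a sketch: the /tasks/store_amis URL, the StoreAMIsTask handler name, the connect_us_west helper and BATCH_SIZE are placeholder choices, not anything from the answer itself, and it reuses the AmazonEC2uswest model from the question. Passing only the image IDs keeps each task payload small; each handler re-fetches the details for its own batch with get_all_images(image_ids=...).

import boto.ec2

from google.appengine.api import taskqueue
from google.appengine.ext import db
from google.appengine.ext import webapp

aws_access_key_id_admin = "<secret>"
aws_secret_access_key_admin = "<secret>"

BATCH_SIZE = 100  # assumed value; tune so each task finishes well within the deadline


def connect_us_west():
    return boto.ec2.connect_to_region(
        'us-west-1',
        aws_access_key_id=aws_access_key_id_admin,
        aws_secret_access_key=aws_secret_access_key_admin,
        is_secure=False)


class CronAMIsAmazonUS_WEST(webapp.RequestHandler):
    def get(self):
        # The cron job only fetches the image list and enqueues tasks;
        # it does no datastore writes itself.
        images = connect_us_west().get_all_images()
        for start in range(0, len(images), BATCH_SIZE):
            batch_ids = [img.id for img in images[start:start + BATCH_SIZE]]
            # Pass only the image IDs to stay under the task payload limit.
            taskqueue.add(url='/tasks/store_amis',
                          params={'ids': ','.join(batch_ids)})


class StoreAMIsTask(webapp.RequestHandler):
    def post(self):
        ids = self.request.get('ids').split(',')
        # Re-fetch just this batch's image data from EC2.
        images = connect_us_west().get_all_images(image_ids=ids)
        entities = [AmazonEC2uswest(ami=img.id,
                                    mani=str(img.location),
                                    typ=img.type,
                                    arch=img.architecture,
                                    state=img.state,
                                    owner=img.ownerId)
                    for img in images]
        db.put(entities)  # one batched put per task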

Steve Jessop
Hi, it's impossible to split the truckload. The EC2 API provides no ability to filter the returned list on the server side.
Neverland
@Neverland: Actually, I missed that `get_all_images()` is completing quickly and therefore is not the problem. Sorry about that, I've been rewriting my answer...
Steve Jessop
@Steve: How can I create all the entities in one pass?
Neverland
@Steve: The single-pass solution leads to another problem. After a few seconds: RequestTooLargeError: The request to API call datastore_v3.Put() was too large.
Neverland
In that case you need to compromise - put the entities in reasonable size batches. Not one at a time, and not all at once.
Nick Johnson
Try doing a few at a time, then: `db.put(entities[0:k])`, then `[k:2k]` and so on up to `[q*k:]` for some q. Adjust k to be as big as you can while still leaving reasonable room under the maximum size that works.
Steve Jessop
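
For reference, a minimal sketch of that chunked put, assuming entities is the list built as in the answer above; k = 500 is just an assumed starting value to reduce if the requests are still too large.

k = 500  # assumed batch size; shrink it if RequestTooLargeError still occurs
for start in range(0, len(entities), k):
    db.put(entities[start:start + k])
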
@Steve: For the region us-west-1 it works great when I split into batches via db.put(entities[0:k]); there are just 670 entries there. But for the regions us-east-1 and eu-west-1 it doesn't work: I get SAXParseExceptions. I think the XML files are too large and the parser just stops.
Neverland
When I try to write more values into the datastore, I get another exception: google.appengine.runtime.DeadlineExceededError. I think Google is way too strict with their quotas.
Neverland
App Engine is designed with a particular purpose in mind: to allow apps to scale well by parallelizing and distributing work. Single operations that take a long time are a barrier to that, so App Engine prevents them. If you don't need that kind of distribution, you could use a more traditional LAMP stack instead. Sure, it's annoying that cron jobs can't run long, but the task queue API is designed specifically to let you easily schedule the pieces of a long-running operation once you've divided it up.
Steve Jessop