I'm trying to store the list of Amazon EC2 images in the Google datastore. I want to do this with a cron job inside GAE (Google App Engine).

import boto.ec2

from google.appengine.ext import db
from google.appengine.ext import webapp


class AmazonEC2uswest(db.Model):
    ami = db.StringProperty(required=True)
    mani = db.StringProperty()
    typ = db.StringProperty()
    arch = db.StringProperty()
    state = db.StringProperty()
    owner = db.StringProperty()

class CronAMIsAmazonUS_WEST(webapp.RequestHandler):
    def get(self):
      aws_access_key_id_admin = "<secret>"
      aws_secret_access_key_admin = "<secret>"

      conn_us_west = boto.ec2.connect_to_region('us-west-1',
                                                aws_access_key_id=aws_access_key_id_admin,
                                                aws_secret_access_key=aws_secret_access_key_admin,
                                                is_secure=False)

      # Fetching the image list is fast (a few seconds).
      liste_images_us_west = conn_us_west.get_all_images()

      laenge_liste_images_us_west = len(liste_images_us_west)

      # One datastore put() per image: this is the slow part.
      for i in range(laenge_liste_images_us_west):
              datastore_uswest_AMIs = AmazonEC2uswest(ami=liste_images_us_west[i].id,
                                                      mani=str(liste_images_us_west[i].location),
                                                      typ=liste_images_us_west[i].type,
                                                      arch=liste_images_us_west[i].architecture,
                                                      state=liste_images_us_west[i].state,
                                                      owner=liste_images_us_west[i].ownerId)
              datastore_uswest_AMIs.put()

The problem: fetching the list with get_all_images() takes only a few seconds, but writing the data to the Google datastore needs far too much CPU time.

On my IBM T42p (2 GHz P4M), that piece of code takes approximately one minute!

Is it possible to optimize my code so that it needs less CPU time?

+3  A: 

First possible optimisation: create all the entities in your loop, and then call db.put() with a list of all of them after you're finished. Something like:

entities = []
for i in range(laenge_liste_images_us_west):
    datastore_uswest_AMIs = AmazonEC2uswest(...)
    entities.append(datastore_uswest_AMIs)
db.put(entities)

or:

db.put([AmazonEC2uswest(...) for image in liste_images_us_west])

If that's still too slow, the right thing to do is probably:

  1. Get the list of images.
  2. Divide these up into small batches, each of which can complete comfortably in under 30 seconds. In your example, which currently takes a minute, that means at least 4 batches, maybe more; the number of batches should depend on how many images you get.
  3. For each batch, add a task to a task queue, specifying which images to add to the DB (see the sketch below). You might do this by passing all the data, or just a range of images to handle. Which you choose depends on being able to store the data temporarily: there's a limit to how much you can store in a task, and if you go past it you could use memcache, or store only the image IDs rather than all the fields. Or you could create more tasks, so that the data for each batch stays under the limit.
  4. In the task handler, process just that batch. If you already have all the data, great; otherwise fetch it again with get_all_images. Then generate and store just the entities that belong to this batch.

You don't have to use tasks; cron alone could handle it if you can remember how far you got the last time the job ran and continue from there next time. But tasks seem appropriate to me.
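
For illustration, here is one minimal way the task-based approach could look. This is only a sketch: the /tasks/store_amis URL, the StoreAMIsTask handler name, the connect_us_west helper and BATCH_SIZE are placeholder choices, not anything from the answer itself, and it reuses the AmazonEC2uswest model from the question. Passing only the image IDs keeps each task payload small; each handler re-fetches the details for its own batch with get_all_images(image_ids=...).

import boto.ec2

from google.appengine.api import taskqueue
from google.appengine.ext import db
from google.appengine.ext import webapp

aws_access_key_id_admin = "<secret>"
aws_secret_access_key_admin = "<secret>"

BATCH_SIZE = 100  # assumed value; tune so each task finishes well within the deadline


def connect_us_west():
    return boto.ec2.connect_to_region(
        'us-west-1',
        aws_access_key_id=aws_access_key_id_admin,
        aws_secret_access_key=aws_secret_access_key_admin,
        is_secure=False)


class CronAMIsAmazonUS_WEST(webapp.RequestHandler):
    def get(self):
        # The cron job only fetches the image list and enqueues tasks;
        # it does no datastore writes itself.
        images = connect_us_west().get_all_images()
        for start in range(0, len(images), BATCH_SIZE):
            batch_ids = [img.id for img in images[start:start + BATCH_SIZE]]
            # Pass only the image IDs to stay under the task payload limit.
            taskqueue.add(url='/tasks/store_amis',
                          params={'ids': ','.join(batch_ids)})


class StoreAMIsTask(webapp.RequestHandler):
    def post(self):
        ids = self.request.get('ids').split(',')
        # Re-fetch just this batch's image data from EC2.
        images = connect_us_west().get_all_images(image_ids=ids)
        entities = [AmazonEC2uswest(ami=img.id,
                                    mani=str(img.location),
                                    typ=img.type,
                                    arch=img.architecture,
                                    state=img.state,
                                    owner=img.ownerId)
                    for img in images]
        db.put(entities)  # one batched put per task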

Steve Jessop
Hi, it's impossible to split the truckload. The EC2 API provides no ability to filter the returned list on the server side.
Neverland
@Neverland: Actually, I missed that `get_all_images()` is completing quickly and therefore is not the problem. Sorry about that, I've been rewriting my answer...
Steve Jessop
@Steve: How can I create all the entities in one pass?
Neverland
@Steve: The single-pass solution leads to another problem. After a few seconds: RequestTooLargeError: The request to API call datastore_v3.Put() was too large.
Neverland
In that case you need to compromise - put the entities in reasonable size batches. Not one at a time, and not all at once.
Nick Johnson
Try doing a few at a time, then: `db.put(entities[0:k])`, then `[k:2k]` and so on up to `[q*k:]` for some q. Adjust k to be as big as you can while still leaving reasonable room under the maximum size that works.
Steve Jessop
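
For reference, a minimal sketch of that chunked put, assuming entities is the list built as in the answer above; k = 500 is just an assumed starting value to reduce if the requests are still too large.

k = 500  # assumed batch size; shrink it if RequestTooLargeError still occurs
for start in range(0, len(entities), k):
    db.put(entities[start:start + k])
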
@Steve: For the region us-west-1 it works great when I split into batches via db.put(entities[0:k]); there are just 670 entries there. But for the regions us-east-1 and eu-west-1 it doesn't work: I get SAXParseExceptions. I think the XML files are too large and the parser just stops.
Neverland
When I try to write more values into the datastore, I get another exception: google.appengine.runtime.DeadlineExceededError. I think Google is way too strict with their quotas.
Neverland
App Engine is designed with a particular purpose in mind: to allow apps to scale well by parallelizing and distributing work. Single operations that take a long time are a barrier to that, so App Engine prevents them. If you don't need that kind of distribution, you could use a more traditional LAMP stack instead. Sure, it's annoying that cron jobs can't run long, but the task queue API is designed specifically to let you easily schedule the pieces of a long-running operation once you've divided it up.
Steve Jessop