views:

753

answers:

3

I'm developing software using Google App Engine.

I have a question about the optimal design for the following issue: I need to create and save snapshots of some entities at regular intervals.

In the conventional relational DB world, I would create DB jobs that insert new summary records.

For example, say every hour a job would insert a record containing each active user's current score into the "userrank" table.

I'd like to know the best method to achieve this in Google App Engine. I know that there is the Cron service, but does it allow us to execute jobs which will insert/update thousands of records?

+2  A: 

Have you considered using the remote API instead? This way you could get a shell to your datastore and avoid the timeouts. The Mapper class they demonstrate in that link is quite useful, and I've used it successfully to do batch operations on ~1500 objects.

That said, cron should work fine too. You do have a limit on the time of each individual request so you can't just chew through them all at once, but you can use redirection to loop over as many users as you want, processing one user at a time. There should be an example of this in the docs somewhere if you need help with this approach.
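A rough sketch of that loop in plain Python (the datastore and request handling are simulated here; `BATCH_SIZE`, `snapshot`, and the user list are made-up placeholders, not App Engine APIs):

```python
# Sketch: process users in request-sized batches, resuming from a bookmark.
# On App Engine each call to process_batch would be a separate cron-triggered
# or redirected HTTP request; here they are plain function calls.

BATCH_SIZE = 100  # hypothetical number of users one request can handle

snapshots = []

def snapshot(user):
    snapshots.append(user)  # stand-in for writing a summary record

def process_batch(users, bookmark):
    """Process one request's worth of users; return the next bookmark,
    or None when the whole job is done."""
    batch = users[bookmark:bookmark + BATCH_SIZE]
    for user in batch:
        snapshot(user)
    next_bookmark = bookmark + len(batch)
    return next_bookmark if next_bookmark < len(users) else None

users = ["user%d" % i for i in range(250)]
bookmark = 0
requests = 0
while bookmark is not None:
    bookmark = process_batch(users, bookmark)
    requests += 1

print(requests)        # 3 "requests" of up to 100 users each
print(len(snapshots))  # 250
```

The bookmark would be passed along in the redirect URL (or, later, a task queue payload), so each request picks up exactly where the previous one left off.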

Kiv
thank you, Kiv. consider this scenario: I have 5000 users, and a Cron task can process 100 users before timing out. that means I'll have to call the Cron URL 50 times. Will I be on Google's wanted list if I do that, and just as importantly, is it good practice? and regarding the remote API, do you have any suggestions on how to schedule remote API execution?
shanyu
I don't think that would be a problem. The quotas they give you are very generous, so just keep an eye on your quota details to make sure you're not going over.
Kiv
One way to schedule the remote API execution is to have your local machine run a normal (non-App Engine) cron task that runs your remote API script. This would require that you have your local machine on all the time, though.
Kiv
thank you, again. we'll subscribe to GAE once we have the app up and running. I guess being a paying customer does not change the Cron timeout limits? BTW, the GAE roadmap features task queues for background processing; I wonder if that is what I need.
shanyu
I would not worry about the Cron timeout limit. As long as you can process even one user in the space of a request, you should be able to call the Cron URL 5000 times an hour without having any issue. It's well within your quota, and their systems are designed to handle much more than that load.
Kiv
so I figure I'll modify the code so that it calls itself again and again with a bookmark parameter until the job is completed. thanks.
shanyu
+3  A: 

I think you'll find that snapshotting every user's state every hour isn't something that will scale well no matter what your framework. A more ordinary environment will disguise this by letting you have longer running tasks, but you'll still reach the point where it's not practical to take a snapshot of every user's data, every hour.

My suggestion would be this: Add a 'last snapshot' field, and override the put() method of your model (assuming you're using Python; the same is possible in Java, but I don't know the syntax), such that whenever you update a record, it checks if it's been more than an hour since the last snapshot, and if so, creates and writes a snapshot record.

In order to prevent concurrent updates creating two identical snapshots, you'll want to give the snapshots a key name derived from the time at which the snapshot was taken. That way, if two concurrent updates try to write a snapshot, one will harmlessly overwrite the other.

To get the snapshot for a given hour, simply query for the oldest snapshot newer than the requested period. As an added bonus, since inactive records aren't snapshotted, you're saving a lot of space, too.
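A minimal sketch of that idea in plain Python. A dict stands in for the datastore, and `ScoreRecord.put()` is an ordinary method rather than the real `db.Model.put()` override; the names are illustrative only:

```python
import datetime

datastore = {}  # key_name -> snapshot value; stand-in for the datastore

class ScoreRecord:
    def __init__(self, user, score):
        self.user = user
        self.score = score
        self.last_snapshot = None  # the 'last snapshot' field

    def put(self, now=None):
        now = now or datetime.datetime.utcnow()
        hour = now.replace(minute=0, second=0, microsecond=0)
        if self.last_snapshot is None or hour > self.last_snapshot:
            # Key name derived from the hour: two concurrent writers in the
            # same hour produce the same key, so one harmlessly overwrites
            # the other instead of creating a duplicate snapshot.
            key_name = "%s-%s" % (self.user, hour.isoformat())
            datastore[key_name] = self.score
            self.last_snapshot = hour
        # ... the record itself would be written here ...

rec = ScoreRecord("alice", 10)
t = datetime.datetime(2009, 6, 1, 14, 5)
rec.put(now=t)                     # first write: snapshot for 14:00
rec.score = 20
rec.put(now=t.replace(minute=30))  # same hour: no new snapshot
rec.put(now=t.replace(hour=15))    # next hour: snapshot the new score
print(sorted(datastore))           # one snapshot per hour touched
```

Fetching "the snapshot for hour H" is then a query for the oldest snapshot with a timestamp newer than H.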

Nick Johnson
+1, this would certainly be more efficient.
Kiv
good point, but here is a scenario: suppose the application simulates a financial market. every user holds a portfolio of contracts. the market value of a user's portfolio changes not only when he trades, but when everyone else trades. so the current value of a user's portfolio is not tied to his own actions alone. what can be done in this case?
shanyu
Fair enough. In this situation, I presume it doesn't matter when the snapshot actually runs, though - you can generate past snapshots based on historical data when needed. In this case, I think waiting for the background processing support that's on the roadmap will be your best bet. Cron jobs won't work entirely as Kiv suggests, as you can't have multiple 'minutely' crons, or crons with a period of less than a minute. I would suggest generating and storing a user's historical data on first request, but I'm not positive you can do that fast enough - depends how big their portfolio is.
Nick Johnson
A: 

I would use a combination of Cron jobs and the looping URL fetch method detailed here: http://stage.vambenepe.com/archives/549. This way you can catch your timeouts and begin another request.

To summarize the article: the cron job calls your initial process, and you catch the timeout error and call the process again, masked as a second URL. You have to ping-pong between two URLs to keep App Engine from thinking you are in an accidental loop. You also need to be careful that you do not loop infinitely. Make sure there is an end state for your updating loop; otherwise it would put you over your quotas pretty quickly.

elkelk