When I first started developing this project, there was no requirement for generating large files; however, it is now a deliverable.

Long story short, GAE just doesn't play nice with any large-scale data manipulation or content generation. The lack of file storage aside, even something as simple as generating a PDF with ReportLab from 1500 records seems to hit a DeadlineExceededError. This is just a simple PDF consisting of a table.

I am using the following code:

    # ReportLab imports assumed at module level:
    # from reportlab.lib import colors
    # from reportlab.lib.pagesizes import landscape, letter
    # from reportlab.lib.units import inch
    # from reportlab.platypus import SimpleDocTemplate, Table, TableStyle

    self.response.headers['Content-Type'] = 'application/pdf'
    self.response.headers['Content-Disposition'] = 'attachment; filename=output.pdf'
    doc = SimpleDocTemplate(self.response.out, pagesize=landscape(letter))

    elements = []

    dataset = Voter.all().order('addr_str')

    # Header row; the trailing columns stay blank so they can be filled in by hand.
    data = [['#', 'STREET', 'UNIT', 'PROFILE', 'PHONE', 'NAME', 'REPLY', 'YS', 'VOL', 'NOTES', 'MAIN ISSUE']]

    i = 0      # offset into the query
    r = 1      # total row count (header included), used for the row heights
    s = 100    # fetch size

    while i < 1500:
        voters = dataset.fetch(s, offset=i)
        for voter in voters:
            # Pad with empty strings so every row has the same 11 columns as the header.
            data.append([voter.addr_num, voter.addr_str, voter.addr_unit_num, '',
                         voter.phone,
                         voter.firstname + ' ' + voter.middlename + ' ' + voter.lastname,
                         '', '', '', '', ''])
            r += 1
        i += s

    t = Table(data, None, r * [0.4 * inch], repeatRows=1)
    t.setStyle(TableStyle([('ALIGN', (0, 0), (-1, -1), 'CENTER'),
                           ('INNERGRID', (0, 0), (-1, -1), 0.15, colors.black),
                           ('BOX', (0, 0), (-1, -1), 0.15, colors.black),
                           ('FONTSIZE', (0, 0), (-1, -1), 8)]))

    elements.append(t)

    doc.build(elements)

Nothing particularly fancy, but it chokes. Is there a better way to do this? If I could write to some kind of file system, generate the file in bits, and then rejoin them, that might work, but I think the system precludes this.

I need to do the same thing for a CSV file; however, the limit is obviously a bit higher since it's just raw output.

    self.response.headers['Content-Type'] = 'application/csv'
    self.response.headers['Content-Disposition'] = 'attachment; filename=output.csv'

    dataset = Voter.all().order('addr_str')

    writer = csv.writer(self.response.out, dialect='excel')
    writer.writerow(['#', 'STREET', 'UNIT', 'PROFILE', 'PHONE', 'NAME', 'REPLY', 'YS', 'VOL', 'NOTES', 'MAIN ISSUE'])

    i = 0      # number of records fetched so far
    s = 100    # fetch size
    while i < 2000:
        # Resume from the cursor saved by the previous iteration, if there is one.
        last_cursor = memcache.get('db_cursor')
        if last_cursor:
            dataset.with_cursor(last_cursor)
        voters = dataset.fetch(s)
        for voter in voters:
            writer.writerow([voter.addr_num, voter.addr_str, voter.addr_unit_num, '',
                             voter.phone,
                             voter.firstname + ' ' + voter.middlename + ' ' + voter.lastname])
        memcache.set('db_cursor', dataset.cursor())
        i += s
    memcache.delete('db_cursor')

Any suggestions would be very much appreciated.

Edit: possible solutions

Below I have documented three possible solutions based on my research, plus suggestions from the discussion below.

They aren't necessarily mutually exclusive, and could be a slight variation or combination of any of the three; however, the gist of each solution is there. Let me know which one you think makes the most sense and might perform the best.

Solution A: Using mapreduce (or tasks), serialize each record and create a memcache entry for each individual record, keyed by its key_name. Then process these items individually into the PDF/XLS file (using get_multi and set_multi; a rough sketch follows after Solution C).

Solution B: Using tasks, serialize groups of records and store them in the datastore as blobs. Then, once all records are processed, trigger a task that loads each blob, deserializes it, and writes the data into the final file.

Solution C: Using mapreduce, retrieve the key_names and store them as a list or a serialized blob. Then load the records by key, which would be faster than the current loading method. If I were to do this, which would be better: storing them as a list (and what would the limitations be... I presume a list of 100,000 would be beyond the capabilities of the datastore) or as a serialized blob (or small chunks which I then concatenate or process)?
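
As a rough illustration of the memcache batching in Solution A (a sketch only -- the 'voter:' key prefix, the field separator, and the helper names are arbitrary choices, not anything GAE requires):

    from google.appengine.api import memcache
    from google.appengine.ext import db

    def prime_cache(voter_keys):
        # Serialize a batch of voters and store them in one set_multi call.
        mapping = {}
        for voter in db.get(voter_keys):
            fields = [voter.addr_num, voter.addr_str, voter.addr_unit_num, '',
                      voter.phone,
                      voter.firstname + ' ' + voter.middlename + ' ' + voter.lastname]
            mapping[str(voter.key().id_or_name())] = u'\x1f'.join(
                unicode(f or u'') for f in fields)
        memcache.set_multi(mapping, key_prefix='voter:')

    def load_rows(key_names):
        # Pull the cached rows back in one get_multi call; misses simply come back absent.
        cached = memcache.get_multi(key_names, key_prefix='voter:')
        return [cached[k].split(u'\x1f') for k in key_names if k in cached]
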

Thanks in advance for any advice.

+1  A: 

Some time ago I faced the same problem with GAE. After many attempts I just moved to another web host, since I had that option. Nevertheless, before moving I had two ideas for how to resolve it. I haven't implemented them, but you may want to try.

The first idea is to use a SOA/RESTful service on another server, if that is possible. You could even create another GAE application in Java, do all the work there (I guess with Java's PDFBox it will take much less time to generate the PDF), and return the result to the Python app. But this option requires you to know Java, and it also splits your app into several parts with terrible modularity.

So, there's another approach: you can play a "ping-pong" game with the user's browser. The idea is that if you cannot do everything in a single request, force the browser to send you several. During the first request, do only the part of the work that fits within the 30-second limit, then save the state and generate a 'ticket' -- a unique identifier for the 'job'. Finally, send the user a response that is a simple page with a redirect back to your app, parametrized by the job ticket. When you get it, just restore the state and proceed with the next part of the job.
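
A rough sketch of that ping-pong pattern on the old webapp framework (the Job model, the ticket scheme and do_some_work() are made up for illustration; only the save-state-and-bounce-back flow is the point):

    import uuid
    from google.appengine.ext import db
    from google.appengine.ext import webapp

    class Job(db.Model):
        cursor = db.TextProperty()          # where to resume the query
        partial_output = db.BlobProperty()  # work finished so far
        done = db.BooleanProperty(default=False)

    class ReportHandler(webapp.RequestHandler):
        def get(self):
            ticket = self.request.get('ticket')
            if not ticket:
                ticket = uuid.uuid4().hex
                Job(key_name=ticket).put()
            job = Job.get_by_key_name(ticket)
            do_some_work(job)   # hypothetical: process one slice that fits the deadline
            job.put()
            if job.done:
                self.response.headers['Content-Type'] = 'application/pdf'
                self.response.out.write(job.partial_output)
            else:
                # a simple page that immediately sends the browser back with the ticket
                self.response.out.write(
                    '<html><head><meta http-equiv="refresh" '
                    'content="0; url=/report?ticket=%s"></head></html>' % ticket)
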

Andrei
Great suggestions - just don't know if I have the time to implement them. I'm working on figuring out a few different options. I'll keep you updated!
etc
BTW, the only problem with your last approach is the final generation of the PDF file, which has to happen in one process. It seems to be timing out when there are more than 1500 records. :/
etc
Though I'm not familiar with ReportLab, I guess you can generate several parts of the PDF separately and then concatenate them. Even if you cannot merge several tables, you can still make a table of tables, which will look the same.
Andrei
+2  A: 

Here is one quick thought, assuming it is crapping out fetching from the datastore. You could use tasks and cursors to fetch the data in smaller chunks, then do the generation at the end.

Start a task which does the initial query and fetches 300 (an arbitrary number) records, then enqueues a named (important!) task that you pass the cursor to. That one in turn queries [your arbitrary number] records and then passes the cursor on to a new named task as well. Continue until you have enough records.

Within each task process the entities, then store the serialized result in a text or blob property on a 'processing' model. I would make the model's key_name the same as the task that created it. Keep in mind the serialized data will need to be under the API call size limit.
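
A sketch of that chained, named-task pattern (VoterChunk, the task-name scheme and CHUNK_SIZE are illustrative choices, not part of the question's code):

    from google.appengine.api import taskqueue
    from google.appengine.ext import db
    from google.appengine.ext import webapp

    CHUNK_SIZE = 300

    class VoterChunk(db.Model):
        data = db.TextProperty()   # serialized rows for one chunk

    class ChunkWorker(webapp.RequestHandler):
        def post(self):
            chunk_no = int(self.request.get('chunk', '0'))
            cursor = self.request.get('cursor')

            query = Voter.all().order('addr_str')
            if cursor:
                query.with_cursor(cursor)
            voters = query.fetch(CHUNK_SIZE)

            rows = []
            for voter in voters:
                fields = [voter.addr_num, voter.addr_str, voter.addr_unit_num, '',
                          voter.phone,
                          voter.firstname + ' ' + voter.middlename + ' ' + voter.lastname]
                rows.append(u'\x1f'.join(unicode(f or u'') for f in fields))

            # key_name matches the task name so the build step can find every chunk
            VoterChunk(key_name='report-chunk-%d' % chunk_no,
                       data=u'\x1e'.join(rows)).put()

            if len(voters) == CHUNK_SIZE:
                # more to do: chain the next named task, passing the cursor along
                taskqueue.add(name='report-chunk-%d' % (chunk_no + 1),
                              url='/tasks/chunk',
                              params={'chunk': chunk_no + 1,
                                      'cursor': query.cursor()})
            else:
                # out of records: kick off the build step
                taskqueue.add(url='/tasks/build', params={'chunks': chunk_no + 1})
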

To serialize your table pretty fast you could use:

    serialized_data = "\x1e".join("\x1f".join(voter) for voter in data)

Have the last task (when you have enough records) kick off the PDF or CSV generation. If you use key_names for your models, you should be able to grab all of the entities with encoded data by key. Fetches by key are pretty fast, and you'll know the models' keys since you know the last task name. Again, you'll want to be mindful of the size of your fetches from the datastore!

To deserialize:

    list(voter.split('\x1f') for voter in serialized_data.split('\x1e'))

Now run your PDF / CSV generation on the data. If splitting up the datastore fetches alone does not help you'll have to look into doing more of the processing in each task.

Don't forget in the 'build' task you'll want to raise an exception if any of the interim models are not yet present. Your final task will automatically retry.
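
Putting the build step together (a sketch; it assumes the VoterChunk model and the \x1e / \x1f separators from the chunk-worker sketch above, and generate_report() stands in for your existing ReportLab/csv code):

    from google.appengine.ext import webapp

    class BuildWorker(webapp.RequestHandler):
        def post(self):
            count = int(self.request.get('chunks'))
            key_names = ['report-chunk-%d' % n for n in range(count)]
            chunks = VoterChunk.get_by_key_name(key_names)

            if None in chunks:
                # an interim chunk is missing -- raise so the task queue retries us later
                raise Exception('chunk not ready yet')

            data = []
            for chunk in chunks:
                for row in chunk.data.split(u'\x1e'):
                    data.append(row.split(u'\x1f'))

            # hand the rebuilt rows to whatever actually writes the PDF/CSV
            generate_report(data)
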

Robert Kluin
Wouldn't cycling through the results with a limit essentially be the same thing? I don't believe it's the data pull that's generating the timeout - although you could be correct. If I remember correctly, there is a 30 second timeout on the process, but a 10 second timeout on requests. 30 seconds should be more than enough time to actually process 1500-2000 records and output them in a PDF. This is timing out a lot quicker than that.
etc
My suggestion here was meant to partially separate the data retrieval from the processing. On App Engine fetching 1,500 entities could *easily* consume several *seconds* of processing time. So yes, if you have not already tried I would definitely try running this with a limit of 10. Also, if you have not already, you should use Appstats (http://code.google.com/appengine/docs/python/tools/appstats.html) and try to determine exactly what is taking so much time.
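
For reference, wiring up Appstats is only a few lines in appengine_config.py (a sketch; the Appstats admin console handler also needs to be mapped in app.yaml, per the linked docs):

    # appengine_config.py -- minimal Appstats hookup
    def webapp_add_wsgi_middleware(app):
        from google.appengine.ext.appstats import recording
        return recording.appstats_wsgi_middleware(app)
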
Robert Kluin
Great idea - that just got released with the new version, didn't it? I haven't tried it out yet, but I'm going to do it now.
etc
For what it's worth, it looks like appstats only logs requests that complete...so timeouts don't seem to get logged. Sucks for debugging intense processes :(
etc
BTW: I created a loop so that it was only pulling 100 records at a time, and it seemed to get past pulling the records; however, it times out on the actual creation of the PDF... while it's calculating height etc., it poops out. I'm going to keep playing with it and see if maybe manually setting the height works. I'll report back with the results.
etc
You do need to use cursors - fetches with an offset have to internally skip over the first records, which is inefficient. This is why cursors were provided in the first place.
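
For example, the offset loop from the question could be rewritten with cursors roughly like this (a sketch, reusing the csv writer set up in the question's CSV snippet):

    query = Voter.all().order('addr_str')
    fetched = 0
    while fetched < 1500:
        voters = query.fetch(100)
        if not voters:
            break
        for voter in voters:
            writer.writerow([voter.addr_num, voter.addr_str, voter.addr_unit_num, '',
                             voter.phone,
                             voter.firstname + ' ' + voter.middlename + ' ' + voter.lastname])
        fetched += len(voters)
        # resume right after the last entity fetched, instead of re-skipping with an offset
        query.with_cursor(query.cursor())
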
Nick Johnson
Appstats has been around for some time, but they made it part of the SDK a couple of releases ago. I thought it would only log data for a completed request, but if you make the number of records small, you should still get valuable data about what is taking all the time. As Nick said, you'll want to use cursors for fetching entities across runs. You also want to pull as many records as you can at a time; it will be faster since you minimize external calls.
Robert Kluin
I've changed my code to utilize cursors instead of offsets. I'm going to use a version of your "caching" idea, but with memcache instead of the datastore. I'll let you know how it works, and post the code for reference.
etc
I've been thinking - is this something that I should attempt to use mapreduce for, or is queueing really the best bet. Isn't mapreduce used mainly for datastore manipulations? If I'm using memcache to cache the items, would it be better to just chain tasks, like you suggest? Also, two more questions. Would it be most efficient if we were to preload the key names of the records we want to access into a datastore list using mapreduce, and then prime the cache using task queues? How would you keep the session going for the end user? Keep redirecting them instead of queueing them in the bg?
etc
I would personally start with task chaining; I think it will be easier to get a working solution. Using memcache is a good idea. However, be careful; I would back it up with the datastore. Otherwise you could lose a chunk of your data and have to regenerate it. If you know the key_names in advance, you could pull the records more efficiently. I would just start as simple as you can, then look for ways to optimize. Perhaps you could redirect the user to a status page that gives them some feedback and eventually a link to the final report (or report generation handler).
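
A rough sketch of that status-page idea (ReportJob is a hypothetical model that the background tasks mark as done):

    from google.appengine.ext import db
    from google.appengine.ext import webapp

    class ReportJob(db.Model):              # hypothetical job-tracking model
        done = db.BooleanProperty(default=False)

    class ReportStatusHandler(webapp.RequestHandler):
        def get(self):
            ticket = self.request.get('ticket')
            job = ReportJob.get_by_key_name(ticket)
            if job and job.done:
                self.response.out.write(
                    '<a href="/report/download?ticket=%s">Download your report</a>' % ticket)
            else:
                # poll every few seconds until the background tasks finish
                self.response.out.write(
                    '<html><head><meta http-equiv="refresh" content="5"></head>'
                    '<body>Still working...</body></html>')
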
Robert Kluin
I've uploaded an image with three potential solutions. What do you think? For Solution C to work, I would have to generate a list of the key_names (I could potentially do this with mapreduce)... is there a better way to retrieve just the key_names of applicable records? I guess I could queue up a "key generation" task, and then use another task to actually load the data. I can always programmatically trigger mapreduce routines. How would I "back it up" with the datastore? Possibly compare a key list, and pull any records from the datastore that don't have an accompanying memcache element?
etc
BTW, thank you very much for all your help. Hopefully once I figure out a good solution, I can post and share the source code for others to use. It seems like this is something that many people would need solutions for.
etc
The Mapper API currently has an important limitation -- you're stuck going over everything. So if you have a usable index, you'll likely be better off using cursors; it totally eliminates an unneeded step. I would not use A; it is not clear it buys you anything other than the risk of cache misses. A combination of B and C might work if you have to go over *every* record anyway. You could make a small twist on B: check into incrementally generating your actual output. CSV is easy -- just adapt my suggested serializer. For PDF, maybe pickle could help you store interim parts of the PDF?
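
One way to read that incremental twist for CSV (a sketch -- CsvChunk and the helper names are made up): have each task store finished CSV text, so the last step is just concatenation:

    import csv
    import StringIO
    from google.appengine.ext import db

    class CsvChunk(db.Model):
        data = db.BlobProperty()   # finished CSV text for one chunk of records

    def store_csv_chunk(chunk_no, voters):
        buf = StringIO.StringIO()
        writer = csv.writer(buf, dialect='excel')
        for voter in voters:
            writer.writerow([voter.addr_num, voter.addr_str, voter.addr_unit_num, '',
                             voter.phone,
                             voter.firstname + ' ' + voter.middlename + ' ' + voter.lastname])
        CsvChunk(key_name='csv-chunk-%d' % chunk_no, data=db.Blob(buf.getvalue())).put()

    def build_csv(chunk_count):
        # missing chunks come back as None -- handle/retry as in the build task above
        chunks = CsvChunk.get_by_key_name(['csv-chunk-%d' % n for n in range(chunk_count)])
        return ''.join(chunk.data for chunk in chunks)
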
Robert Kluin
Good point regarding the Mapper API. I wonder if there is any way to limit the mapper's run with a query. The reason I thought of A is the performance benefit of pulling pre-cached versions of the records from memcache. That way the datastore loading is done in the background, and the actual generation of the report is done from memcache. What is pickle?
etc
The biggest issue using memcache is that there *will* (at some point) be cache misses, so you'll need to design for that. You would probably need to dig into the Mapper source to make it accept a user-specified query. Pickle (http://docs.python.org/library/pickle.html) serializes and deserializes Python objects. Perhaps you could use it to store partially generated ReportLab Tables, or something else that could speed up the final build step.
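
For instance (illustrative only -- InterimData is a made-up model, and 'data' is the row list built earlier):

    import pickle
    from google.appengine.ext import db

    class InterimData(db.Model):
        payload = db.BlobProperty()

    # dump the interim rows (or any picklable object) into a blob property...
    InterimData(key_name='report-part-1',
                payload=db.Blob(pickle.dumps(data, 2))).put()

    # ...and load them back in the final build step
    rows = pickle.loads(InterimData.get_by_key_name('report-part-1').payload)
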
Robert Kluin