When I first started developing this project, there was no requirement for generating large files; however, it is now a deliverable.

Long story short, GAE just doesn't play nice with any large-scale data manipulation or content generation. The lack of file storage aside, even something as simple as generating a PDF with ReportLab from 1500 records seems to hit a DeadlineExceededError. This is just a simple PDF consisting of a table.

I am using the following code:

    # ReportLab imports assumed at module level:
    # from reportlab.lib import colors
    # from reportlab.lib.pagesizes import landscape, letter
    # from reportlab.lib.units import inch
    # from reportlab.platypus import SimpleDocTemplate, Table, TableStyle

    self.response.headers['Content-Type'] = 'application/pdf'
    self.response.headers['Content-Disposition'] = 'attachment; filename=output.pdf'
    doc = SimpleDocTemplate(self.response.out, pagesize=landscape(letter))

    elements = []

    dataset = Voter.all().order('addr_str')

    # Header row; the trailing columns stay blank so they can be filled in by hand.
    data = [['#', 'STREET', 'UNIT', 'PROFILE', 'PHONE', 'NAME', 'REPLY', 'YS', 'VOL', 'NOTES', 'MAIN ISSUE']]

    i = 0      # offset into the query
    r = 1      # total row count (header included), used for the row heights
    s = 100    # fetch size

    while i < 1500:
        voters = dataset.fetch(s, offset=i)
        for voter in voters:
            # Pad with empty strings so every row has the same 11 columns as the header.
            data.append([voter.addr_num, voter.addr_str, voter.addr_unit_num, '',
                         voter.phone,
                         voter.firstname + ' ' + voter.middlename + ' ' + voter.lastname,
                         '', '', '', '', ''])
            r += 1
        i += s

    t = Table(data, None, r * [0.4 * inch], repeatRows=1)
    t.setStyle(TableStyle([('ALIGN', (0, 0), (-1, -1), 'CENTER'),
                           ('INNERGRID', (0, 0), (-1, -1), 0.15, colors.black),
                           ('BOX', (0, 0), (-1, -1), 0.15, colors.black),
                           ('FONTSIZE', (0, 0), (-1, -1), 8)]))

    elements.append(t)

    doc.build(elements)

Nothing particularly fancy, but it chokes. Is there a better way to do this? If I could write to some kind of file system, generate the file in bits, and then rejoin them, that might work, but I think the system precludes this.

I need to do the same thing for a CSV file; however, the limit is obviously a bit higher since it's just raw output.

    self.response.headers['Content-Type'] = 'application/csv'
    self.response.headers['Content-Disposition'] = 'attachment; filename=output.csv'

    dataset = Voter.all().order('addr_str')

    writer = csv.writer(self.response.out, dialect='excel')
    writer.writerow(['#', 'STREET', 'UNIT', 'PROFILE', 'PHONE', 'NAME', 'REPLY', 'YS', 'VOL', 'NOTES', 'MAIN ISSUE'])

    i = 0      # number of records fetched so far
    s = 100    # fetch size
    while i < 2000:
        # Resume from the cursor saved by the previous iteration, if there is one.
        last_cursor = memcache.get('db_cursor')
        if last_cursor:
            dataset.with_cursor(last_cursor)
        voters = dataset.fetch(s)
        for voter in voters:
            writer.writerow([voter.addr_num, voter.addr_str, voter.addr_unit_num, '',
                             voter.phone,
                             voter.firstname + ' ' + voter.middlename + ' ' + voter.lastname])
        memcache.set('db_cursor', dataset.cursor())
        i += s
    memcache.delete('db_cursor')

Any suggestions would be very much appreciated.

Edit: possible solutions

Below I have documented three possible solutions based on my research, plus suggestions from the discussion below.

They aren't necessarily mutually exclusive, and could be a slight variation or combination of any of the three; however, the gist of each solution is there. Let me know which one you think makes the most sense and might perform the best.

Solution A: Using mapreduce (or tasks), serialize each record and create a memcache entry for each individual record, keyed by its key_name. Then process these items individually into the PDF/XLS file (using get_multi and set_multi; a rough sketch follows after Solution C).

Solution B: Using tasks, serialize groups of records and store them in the datastore as blobs. Then, once all records are processed, trigger a task that loads each blob, deserializes it, and writes the data into the final file.

Solution C: Using mapreduce, retrieve the key_names and store them as a list or a serialized blob. Then load the records by key, which would be faster than the current loading method. If I were to do this, which would be better: storing them as a list (and what would the limitations be... I presume a list of 100,000 would be beyond the capabilities of the datastore) or as a serialized blob (or small chunks which I then concatenate or process)?
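
As a rough illustration of the memcache batching in Solution A (a sketch only -- the 'voter:' key prefix, the field separator, and the helper names are arbitrary choices, not anything GAE requires):

    from google.appengine.api import memcache
    from google.appengine.ext import db

    def prime_cache(voter_keys):
        # Serialize a batch of voters and store them in one set_multi call.
        mapping = {}
        for voter in db.get(voter_keys):
            fields = [voter.addr_num, voter.addr_str, voter.addr_unit_num, '',
                      voter.phone,
                      voter.firstname + ' ' + voter.middlename + ' ' + voter.lastname]
            mapping[str(voter.key().id_or_name())] = u'\x1f'.join(
                unicode(f or u'') for f in fields)
        memcache.set_multi(mapping, key_prefix='voter:')

    def load_rows(key_names):
        # Pull the cached rows back in one get_multi call; misses simply come back absent.
        cached = memcache.get_multi(key_names, key_prefix='voter:')
        return [cached[k].split(u'\x1f') for k in key_names if k in cached]
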

Thanks in advance for any advice.

+1  A: 

Some time ago I faced the same problem with GAE. After many attempts I just moved to another web host, since I had that option. Nevertheless, before moving I had two ideas for how to resolve it. I haven't implemented them, but you may want to try.

The first idea is to use a SOA/RESTful service on another server, if that is possible. You could even create another GAE application in Java, do all the work there (I guess with Java's PDFBox it will take much less time to generate the PDF), and return the result to the Python app. But this option requires you to know Java, and it also splits your app into several parts with terrible modularity.

So, there's another approach: you can play a "ping-pong" game with the user's browser. The idea is that if you cannot do everything in a single request, force the browser to send you several. During the first request, do only the part of the work that fits within the 30-second limit, then save the state and generate a 'ticket' -- a unique identifier for the 'job'. Finally, send the user a response that is a simple page with a redirect back to your app, parametrized by the job ticket. When you get it, just restore the state and proceed with the next part of the job.
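
A rough sketch of that ping-pong pattern on the old webapp framework (the Job model, the ticket scheme and do_some_work() are made up for illustration; only the save-state-and-bounce-back flow is the point):

    import uuid
    from google.appengine.ext import db
    from google.appengine.ext import webapp

    class Job(db.Model):
        cursor = db.TextProperty()          # where to resume the query
        partial_output = db.BlobProperty()  # work finished so far
        done = db.BooleanProperty(default=False)

    class ReportHandler(webapp.RequestHandler):
        def get(self):
            ticket = self.request.get('ticket')
            if not ticket:
                ticket = uuid.uuid4().hex
                Job(key_name=ticket).put()
            job = Job.get_by_key_name(ticket)
            do_some_work(job)   # hypothetical: process one slice that fits the deadline
            job.put()
            if job.done:
                self.response.headers['Content-Type'] = 'application/pdf'
                self.response.out.write(job.partial_output)
            else:
                # a simple page that immediately sends the browser back with the ticket
                self.response.out.write(
                    '<html><head><meta http-equiv="refresh" '
                    'content="0; url=/report?ticket=%s"></head></html>' % ticket)
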

Andrei
Great suggestions - just don't know if I have the time to implement them. I'm working on figuring out a few different options. I'll keep you updated!
etc
BTW, the only problem with your last approach is the final generation of the PDF file, which has to happen in one process. It seems to be timing out when there are more than 1500 records. :/
etc
Though I'm not familiar with ReportLab, I guess you can generate several parts of the PDF separately and then concatenate them. Even if you cannot merge several tables, you can still make a table of tables, which will look the same.
Andrei
+2  A: 

Here is one quick thought, assuming it is crapping out fetching from the datastore. You could use tasks and cursors to fetch the data in smaller chunks, then do the generation at the end.

Start a task which does the initial query and fetches 300 (an arbitrary number) records, then enqueues a named (important!) task that you pass the cursor to. That one in turn queries [your arbitrary number] records and then passes the cursor on to a new named task as well. Continue until you have enough records.

Within each task process the entities, then store the serialized result in a text or blob property on a 'processing' model. I would make the model's key_name the same as the task that created it. Keep in mind the serialized data will need to be under the API call size limit.
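
A sketch of that chained, named-task pattern (VoterChunk, the task-name scheme and CHUNK_SIZE are illustrative choices, not part of the question's code):

    from google.appengine.api import taskqueue
    from google.appengine.ext import db
    from google.appengine.ext import webapp

    CHUNK_SIZE = 300

    class VoterChunk(db.Model):
        data = db.TextProperty()   # serialized rows for one chunk

    class ChunkWorker(webapp.RequestHandler):
        def post(self):
            chunk_no = int(self.request.get('chunk', '0'))
            cursor = self.request.get('cursor')

            query = Voter.all().order('addr_str')
            if cursor:
                query.with_cursor(cursor)
            voters = query.fetch(CHUNK_SIZE)

            rows = []
            for voter in voters:
                fields = [voter.addr_num, voter.addr_str, voter.addr_unit_num, '',
                          voter.phone,
                          voter.firstname + ' ' + voter.middlename + ' ' + voter.lastname]
                rows.append(u'\x1f'.join(unicode(f or u'') for f in fields))

            # key_name matches the task name so the build step can find every chunk
            VoterChunk(key_name='report-chunk-%d' % chunk_no,
                       data=u'\x1e'.join(rows)).put()

            if len(voters) == CHUNK_SIZE:
                # more to do: chain the next named task, passing the cursor along
                taskqueue.add(name='report-chunk-%d' % (chunk_no + 1),
                              url='/tasks/chunk',
                              params={'chunk': chunk_no + 1,
                                      'cursor': query.cursor()})
            else:
                # out of records: kick off the build step
                taskqueue.add(url='/tasks/build', params={'chunks': chunk_no + 1})
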

To serialize your table pretty fast you could use:

    serialized_data = "\x1e".join("\x1f".join(voter) for voter in data)

Have the last task (when you have enough records) kick off the PDF or CSV generation. If you use key_names for your models, you should be able to grab all of the entities with encoded data by key. Fetches by key are pretty fast, and you'll know the models' keys since you know the last task name. Again, you'll want to be mindful of the size of your fetches from the datastore!

To deserialize:

    list(voter.split('\x1f') for voter in serialized_data.split('\x1e'))

Now run your PDF / CSV generation on the data. If splitting up the datastore fetches alone does not help you'll have to look into doing more of the processing in each task.

Don't forget in the 'build' task you'll want to raise an exception if any of the interim models are not yet present. Your final task will automatically retry.
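
Putting the build step together (a sketch; it assumes the VoterChunk model and the \x1e / \x1f separators from the chunk-worker sketch above, and generate_report() stands in for your existing ReportLab/csv code):

    from google.appengine.ext import webapp

    class BuildWorker(webapp.RequestHandler):
        def post(self):
            count = int(self.request.get('chunks'))
            key_names = ['report-chunk-%d' % n for n in range(count)]
            chunks = VoterChunk.get_by_key_name(key_names)

            if None in chunks:
                # an interim chunk is missing -- raise so the task queue retries us later
                raise Exception('chunk not ready yet')

            data = []
            for chunk in chunks:
                for row in chunk.data.split(u'\x1e'):
                    data.append(row.split(u'\x1f'))

            # hand the rebuilt rows to whatever actually writes the PDF/CSV
            generate_report(data)
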

Robert Kluin
Wouldn't cycling through the results with a limit essentially be the same thing? I don't believe it's the data pull that's generating the timeout - although you could be correct. If I remember correctly, there is a 30 second timeout on the process, but a 10 second timeout on requests. 30 seconds should be more than enough time to actually process 1500-2000 records and output them in a PDF. This is timing out a lot quicker than that.
etc
My suggestion here was meant to partially separate the data retrieval from the processing. On App Engine fetching 1,500 entities could *easily* consume several *seconds* of processing time. So yes, if you have not already tried I would definitely try running this with a limit of 10. Also, if you have not already, you should use Appstats (http://code.google.com/appengine/docs/python/tools/appstats.html) and try to determine exactly what is taking so much time.
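
For reference, wiring up Appstats is only a few lines in appengine_config.py (a sketch; the Appstats admin console handler also needs to be mapped in app.yaml, per the linked docs):

    # appengine_config.py -- minimal Appstats hookup
    def webapp_add_wsgi_middleware(app):
        from google.appengine.ext.appstats import recording
        return recording.appstats_wsgi_middleware(app)
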
Robert Kluin
Great idea - that just got released with the new version, didn't it? I haven't tried it out yet, but I'm going to do it now.
etc
For what it's worth, it looks like appstats only logs requests that complete...so timeouts don't seem to get logged. Sucks for debugging intense processes :(
etc
BTW: I created a loop so that it was only pulling 100 records at a time, and it seemed to get past pulling the records; however, it times out on the actual creation of the PDF... while it's calculating height etc., it poops out. I'm going to keep playing with it and see if maybe manually setting the height works. I'll report back with the results.
etc
You do need to use cursors - fetches with an offset have to internally skip over the first records, which is inefficient. This is why cursors were provided in the first place.
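
For example, the offset loop from the question could be rewritten with cursors roughly like this (a sketch, reusing the csv writer set up in the question's CSV snippet):

    query = Voter.all().order('addr_str')
    fetched = 0
    while fetched < 1500:
        voters = query.fetch(100)
        if not voters:
            break
        for voter in voters:
            writer.writerow([voter.addr_num, voter.addr_str, voter.addr_unit_num, '',
                             voter.phone,
                             voter.firstname + ' ' + voter.middlename + ' ' + voter.lastname])
        fetched += len(voters)
        # resume right after the last entity fetched, instead of re-skipping with an offset
        query.with_cursor(query.cursor())
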
Nick Johnson
Appstats has been around for some time, but they made it part of the SDK a couple of releases ago. I thought it would only log data for a completed request, but if you make the number of records small, you should still get valuable data about what is taking all the time. As Nick said, you'll want to use cursors for fetching entities across runs. You also want to pull as many records as you can at a time; it will be faster since you minimize external calls.
Robert Kluin
I've changed my code to utilize cursors instead of offsets. I'm going to use a version of your "caching" idea, but with memcache instead of the datastore. I'll let you know how it works, and post the code for reference.
etc
I've been thinking - is this something that I should attempt to use mapreduce for, or is queueing really the best bet. Isn't mapreduce used mainly for datastore manipulations? If I'm using memcache to cache the items, would it be better to just chain tasks, like you suggest? Also, two more questions. Would it be most efficient if we were to preload the key names of the records we want to access into a datastore list using mapreduce, and then prime the cache using task queues? How would you keep the session going for the end user? Keep redirecting them instead of queueing them in the bg?
etc
I would personally start with task chaining; I think it will be easier to get a working solution. Using memcache is a good idea. However, be careful; I would back it up with the datastore. Otherwise you could lose a chunk of your data and have to regenerate it. If you know the key_names in advance, you could pull the records more efficiently. I would just start as simple as you can, then look for ways to optimize. Perhaps you could redirect the user to a status page that gives them some feedback and eventually a link to the final report (or report generation handler).
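
A rough sketch of that status-page idea (ReportJob is a hypothetical model that the background tasks mark as done):

    from google.appengine.ext import db
    from google.appengine.ext import webapp

    class ReportJob(db.Model):              # hypothetical job-tracking model
        done = db.BooleanProperty(default=False)

    class ReportStatusHandler(webapp.RequestHandler):
        def get(self):
            ticket = self.request.get('ticket')
            job = ReportJob.get_by_key_name(ticket)
            if job and job.done:
                self.response.out.write(
                    '<a href="/report/download?ticket=%s">Download your report</a>' % ticket)
            else:
                # poll every few seconds until the background tasks finish
                self.response.out.write(
                    '<html><head><meta http-equiv="refresh" content="5"></head>'
                    '<body>Still working...</body></html>')
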
Robert Kluin
I've uploaded an image with three potential solutions. What do you think? For Solution C to work, I would have to generate a list of the key_names (I could potentially do this with mapreduce)... is there a better way to retrieve just the key_names of applicable records? I guess I could queue up a "key generation" task, and then use another task to actually load the data. I can always programmatically trigger mapreduce routines. How would I "back it up" with the datastore? Possibly compare a key list, and pull any records from the datastore that don't have an accompanying memcache element?
etc
BTW, thank you very much for all your help. Hopefully once I figure out a good solution, I can post and share the source code for others to use. It seems like this is something that many people would need solutions for.
etc
The Mapper API currently has an important limitation -- you're stuck going over everything. So if you have a usable index, you'll likely be better off using cursors; it totally eliminates an unneeded step. I would not use A; it is not clear it buys you anything other than the risk of cache misses. A combination of B and C might work if you have to go over *every* record anyway. You could make a small twist on B: check into incrementally generating your actual output. CSV is easy -- just adapt my suggested serializer. For PDF, maybe pickle could help you store interim parts of the PDF?
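
One way to read that incremental twist for CSV (a sketch -- CsvChunk and the helper names are made up): have each task store finished CSV text, so the last step is just concatenation:

    import csv
    import StringIO
    from google.appengine.ext import db

    class CsvChunk(db.Model):
        data = db.BlobProperty()   # finished CSV text for one chunk of records

    def store_csv_chunk(chunk_no, voters):
        buf = StringIO.StringIO()
        writer = csv.writer(buf, dialect='excel')
        for voter in voters:
            writer.writerow([voter.addr_num, voter.addr_str, voter.addr_unit_num, '',
                             voter.phone,
                             voter.firstname + ' ' + voter.middlename + ' ' + voter.lastname])
        CsvChunk(key_name='csv-chunk-%d' % chunk_no, data=db.Blob(buf.getvalue())).put()

    def build_csv(chunk_count):
        # missing chunks come back as None -- handle/retry as in the build task above
        chunks = CsvChunk.get_by_key_name(['csv-chunk-%d' % n for n in range(chunk_count)])
        return ''.join(chunk.data for chunk in chunks)
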
Robert Kluin
Good point regarding the Mapper API. I wonder if there is any way to limit the mapper's run with a query. The reason I thought of A is the performance benefit of pulling pre-cached versions of the records from memcache. That way the datastore loading is done in the background, and the actual generation of the report is done from memcache. What is pickle?
etc
The biggest issue using memcache is that there *will* (at some point) be cache misses, so you'll need to design for that. You would probably need to dig into the Mapper source to make it accept a user-specified query. Pickle (http://docs.python.org/library/pickle.html) serializes and deserializes Python objects. Perhaps you could use it to store partially generated ReportLab Tables, or something else that could speed up the final build step.
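
For instance (illustrative only -- InterimData is a made-up model, and 'data' is the row list built earlier):

    import pickle
    from google.appengine.ext import db

    class InterimData(db.Model):
        payload = db.BlobProperty()

    # dump the interim rows (or any picklable object) into a blob property...
    InterimData(key_name='report-part-1',
                payload=db.Blob(pickle.dumps(data, 2))).put()

    # ...and load them back in the final build step
    rows = pickle.loads(InterimData.get_by_key_name('report-part-1').payload)
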
Robert Kluin