Say I had over 10,000 feeds that I wanted to periodically fetch/parse. If the period were, say, 1h, that would be 24 x 10,000 = 240,000 fetches per day.

The current 10k limit of the labs taskqueue API would preclude setting up one task per fetch. How then would one do this?

Update: re: fetching n URLs per task: given the 30-second timeout per request, at some point this would hit a ceiling. Is there any way to parallelize it, so that each taskqueue task initiates a bunch of async parallel fetches, each of which takes less than 30 seconds to finish, even though the lot together may take longer than that?

+2  A: 

2 fetches per task? 3?

Will Hartung
A: 

Group up the fetches, so instead of queuing 1 fetch you queue up, say, a work unit that does 10 fetches.
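A minimal sketch of that batching approach using the labs taskqueue API (the /worker/fetch handler path and the do_fetch() helper are hypothetical, not from the question):

    from google.appengine.api.labs import taskqueue
    from google.appengine.ext import webapp

    BATCH_SIZE = 10  # feeds per work unit

    def enqueue_fetch_tasks(feed_urls):
        # 10,000 feeds batched 10 at a time is only 1,000 tasks per run,
        # comfortably under the 10k task limit.
        for i in range(0, len(feed_urls), BATCH_SIZE):
            batch = feed_urls[i:i + BATCH_SIZE]
            taskqueue.add(url='/worker/fetch',
                          params={'urls': '\n'.join(batch)})

    class FetchWorker(webapp.RequestHandler):
        def post(self):
            # Each task fetches its whole batch within one request.
            for url in self.request.get('urls').split('\n'):
                do_fetch(url)  # hypothetical fetch-and-parse helper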

nos
please see the update to the question.
molicule
+3  A: 

Here's the asynchronous urlfetch API:

http://code.google.com/appengine/docs/python/urlfetch/asynchronousrequests.html

Set off a bunch of requests with a reasonable deadline (give yourself some headroom under your timeout, so that if one request times out you still have time to process the others). Then wait on each one in turn and process them as they complete.

I haven't used this technique myself in GAE, so you're on your own finding any non-obvious gotchas. Sadly there doesn't seem to be a select() style call in the API to wait for the first of several requests to complete.
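A rough sketch of that pattern, using the create_rpc()/make_fetch_call() calls from the linked docs (parse_feed() is a placeholder for whatever processing you do):

    from google.appengine.api import urlfetch

    def fetch_all(urls):
        # Start every fetch up front; they run in parallel on the service side.
        rpcs = []
        for url in urls:
            rpc = urlfetch.create_rpc(deadline=20)  # headroom under the 30s limit
            urlfetch.make_fetch_call(rpc, url)
            rpcs.append(rpc)

        # Wait on each RPC in turn; total wall-clock time is roughly that
        # of the slowest fetch, not the sum of all of them.
        for url, rpc in zip(urls, rpcs):
            try:
                result = rpc.get_result()  # blocks until this RPC is done
                if result.status_code == 200:
                    parse_feed(url, result.content)  # placeholder
            except urlfetch.DownloadError:
                pass  # this fetch failed or timed out; skip it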

Steve Jessop
So if I'm reading the async docs correctly, after a bunch of RPCs are initiated and closed out with "for rpc in rpcs: rpc.wait()", the original taskqueue task returns (satisfying the 30s timeout for HTTP calls). The async urlfetches, however, keep working away (headless) and are processed by the callbacks associated with them. Is that right?
molicule
No, wait() calls the callback before it returns.
Steve Jessop
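
To make that concrete, here is the callback pattern from the linked docs (process_result() is a placeholder). The callback runs synchronously inside wait()/get_result(), within the same request; nothing keeps running after the request returns:

    from google.appengine.api import urlfetch

    def create_callback(rpc):
        def callback():
            # Invoked from inside rpc.wait(), still within the 30s request.
            process_result(rpc.get_result())  # placeholder
        return callback

    rpcs = []
    for url in urls:
        rpc = urlfetch.create_rpc(deadline=20)
        rpc.callback = create_callback(rpc)
        urlfetch.make_fetch_call(rpc, url)
        rpcs.append(rpc)

    for rpc in rpcs:
        rpc.wait()  # runs the callback before returning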