Say I had over 10,000 feeds that I wanted to periodically fetch/parse. If the period were, say, 1h, that would be 24 x 10,000 = 240,000 fetches per day.

The current 10k limit of the labs taskqueue API would preclude setting up one task per fetch. How then would one do this?

Update: re: fetching n URLs per task: given the 30-second timeout per request, at some point this would hit a ceiling. Is there any way to parallelize it, so that each taskqueue task initiates a bunch of async parallel fetches, each of which takes less than 30 seconds to finish, even though the lot together may take longer than that?

+2  A: 

2 fetches per task? 3?

Will Hartung
A: 

Group up the fetches, so instead of queuing 1 fetch you queue up, say, a work unit that does 10 fetches.
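A minimal sketch of that batching approach using the labs taskqueue API (the /worker/fetch handler path and the do_fetch() helper are hypothetical, not from the question):

    from google.appengine.api.labs import taskqueue
    from google.appengine.ext import webapp

    BATCH_SIZE = 10  # feeds per work unit

    def enqueue_fetch_tasks(feed_urls):
        # 10,000 feeds batched 10 at a time is only 1,000 tasks per run,
        # comfortably under the 10k task limit.
        for i in range(0, len(feed_urls), BATCH_SIZE):
            batch = feed_urls[i:i + BATCH_SIZE]
            taskqueue.add(url='/worker/fetch',
                          params={'urls': '\n'.join(batch)})

    class FetchWorker(webapp.RequestHandler):
        def post(self):
            # Each task fetches its whole batch within one request.
            for url in self.request.get('urls').split('\n'):
                do_fetch(url)  # hypothetical fetch-and-parse helper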

nos
please see the update to the question.
molicule
+3  A: 

Here's the asynchronous urlfetch API:

http://code.google.com/appengine/docs/python/urlfetch/asynchronousrequests.html

Set off a bunch of requests with a reasonable deadline (give yourself some headroom under your timeout, so that if one request times out you still have time to process the others). Then wait on each one in turn and process them as they complete.

I haven't used this technique myself in GAE, so you're on your own finding any non-obvious gotchas. Sadly there doesn't seem to be a select() style call in the API to wait for the first of several requests to complete.
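A rough sketch of that pattern, using the create_rpc()/make_fetch_call() calls from the linked docs (parse_feed() is a placeholder for whatever processing you do):

    from google.appengine.api import urlfetch

    def fetch_all(urls):
        # Start every fetch up front; they run in parallel on the service side.
        rpcs = []
        for url in urls:
            rpc = urlfetch.create_rpc(deadline=20)  # headroom under the 30s limit
            urlfetch.make_fetch_call(rpc, url)
            rpcs.append(rpc)

        # Wait on each RPC in turn; total wall-clock time is roughly that
        # of the slowest fetch, not the sum of all of them.
        for url, rpc in zip(urls, rpcs):
            try:
                result = rpc.get_result()  # blocks until this RPC is done
                if result.status_code == 200:
                    parse_feed(url, result.content)  # placeholder
            except urlfetch.DownloadError:
                pass  # this fetch failed or timed out; skip it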

Steve Jessop
So if I'm reading the async docs correctly, after a bunch of RPCs are initiated and closed out with "for rpc in rpcs: rpc.wait()", the original taskqueue task returns (satisfying the 30s timeout for HTTP calls). The async urlfetches, however, keep working away (headless) and are processed by the callbacks associated with them. Is that right?
molicule
No, wait() calls the callback before it returns.
Steve Jessop
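
To make that concrete, here is the callback pattern from the linked docs (process_result() is a placeholder). The callback runs synchronously inside wait()/get_result(), within the same request; nothing keeps running after the request returns:

    from google.appengine.api import urlfetch

    def create_callback(rpc):
        def callback():
            # Invoked from inside rpc.wait(), still within the 30s request.
            process_result(rpc.get_result())  # placeholder
        return callback

    rpcs = []
    for url in urls:
        rpc = urlfetch.create_rpc(deadline=20)
        rpc.callback = create_callback(rpc)
        urlfetch.make_fetch_call(rpc, url)
        rpcs.append(rpc)

    for rpc in rpcs:
        rpc.wait()  # runs the callback before returning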