views:

96

answers:

3

I have a big threaded feed retrieval script in python.

My question is, how can I load balance outgoing requests so that I don't hit any one host too often?

This is a big problem for feedburner, since a large percentage of sites proxy their RSS through feedburner. To further complicate matters, many sites will alias a subdomain on their own domain to feedburner to obscure the fact that they're using it (e.g. "mysite" sets its RSS url to feeds.mysite.com/mysite, where feeds.mysite.com bounces to feedburner). Sometimes it blocks me for a while and redirects to their "automated requests" error page.

+2  A: 

If your problem is related to Feedburner "throttling you", it almost certainly does this based on the source IP of your bot. The way to "load balance to Feedburner" would be to have multiple different source IPs to start from.

Now there are numerous ways to achieve this, two of them being:

  1. Multi-homed server: multiple IPs on the same machine
  2. Multiple discrete machines

Of course, don't go and put a NAT box in front of them now ;-)
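The multi-homed option can be sketched in Python by binding each outgoing connection to a different local address in round-robin order. A minimal sketch, assuming the machine really has these addresses (the IPs below are placeholders):

```python
# Sketch: spread outgoing requests across several local source IPs on a
# multi-homed server. The addresses are hypothetical placeholders.
import itertools
import http.client

SOURCE_IPS = ["192.0.2.10", "192.0.2.11"]  # hypothetical local addresses
ip_cycle = itertools.cycle(SOURCE_IPS)

def fetch(host, path="/"):
    # source_address binds the outgoing socket to the next local IP,
    # so successive requests appear to come from different addresses.
    conn = http.client.HTTPConnection(host, source_address=(next(ip_cycle), 0))
    conn.request("GET", path)
    return conn.getresponse().read()
```

Whether this helps depends entirely on the remote side throttling per-IP rather than, say, per User-Agent or per request pattern.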


The above takes care of the possible "throttling problems", now for the "scheduling part". You should maintain a "virtual scheduler" per "destination" and make sure not to exceed the parameters of the Web Service (e.g. Feedburner) in question. Now, the tricky part is to get hold of these "limits"... sometimes they are advertised and sometimes you need to figure them out experimentally.
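One way to sketch such a per-destination "virtual scheduler" is to track, per host, the earliest time the next request is allowed. The interval numbers here are made up; the real limits would have to be advertised or discovered experimentally, as described above:

```python
# Sketch of a per-destination scheduler: each host gets a minimum interval
# between requests. The per-host limits below are hypothetical.
import time
from collections import defaultdict

class HostScheduler:
    def __init__(self, default_interval=2.0):
        self.default_interval = default_interval
        self.intervals = {"feedburner.com": 10.0}  # made-up per-host limit
        self.next_allowed = defaultdict(float)     # host -> earliest next hit

    def wait_time(self, host, now=None):
        """Seconds to wait before 'host' may be hit again (0.0 if ready)."""
        now = time.time() if now is None else now
        return max(0.0, self.next_allowed[host] - now)

    def record_request(self, host, now=None):
        """Note that we just hit 'host' and push its next slot forward."""
        now = time.time() if now is None else now
        interval = self.intervals.get(host, self.default_interval)
        self.next_allowed[host] = now + interval
```

A worker thread would call `wait_time()` before each fetch and either sleep or move on to a feed whose host is ready.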

I understand this is "high level architectural guidelines" but I am not ready to be coding this for you... I hope you forgive me ;-)

jldupont
+1  A: 

"how can I load balance outgoing requests so that I don't hit any one host too often?"

Generally, you do this by designing a better algorithm.

For example, randomly scramble your requests.

Or shuffle them 'fairly' so that you round-robin through the sources. That would be a simple list of queues where you dequeue one request from each host in turn.
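That queue-per-host round-robin can be sketched as follows (the hostnames and paths are placeholders):

```python
# Sketch: one queue per host; yield one request from each host in turn
# so no single host is hit consecutively.
from collections import deque

def round_robin(queues):
    """Yield (host, request) pairs fairly from a dict of host -> deque."""
    while any(queues.values()):
        for host, q in queues.items():
            if q:
                yield host, q.popleft()

queues = {
    "a.example.com": deque(["/feed1", "/feed2"]),
    "b.example.com": deque(["/feedX"]),
}
order = list(round_robin(queues))
```

Here `order` interleaves the two hosts before returning to a.example.com for its second feed.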

S.Lott
... that won't help at all if the services on the other end "throttle by source IP".
jldupont
... which any sensible Web Service **should** do anyway. "Always manage your perimeter or else..."
jldupont
thing is, I'd need to intercept urllib at the point of DNS resolution to track the load on each host
ʞɔıu
+3  A: 

You should probably do a one-time request (per week/month, whatever fits) for each feed and follow redirects to get the "true" address. Regardless of your throttling situation at the time, you should be able to resolve all feeds, save that data and then just do it once for every new feed you add to the list. You can use the geturl() method on the response object that urllib's urlopen() returns, as it gives you the final URL after redirects. When you do ping the feeds, be sure to use the original URL (keep the "real" one simply for load-balancing) to make sure it redirects properly if the user has moved it or similar.
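A minimal sketch of that one-time resolution step, assuming Python 3's urllib (urlopen follows redirects by default, and geturl() reports where the request actually landed):

```python
# Sketch: resolve a feed URL to its final, post-redirect address so
# requests can be grouped by the host that actually serves them.
import urllib.request

def resolve_final_url(url):
    # urlopen follows HTTP redirects; geturl() returns the final URL.
    with urllib.request.urlopen(url) as resp:
        return resp.geturl()
```

You would run this occasionally for each feed, store the result, and group feeds by the final host for throttling purposes.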

Once that is done, you can simply devise a load mechanism such as only X requests per hour for a given domain, going through each feed and skipping feeds whose hosts have hit the limit. If feedburner keeps their limits public (not likely) you can use that for X, but otherwise you'll just have to make a rough estimate that you know to be below the limit. Knowing Google, however, their limits might measure patterns rather than enforce a specific hard number.
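A rough sketch of that per-domain hourly quota, with X as a parameter you'd estimate (the value and URLs below are placeholders):

```python
# Sketch: allow at most `per_hour` requests per host in any rolling hour;
# callers skip feeds whose host is over quota and retry later.
import time
from collections import defaultdict
from urllib.parse import urlparse

class HourlyQuota:
    def __init__(self, per_hour=100):  # 100 is a made-up estimate for X
        self.per_hour = per_hour
        self.hits = defaultdict(list)  # host -> timestamps of recent hits

    def allow(self, feed_url, now=None):
        """True if this feed's host is under quota; records the hit if so."""
        now = time.time() if now is None else now
        host = urlparse(feed_url).netloc
        # Keep only hits inside the rolling one-hour window.
        window = [t for t in self.hits[host] if now - t < 3600]
        self.hits[host] = window
        if len(window) >= self.per_hour:
            return False  # over quota: skip this feed for now
        window.append(now)
        return True
```

The fetch loop would call `allow()` per feed (using the resolved "real" URL) and simply move on when it returns False.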

Edit: Added suggestion from comment.

Christian P.
I'd change one-time to once per day. People do change their redirects from time to time.
Hugh Brackett