I have a simple problem: I have to fetch a URL about once a minute, check whether there is any new content, and if there is, post it to another URL.
I have a working system: a cronjob runs every minute and basically does:
count, post_count = 0, 0
for link in models.Link.objects.filter(enabled=True).select_related():
    # do it in two phases in case there is cross-pollination
    # get posts
    twitter_posts, meme_posts = [], []
    if link.direction in ("t2m", "both"):
        twitter_posts = utils.get_twitter_posts(link)
    if link.direction in ("m2t", "both"):
        meme_posts = utils.get_meme_posts(link)
    # process them
    if twitter_posts:
        post_count += views.twitter_link(link, twitter_posts)
    if meme_posts:
        post_count += views.meme_link(link, meme_posts)
    count += 1
msg = "%s links crawled and %s posts updated" % (count, post_count)
This works great for the 150 users I have now, but the synchronous design scares me. I have URL timeouts built in, but at some point the cronjob will take more than a minute to finish, and I'll be left with many copies of it running at once, overwriting each other.
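One cheap way to stop overlapping cron runs, independent of how the crawl itself is written, is a non-blocking file lock: if a previous run still holds the lock, the new run just exits. A minimal sketch, assuming a POSIX system (`fcntl` is Unix-only) and a made-up lock path:

```python
# Sketch: prevent overlapping cron runs with a non-blocking file lock.
# The lock path is a hypothetical choice, not from the original code.
import fcntl

LOCK_PATH = "/tmp/crawler.lock"  # assumed location

def acquire_lock():
    """Return an open lock-file handle, or None if another run holds the lock."""
    handle = open(LOCK_PATH, "w")
    try:
        # LOCK_NB makes this fail immediately instead of waiting
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle
```

The cronjob would call `acquire_lock()` first and exit immediately if it returns `None`; the lock is released automatically when the process exits, so a crashed run can't wedge the system.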
So, how should I rewrite it?
Some issues:
- I don't want to hit the APIs too hard in case they block me, so I'd like to have at most 5 open connections to any API at any time.
- Users keep registering in the system as this runs, so I need some way to add them
- I'd like this to scale as well as possible
- I'd like to reuse as much existing code as I can
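The "at most 5 connections" constraint maps naturally onto a bounded thread pool: the pool size caps concurrency, and each worker reuses the existing per-link fetch code. A rough sketch, where `fetch_link`, and the links passed in, are placeholders standing in for the real `get_twitter_posts`/`get_meme_posts` calls:

```python
# Sketch: cap concurrent API fetches at 5 using a bounded thread pool.
# fetch_link is a hypothetical stand-in for the real per-link fetch code.
from concurrent.futures import ThreadPoolExecutor

def fetch_link(link):
    # real code would call utils.get_twitter_posts / get_meme_posts here
    return "posts for %s" % link

def crawl(links, max_connections=5):
    # at most max_connections fetches run at once; map preserves input order
    with ThreadPoolExecutor(max_workers=max_connections) as pool:
        return list(pool.map(fetch_link, links))
```

To cap connections per API rather than overall, you could instead run one such pool for each API.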
So, some thoughts I've had:
- Spawn a thread for each link
- Use python-twisted: keep one long-running process, which the cronjob just makes sure is still running.
- Use Stackless Python: I don't really know much about it.
- Ask StackOverflow :)
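The "one long-running process" idea also handles the new-registrations issue for free, as long as the loop re-queries the link list on every pass instead of caching it. A schematic sketch, with `get_enabled_links` and `process_link` as placeholders for the real ORM query and per-link work (the `max_passes` parameter exists only to make the loop finite for testing):

```python
# Sketch of a single long-running crawler process: re-query links each
# pass so newly registered users are picked up, then sleep out the
# remainder of the interval. All function names here are hypothetical.
import time

def run_forever(get_enabled_links, process_link, interval=60, max_passes=None):
    passes = 0
    while max_passes is None or passes < max_passes:
        start = time.time()
        for link in get_enabled_links():  # fresh query: picks up new users
            process_link(link)
        passes += 1
        # sleep for whatever is left of the interval, if anything
        remaining = interval - (time.time() - start)
        if remaining > 0 and (max_passes is None or passes < max_passes):
            time.sleep(remaining)
```

The cronjob's only remaining job would be to check this process is alive and restart it if not, which is exactly the supervision role described in the python-twisted bullet above.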
How would you do this?