The site I'm working on needs to fetch the tweets from 150-300 people, store them locally, and then list them on the front page. The profiles sit in groups.
The pages will be showing
- the last 20 tweets (or 21-40, etc) by date, group of profiles, single profile, search, or "subject" (which is sort of a different group.. I think..)
- a live, context-aware tag cloud (based on the last 300 tweets of the current search, group of profiles, or single profile shown)
- various statistics (group stuffs, most active, etc) which depend on the type of page shown.
We're expecting a fair bit of traffic. The last, similar site peaked at nearly 40K visits per day, and ran intro trouble before I started caching pages as static files, and disabling some features (some, accidently..). This was caused mostly by the fact that a page load would also fetch the last x tweets from the 3-6 profiles which had not been updated the longest..
With this new site I can fortunately use cron to fetch tweets, so that helps. I'll also be denormalizing the db a little so it needs less joins, optimize it for faster selects instead of size.
Now, main question: how do I figure out which profiles to check for new tweets in an efficient manner? Some people will be tweeting more often than others, some will tweet in bursts (this happens a lot). I want to keep the front page of the site as "current" as possible. If it comes to, say, 300 profiles, and I check 5 every minute, some tweets will only appear an hour after the fact. I can check more often (up to 20K) but want to optimize this as much as possible, both to not hit the rate limit and to not run out of resources on the local server (it hit mysql's connection limit with that other site).
Question 2: since cron only "runs" once a minute, I figure I have to check multiple profiles each minute - as stated, at least 5, possibly more. To try and spread it out over that minute I could have it sleep a few seconds between batches or even single profiles. But then if it takes longer than 60 seconds altogether, the script will run into itself. Is this a problem? If so, how can I avoid that?
Question 3: any other tips? Readmes? URLs?