The site I'm working on needs to fetch tweets from 150-300 people, store them locally, and then list them on the front page. The profiles are organized in groups.

The pages will show

  • the last 20 tweets (or 21-40, etc.) by date, for a group of profiles, a single profile, a search, or a "subject" (which is sort of a different kind of group, I think)
  • a live, context-aware tag cloud (based on the last 300 tweets of the current search, group of profiles, or single profile shown)
  • various statistics (group stats, most active, etc.) which depend on the type of page shown.

We're expecting a fair bit of traffic. The last, similar site peaked at nearly 40K visits per day, and ran into trouble before I started caching pages as static files and disabling some features (some accidentally). This was caused mostly by the fact that a page load would also fetch the last x tweets from the 3-6 profiles that had gone longest without an update.

With this new site I can fortunately use cron to fetch tweets, so that helps. I'll also be denormalizing the db a little so it needs fewer joins, optimizing it for faster selects instead of size.

Now, the main question: how do I figure out which profiles to check for new tweets in an efficient manner? Some people tweet more often than others, and some tweet in bursts (this happens a lot). I want to keep the front page of the site as "current" as possible. If it comes to, say, 300 profiles, and I check 5 every minute, some tweets will only appear an hour after the fact. I can check more often (up to 20K), but I want to optimize this as much as possible, both to stay under the rate limit and to not run out of resources on the local server (the other site hit MySQL's connection limit).
Question 2: since cron only "runs" once a minute, I figure I have to check multiple profiles each minute - as stated, at least 5, possibly more. To try and spread it out over that minute I could have it sleep a few seconds between batches or even single profiles. But then if it takes longer than 60 seconds altogether, the script will run into itself. Is this a problem? If so, how can I avoid that?
Question 3: any other tips? Readmes? URLs?

+1  A: 

I wouldn't use cron; just use Twitter's streaming API with a filter for your 150-300 Twitter users.

statuses/filter

Returns public statuses that match one or more filter predicates. At least one predicate parameter, follow, locations, or track must be specified. Multiple parameters may be specified which allows most clients to use a single connection to the Streaming API. Placing long parameters in the URL may cause the request to be rejected for excessive URL length. Use a POST request header parameter to avoid long URLs.

The default access level allows up to 200 track keywords, 400 follow userids and 10 1-degree location boxes. Increased access levels allow 80,000 follow userids ("shadow" role), 400,000 follow userids ("birddog" role), 10,000 track keywords ("restricted track" role), 200,000 track keywords ("partner track" role), and 200 10-degree location boxes ("locRestricted" role). Increased track access levels also pass a higher proportion of statuses before limiting the stream.

I believe that when specifying userids, you do in fact get all tweets from the streaming API:

All streams that are not selected by user id have statuses from low-quality users removed. Results that are selected by user id, currently only results from the follow predicate, allow statuses from low-quality users to pass.

That should allow you to get real-time results without having to worry about rate limiting. You just need to make sure you can accept the data fast enough, but with 300 users that shouldn't be a problem.

Update - How to use the API: Unfortunately I've never had a chance to play with the streaming API. I have, however, daemonized PHP scripts before (yes, I know that's not a PHP strength, but if everything else you're doing is in PHP, it can be done).

I'd set up a simple PHP script to consume statuses and dump them (the raw JSON) into a message queue. I'd then point another script at the message queue to grab statuses and put them in the database. That way DB connectivity and processing time don't interfere with simply accepting the streamed data.

From the looks of it, Phirehose would fit the first part of that solution. Something like beanstalkd (with pheanstalk) would work as the message queue.

Tim Lytle
Thank you for your reply. I'm looking at the documentation now, but I'm not sure this is something I can use with PHP on a "regular" webserver (though we'll probably move this thing to Amazon cloud services later). It seems I would have to have a script running continuously, keeping the connection open and storing new tweets as they come in - is that correct? Do you have thoughts on that?
MSpreij
Ah, I found phirehose (http://code.google.com/p/phirehose/), looks like what I need.
MSpreij
Updated with 'how I would do this'.
Tim Lytle