views: 300

answers: 4

I am creating an application to help our team manage a Twitter competition. So far I've managed to interact with the API fine, and return a set of tweets that I need.

I'm struggling to decide on the best way to handle the storage of the tweets in the database, how often to check for them and how to ensure there are no overlaps or gaps.

You can get a maximum of 100 tweets per page. At the moment, my current idea is to run a cron script, say, once every 5 minutes, grab a full 100 tweets at a time, and loop through them, checking the db for each one before adding it.
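Roughly, the cron script would look like this (table, column, and function names are just illustrative, and `fetch_latest_tweets()` stands in for my existing API call):

```
<?php
// cron.php - run every 5 minutes
// fetch_latest_tweets() stands in for my existing API call and
// returns an array of decoded tweets (up to 100 per page).
$tweets = fetch_latest_tweets();

$db = new PDO('mysql:host=localhost;dbname=competition', 'user', 'pass');
$check  = $db->prepare('SELECT 1 FROM tweets WHERE tweet_id = ?');
$insert = $db->prepare(
    'INSERT INTO tweets (tweet_id, username, text, created_at)
     VALUES (?, ?, ?, ?)'
);

foreach ($tweets as $tweet) {
    // one SELECT per tweet to check for duplicates - the bit I don't like
    $check->execute(array($tweet['id']));
    if ($check->fetchColumn() === false) {
        $insert->execute(array(
            $tweet['id'],
            $tweet['user']['screen_name'],
            $tweet['text'],
            date('Y-m-d H:i:s', strtotime($tweet['created_at'])),
        ));
    }
}
```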

This has the obvious drawback of running 100 queries against the db every 5 minutes, plus however many INSERTs there are, which I really don't like. I would also much rather have something a little more real time: as Twitter is a live service, it stands to reason that we should update our list of entrants as soon as they enter.

This again throws up the drawback of having to repeatedly poll Twitter, which, although it might be necessary, means hammering their API more than I'd like.

Does anyone have any ideas for an elegant solution? I need to ensure that I capture all the tweets, not leave anyone out, and keep each user unique in the db. I have considered just adding everything and then grouping the resulting table by username, but that's not tidy.

I'm happy to deal with the display side of things separately, as that's just a pull from MySQL and display. But the backend design is giving me a headache, as I can't see an efficient way to keep it ticking over without hammering either the API or the db.

+1  A: 

The Twitter API offers a streaming API that is probably what you want to use to ensure you capture everything: http://dev.twitter.com/pages/streaming_api_methods

If I understand what you're looking for, you'll probably want statuses/filter, using the track parameter with whatever distinguishing characteristics (hashtags, words, phrases, locations, users) you're looking for.

Many Twitter API libraries have this built in, but basically you keep an HTTP connection open and Twitter continuously sends you tweets as they happen. See the streaming API overview for details. If your library doesn't handle it for you, you'll have to check for dropped connections and reconnect, check the error codes, etc. - it's all in the overview. But adding tweets as they come in will allow you to eliminate duplicates in the first place (unless you only allow one entry per user - but that's a client-side restriction you can deal with later).
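For illustration, a bare-bones consumer in PHP might look something like this (the endpoint is the statuses/filter one from the docs linked above; the credentials, the track value, and `handle_tweet()` are placeholders, and reconnection logic is omitted):

```
<?php
// Minimal streaming consumer - reconnect/backoff logic omitted.
// Tweets arrive as newline-delimited JSON on a connection that
// Twitter holds open indefinitely.
$buffer = '';

$ch = curl_init('http://stream.twitter.com/1/statuses/filter.json');
curl_setopt_array($ch, array(
    CURLOPT_POST          => true,
    CURLOPT_POSTFIELDS    => 'track=mycontesthashtag',  // whatever you're tracking
    CURLOPT_USERPWD       => 'username:password',       // basic auth, per the docs
    CURLOPT_WRITEFUNCTION => function ($ch, $data) use (&$buffer) {
        $buffer .= $data;
        // each complete line in the stream is one tweet
        while (($pos = strpos($buffer, "\n")) !== false) {
            $line   = substr($buffer, 0, $pos);
            $buffer = substr($buffer, $pos + 1);
            if (trim($line) !== '') {
                handle_tweet(json_decode($line, true)); // your code: cache it, etc.
            }
        }
        return strlen($data); // tell cURL we consumed everything
    },
));
curl_exec($ch); // blocks for as long as the stream stays up
```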

As far as not hammering your DB: once you have Twitter just sending you stuff, you're in control on your end - you could easily have your client cache up the tweets as they come in, and then write them to the db at given time or count intervals - write whatever it has gathered every 5 minutes, or write once it has 100 tweets, or both (obviously these numbers are just placeholders). This is also when you could check for existing usernames if you need to - writing out a cached-up list gives you the best chance to make things as efficient as you want.
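Sketched out, that caching might look like this (the thresholds and names are placeholders, and `handle_tweet()` is the same hook as in the sketch above):

```
<?php
// Cache tweets in memory and flush them in one multi-row INSERT
// once we have 100 of them or 5 minutes have passed.
$cache     = array();
$lastFlush = time();

function handle_tweet(array $tweet)
{
    global $cache, $lastFlush;
    $cache[] = $tweet;
    if (count($cache) >= 100 || time() - $lastFlush >= 300) {
        flush_cache();
    }
}

function flush_cache()
{
    global $cache, $lastFlush, $db; // $db is your PDO connection
    if (empty($cache)) {
        return;
    }
    $rows = rtrim(str_repeat('(?, ?, ?), ', count($cache)), ', ');
    $stmt = $db->prepare("INSERT INTO tweets (tweet_id, username, text) VALUES $rows");
    $values = array();
    foreach ($cache as $t) {
        $values[] = $t['id'];
        $values[] = $t['user']['screen_name'];
        $values[] = $t['text'];
    }
    $stmt->execute($values);
    $cache     = array();
    $lastFlush = time();
}
```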

Update: My solution above is probably the best way to do it if you want live results (which it seems like you do). But as mentioned in another answer, it may well be possible to just use the Search API to gather entries after the contest is over, and not worry about storing them at all - you can specify pages when you ask for results (as outlined in the Search API link), but there are limits on how many results you can fetch overall, which may cause you to miss some entries. Which solution works best for your application is up to you.

cincodenada
Thanks, this does seem like the most flexible solution.
DavidYell
additionally, if you add a UNIQUE constraint to the tweet's id, you can bulk load the tweets from a CSV file and not worry about duplicates.
Jayrox
you can use PHP's `fputcsv` to save the data you need from the tweet into a flat file (very fast). Then use MySQL's `load data local infile` and bulk load the tweets into the database. This is also very fast.
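For example (column names and the file path are illustrative, and your MySQL connection must allow LOCAL INFILE):

```
<?php
// append each tweet to a flat file with fputcsv (very fast)...
$fh = fopen('/tmp/tweets.csv', 'a');
foreach ($tweets as $t) {
    fputcsv($fh, array($t['id'], $t['user']['screen_name'], $t['text']));
}
fclose($fh);

// ...then bulk load it. IGNORE skips any row that violates the
// UNIQUE constraint on tweet_id, so duplicates are dropped for free.
$db->exec("LOAD DATA LOCAL INFILE '/tmp/tweets.csv'
           IGNORE INTO TABLE tweets
           FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
           (tweet_id, username, text)");
```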
Jayrox
A: 

I read over your question and it seems to me that you want to duplicate data already stored by Twitter. Without more specifics on the competition you're running (how users enter, for example, or the estimated number of entries), it's impossible to know whether storing this information locally in a database is the best way to approach this problem.

Might a better solution be to skip storing duplicate data locally and pull the entrants directly from Twitter, i.e. when you're attempting to find a winner? You could eliminate duplicate entries on the fly while the code is running; you would just need to request the next page once you've finished processing the 100 entries already fetched. Although I'm not sure if this is possible directly through the Twitter API.
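For illustration, paging through the Search API at draw time might look like this (the endpoint and field names are from the Search API; the query is a placeholder and this is only a sketch):

```
<?php
// page through the Search API at draw time and de-dupe in memory;
// note the API only lets you page back so far, so entries could be
// missed on a big contest
$entrants = array();
for ($page = 1; $page <= 15; $page++) {
    $json = file_get_contents('http://search.twitter.com/search.json?q='
        . urlencode('#mycontest') . '&rpp=100&page=' . $page);
    $data = json_decode($json, true);
    if (empty($data['results'])) {
        break; // no more pages
    }
    foreach ($data['results'] as $tweet) {
        // keyed by username, so each user counts only once
        $entrants[$tweet['from_user']] = $tweet;
    }
}
// $entrants now holds one tweet per unique user - pick a winner from it
```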

It is possible, and I have achieved this functionality already. Plus for metrics and records, the guys want the stuff filed with us as well as on Twitter :)
DavidYell
+2  A: 

100 queries in 5 minutes is nothing. Especially since a tweet has essentially only 4 pieces of data associatedated with it: user ID, timestamp, tweet text, tweet ID - say, about 170 characters' worth of data per tweet. Unless you're running your database on a 4.77MHz 8088, your database won't even blink at that kind of "load".

Marc B
Tweets from the API have significantly more data associated with them than 170 characters. The JSON returned by Twitter per tweet can be over 3 KB, and often is.
Jayrox
A: 

I think running a cron every X minutes and basing it off of the tweets' creation dates may work. You can query your database to find the date/time of the last recorded tweet, then only process tweets created after that to prevent duplicates. Then, when you do your inserts into the database, use one or two insert statements containing all the entries you want to record, to keep performance up.

INSERT INTO `tweets` (id, date, ...) VALUES (..., ..., ...), (..., ..., ...), ...;

This doesn't seem too intensive, though it also depends on the number of tweets you expect to record. Also make sure to index the table properly.
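For example (names are illustrative, and `fetch_tweets_since()` stands in for the API call):

```
<?php
// cron job: fetch only tweets newer than the last one recorded,
// then write them all in a single multi-row INSERT
$db   = new PDO('mysql:host=localhost;dbname=competition', 'user', 'pass');
$last = $db->query('SELECT MAX(date) FROM tweets')->fetchColumn();

// fetch_tweets_since() is a placeholder for your API call; it should
// return only tweets created after $last
$tweets = fetch_tweets_since($last);

if ($tweets) {
    $rows = rtrim(str_repeat('(?, ?), ', count($tweets)), ', ');
    $stmt = $db->prepare("INSERT INTO tweets (id, date) VALUES $rows");
    $values = array();
    foreach ($tweets as $t) {
        $values[] = $t['id'];
        $values[] = date('Y-m-d H:i:s', strtotime($t['created_at']));
    }
    $stmt->execute($values);
}
```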

Aaron W.