Hi,

I am looking for a scalable way to do the following:

  • User login
  • Fetch all Friends from Twitter
  • Fetch all Followers from Twitter
  • Display all Friends which aren't Followers

The Problem: How can this be done in a scalable way? A user can have up to 2 million friends or followers. Currently I'm storing both lists in an SQLite table and comparing them in a loop. When the user comes back, the table is cleared and the process starts again.

This works fine for 100 - 1000 Friends, but will be tricky with 500000 Friends. I can't cache the lists because they can change at any moment.

Does anyone know a good way to handle such a large amount of data?

A: 

Another thing to point out - do you need to display all friends that aren't followers at one time? If you only need to display a limited number at a time, 20 for example, then you can just calculate those 20; if they request more, then calculate more on the fly (or do it in the background as they browse your site; on each request, generate a few more).

I can't really imagine a situation where you would need to display a couple of million results in one page, even if that's the theoretical limit.

So, an approach that might work (from a brief browse of their API documentation) would be to:

  • grab a chunk of their friends (it appears that you get 100 per request anyway) using the statuses/friends API
  • for each retrieved friend
    • use the friendships/show API to determine the follower status between the two
    • if you've got enough results (e.g. 20) then break, you're done
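
The steps above can be sketched as a lazy generator. This is only an illustration: `friends_not_following_back`, `friend_pages` and `is_follower` are hypothetical names, and the stand-in data replaces real calls to statuses/friends and friendships/show.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def friends_not_following_back(
    friend_pages: Iterable[Iterable[int]],
    is_follower: Callable[[int], bool],
) -> Iterator[int]:
    """Lazily yield friend ids for which is_follower(id) is False."""
    for page in friend_pages:                # one page stands for one API request
        for friend_id in page:
            if not is_follower(friend_id):   # stands in for a friendships/show call
                yield friend_id


# Stand-in data instead of real API calls (made-up values).
fake_pages = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
fake_followers = {2, 4, 5, 9}

# Take only as many results as the page needs (3 here, 20 in the answer).
first_three = list(
    islice(friends_not_following_back(fake_pages, fake_followers.__contains__), 3)
)
print(first_three)  # [1, 3, 6]
```

Because the generator stops as soon as `islice` has enough results, no follower check is made for friends beyond the ones needed to fill the page.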

This approach does require more requests to the server than Twitter's rate limiting policies permit, but then again, getting the entire friend list of a user with 2,000,000 friends at 100 friends per request will also exceed the limit well before you get them all (150 requests x 100 per request = 15,000 friends). How do you plan to address this problem?
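
To make the arithmetic concrete, here is a small calculation using only the figures quoted above (100 friends per request, a limit of 150 requests, 2,000,000 friends). The length of the rate-limit window is not stated here, so the sketch only counts windows, and the variable names are illustrative.

```python
import math

FRIENDS = 2_000_000          # worst case mentioned in the question
PER_REQUEST = 100            # friends returned per request
REQUESTS_PER_WINDOW = 150    # request limit quoted above

requests_needed = math.ceil(FRIENDS / PER_REQUEST)
friends_per_window = REQUESTS_PER_WINDOW * PER_REQUEST
windows_needed = math.ceil(requests_needed / REQUESTS_PER_WINDOW)

print(requests_needed)     # 20000
print(friends_per_window)  # 15000
print(windows_needed)      # 134
```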

El Yobo
You could always have the user export their own follower feed and submit that as part of the startup process... Curious if there's a safe way to pass that off to a third-party processor (doubtful).
drachenstern
A: 

Not the only way to do this, but effective: run a cron job to download a list of Twitter users every day from a site that has a public list (or from Twitter itself), then index those friends (maybe 1,000 every day). Then access the Twitter API through PHP using cURL to retrieve a list of your friends, and match the arrays. This works well because you can improve your algorithm as you go; as noted above, the rate-limiting policies will prevent you from doing anything else. Good luck! =)
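
One way to "match the arrays" without a nested loop is a set lookup, sketched here in Python rather than PHP. The function name `not_following_back` and the sample ids are made up for illustration.

```python
from typing import Iterable, List


def not_following_back(friend_ids: Iterable[int], follower_ids: Iterable[int]) -> List[int]:
    """Return the friend ids that do not appear among the follower ids."""
    followers = set(follower_ids)  # average O(1) membership tests
    # Keep the original order of the friend list.
    return [fid for fid in friend_ids if fid not in followers]


print(not_following_back([10, 20, 30, 40], [20, 40, 50]))  # [10, 30]
```

Building the set costs one pass over the followers, after which each friend is checked once, so the work grows roughly with the sum of the two list sizes instead of their product.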

Stiggz
A: 

I don't know what your database looks like, but this is how I would set it up.

CREATE TABLE twitter_users (
    user_id INTEGER PRIMARY KEY NOT NULL,
    screen_name VARCHAR(20) NOT NULL
);

CREATE TABLE friends (
    friend_id INTEGER PRIMARY KEY NOT NULL
);

CREATE TABLE followers (
    follower_id INTEGER PRIMARY KEY NOT NULL
);

Then you can use this SQL to get the friends who are not followers.

SELECT friend_id, screen_name
FROM friends
LEFT JOIN followers ON follower_id = friend_id
LEFT JOIN twitter_users ON user_id = friend_id
WHERE follower_id IS NULL

If the screen name is NULL it means they are not in your twitter_users table. You can look up the missing users and store them for later. Screen names can change so you might need to update the table periodically.
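
A minimal sketch of this setup running against an in-memory SQLite database, using Python's standard `sqlite3` module. It assumes the column in `followers` is named `follower_id`, as the query expects. The sample rows are made up, and the `ORDER BY` is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE twitter_users (
    user_id INTEGER PRIMARY KEY NOT NULL,
    screen_name VARCHAR(20) NOT NULL
);
CREATE TABLE friends (friend_id INTEGER PRIMARY KEY NOT NULL);
CREATE TABLE followers (follower_id INTEGER PRIMARY KEY NOT NULL);
""")

# Made-up sample data: user 2 follows back, users 1 and 3 do not.
conn.executemany("INSERT INTO friends VALUES (?)", [(1,), (2,), (3,)])
conn.executemany("INSERT INTO followers VALUES (?)", [(2,)])
conn.executemany("INSERT INTO twitter_users VALUES (?, ?)", [(1, "alice")])

rows = conn.execute("""
SELECT friend_id, screen_name
FROM friends
LEFT JOIN followers ON follower_id = friend_id
LEFT JOIN twitter_users ON user_id = friend_id
WHERE follower_id IS NULL
ORDER BY friend_id
""").fetchall()
print(rows)  # [(1, 'alice'), (3, None)]

# Ids whose screen name still has to be looked up and stored.
missing = [friend_id for friend_id, screen_name in rows if screen_name is None]
print(missing)  # [3]
```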

Use the friends/ids and followers/ids APIs to get the lists of friend and follower ids, 5,000 at a time. Use the users/lookup API to get up to 100 screen names per call. If a user has 2,000,000 friends, it will take 400 API calls to get the list of ids, so you should still cache the list, at least for popular users.
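
The batch sizes above translate into two small helpers, sketched below. No real API calls are made; `id_calls_needed` and `lookup_batches` are hypothetical names, and the constants are simply the figures quoted in this answer.

```python
import math
from typing import Iterator, Sequence

IDS_PER_CALL = 5000   # ids returned per friends/ids or followers/ids call
LOOKUP_BATCH = 100    # users accepted per users/lookup call


def id_calls_needed(total_ids: int) -> int:
    """Number of id-list calls needed to page through total_ids ids."""
    return math.ceil(total_ids / IDS_PER_CALL)


def lookup_batches(ids: Sequence[int]) -> Iterator[Sequence[int]]:
    """Split ids into chunks small enough for one users/lookup call each."""
    for start in range(0, len(ids), LOOKUP_BATCH):
        yield ids[start:start + LOOKUP_BATCH]


print(id_calls_needed(2_000_000))  # 400
print([len(batch) for batch in lookup_batches(list(range(250)))])  # [100, 100, 50]
```

Only the ids missing from `twitter_users` need to go through `lookup_batches`, which keeps the number of users/lookup calls small once the table has filled up.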

mcrumley