Treating Twitter users as the "nodes" and the relation "u follows v" as the "edges", we have a graph from which I would like to select a subset of users at random. I could be wrong, but from reading the API docs I think it's impossible to get a collection of users except by getting the followers or friends of an already-known user.

So, starting from myself and exploring the Twitter graph from there, what's a good way to select a random sample of (say 100) users?

+1  A: 

Unless you have the entire twitter user graph (or a random sample of it), you won't be able to take a random sample. Otherwise, any sample you take will be biased by its relationship to you.

Donnie DeBoer
Yes, I agree, the randomness won't be perfect. But, as an impractical example, suppose I started with myself and took 10,000 random steps. The user I landed on would be pretty random.
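A minimal sketch of such a walk, for illustration only. The `get_neighbors` callback is a hypothetical stand-in for whatever API call returns a user's friends or followers, and restarting at a dead end is an assumption of this sketch, not something the API prescribes.

```python
import random

def random_walk(start, get_neighbors, steps, rng=random):
    """Take `steps` random steps from `start` and return the user reached."""
    current = start
    for _ in range(steps):
        neighbors = get_neighbors(current)
        if not neighbors:
            # Dead end (nobody to move to): go back to the starting user.
            # This restart still counts as one of the steps.
            current = start
            continue
        current = rng.choice(neighbors)
    return current

# Toy graph standing in for the real follow graph (hypothetical data).
toy_graph = {
    "me": ["alice", "bob"],
    "alice": ["bob", "carol"],
    "bob": ["me"],
    "carol": [],
}
landed_on = random_walk("me", lambda u: toy_graph.get(u, []), steps=10_000)
```

How uniform the landing point is depends on the graph: on many graphs a long walk tends to favor well-connected users rather than visiting everyone equally often.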
I. J. Kennedy
@I.J: Not true. It really depends on the structure of the graph. You could make some assumptions that would imply that, but who knows what the Twitter user graph looks like.
Moron
+1  A: 

Assuming six degrees of separation holds, you could do a breadth-first search up to 6 levels deep and select 100 random users from the resulting list. Or you could decide to stop looking for more users once you have, say, a million unique users, and sample 100 from those.

Since storing a list of a million users and then sampling from it might be prohibitive, there is a technique called Reservoir Sampling that lets you sample during the traversal itself.
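A rough sketch of how the two ideas could be combined, assuming a hypothetical `get_neighbors(user)` callback that wraps the friends/followers API call. The function name, parameters, and toy graph are illustrative only.

```python
import random
from collections import deque

def bfs_reservoir_sample(start, get_neighbors, k, max_users,
                         max_depth=6, rng=random):
    """Breadth-first traversal that keeps a uniform sample of k visited users."""
    reservoir = []
    seen = {start}               # IDs already discovered, to avoid revisits
    queue = deque([(start, 0)])  # (user, depth) pairs
    visited = 0
    while queue and visited < max_users:
        user, depth = queue.popleft()
        visited += 1
        if len(reservoir) < k:
            # Fill the reservoir with the first k users.
            reservoir.append(user)
        else:
            # Keep the new user with probability k / visited.
            j = rng.randrange(visited)
            if j < k:
                reservoir[j] = user
        if depth < max_depth:
            for neighbor in get_neighbors(user):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append((neighbor, depth + 1))
    return reservoir

# Toy graph (hypothetical): user u "follows" 2u+1 and 2u+2.
def toy_neighbors(u):
    return [2 * u + 1, 2 * u + 2] if u < 500 else []

sample = bfs_reservoir_sample(0, toy_neighbors, k=100, max_users=1000)
```

Note that the `seen` set still holds every discovered ID so the search does not revisit users; the reservoir only removes the need for a separate list to sample from afterwards. The sample is uniform over the users the search visited, which is not the same as uniform over all of Twitter.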

Moron
+1  A: 

Just query the public timeline, and use the set of users returned:

http://apiwiki.twitter.com/Twitter-REST-API-Method%3A-statuses-public_timeline

It won't be random, since it's just the last 20 tweets sent by anyone, but it will most likely never be the same set of users twice.

Since it only gives you 20 at a time, and the results are cached on their servers for 60 seconds, you'll have to make at least 5 requests with a 60-second pause between them.

Of course, some users may tweet several times in that period, so you might get fewer than 100 distinct users in that time. If you need exactly 100, just loop until you have them.
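The loop could look roughly like this. `fetch_timeline` is a hypothetical callback that performs the public timeline request and returns a list of statuses; the `status["user"]["id"]` layout is an assumption about the response shape, so adjust it to whatever the API actually returns.

```python
import itertools
import time

def collect_users(fetch_timeline, target=100, pause_seconds=60,
                  sleep=time.sleep):
    """Poll the timeline until `target` distinct user IDs have been seen."""
    users = set()
    while True:
        for status in fetch_timeline():
            users.add(status["user"]["id"])  # assumed response layout
            if len(users) == target:
                return users
        # Results are cached server-side, so wait before asking again.
        sleep(pause_seconds)

# Fake timeline for demonstration: 20 statuses per call, 10 distinct users.
_counter = itertools.count()

def fake_timeline():
    return [{"user": {"id": next(_counter) // 2}} for _ in range(20)]

# Pass a no-op sleep so the demonstration does not actually wait.
users = collect_users(fake_timeline, target=100, sleep=lambda seconds: None)
```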

pib
+1  A: 

I would use the numerical user id. Generate a bunch of random numbers and fetch users based on them. If you hit a nonexistent id, simply skip it.

The Twitter API wiki, for users/show:

id. The ID or screen name of a user.
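As a sketch of the idea: `lookup_user` below is a hypothetical wrapper around a users/show request that returns `None` for a nonexistent id, and `max_id` is an assumed upper bound on the id range that you would have to estimate yourself.

```python
import random

def sample_by_random_id(lookup_user, max_id, k,
                        max_attempts=100_000, rng=random):
    """Draw random ids in [1, max_id], skipping ids with no user behind them."""
    found = {}
    attempts = 0
    while len(found) < k and attempts < max_attempts:
        attempts += 1
        user_id = rng.randint(1, max_id)
        if user_id in found:
            continue  # already sampled this user
        user = lookup_user(user_id)
        if user is not None:
            found[user_id] = user
    return list(found.values())

# Fake lookup for demonstration: only even ids exist (hypothetical data).
def fake_lookup(user_id):
    return {"id": user_id} if user_id % 2 == 0 else None

picked = sample_by_random_id(fake_lookup, max_id=10_000, k=100)
```

The `max_attempts` cap is there so the loop ends even if the guessed range contains very few real users; in practice each attempt is an API call and counts against the rate limit.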

Joel L
Thanks. Do you know the range of numerical user ids?
I. J. Kennedy
You could create a new account, and see what id it gets (easiest to look at the RSS feed url, which includes the user id). My user id is ~1200, so I guess they started at 1 (or near that).
Joel L
If you can figure out a structure of the IDs, this is probably a very good option.
Moron
This will only work if the range of numerical IDs has no holes, or if the distribution of holes in the IDs is uniform across the range of IDs. If there is a non-uniform distribution of holes in the ID range, then generating random IDs and skipping invalid ones (holes) will result in a biased sample of users. Imagine there are more holes the higher you go in the ID range (non-uniform distribution of holes). If you select 100 random IDs in the range, your sample will be biased toward low-ID users. This could be a big problem if user ID correlates to some other user trait you care about.
Donnie DeBoer