views:

199

answers:

3

I just had a clever idea (I think).

Suppose you wanted to estimate the size of a userbase of a site which does not publicize this information.

People are more likely to have acquired different usernames with different probabilities. For instance, if the username 'nick' doesn't exist on the system, it's likely to have an extremely small userbase. If the username 'starbaby' is taken, it's likely to be a much larger site. It seems like a straightforward Bayesian problem.

There is the problem that different sites may have a different space of allowable usernames. The biggest problem would be the legality of common characters such as spaces, I imagine. Another issue that could taint the prior distribution is whether the site suggests names when the one you want is taken, or leaves you to think of a more creative name yourself.

How could you build a training set of the frequency of occurrence of usernames across different sized systems? Is there a way to use Bayes to do numeric estimation rather than classification into fixed-width buckets?

A: 

The only way is to get a large set of taken usernames on systems for which you know the size of the userbase. Data may be skewed in userbases where certain names are more common. Even a tiny userbase from a Lord of the Rings forum will likely contain the username Strider, for example.

brian
+2  A: 

I think this is a cool idea!

You may be able to put together a data set by using UserNameCheck.com for some different usernames and cross-referencing the results with the stated userbase sizes of those sites that give them out.

Note: that website does not seem to check if the usernames are valid for the site, so e.g. it thinks Gmail would let you register "[email protected]" even though that's too short.

A. Rex
If you knew these rules in advance if you wanted to estimate gmail's size you could just ignore such names from your prior distribution, if you're willing to assume independence.
ʞɔıu
(I was just noting that the service would have been better if it said "taken", "available", or "not allowed".) I think assuming independence is the right start here. The possible skewed results that you mention are real, but perhaps are "premature optimization". =)
A. Rex
+3  A: 

What you need to do is accurately estimate the probability that a certain user name is present given the number of users registered. Lets say N is the number of users and u = 1 if user u is present and 0 if they are absent.

First of all, make the assumption that the probability distributions for each user name are independent of each other. This is not going to be true - and you've already come up with one reason why - but it will probably be necessary since it makes the data collection and the maths a lot easier.

You are going to need a lot of data from sites with registered user names and the total number of users of that site. Now, take any specific user name and imagine your data points on a 2d plot (with N on x and u on y), there's going to be one horizontal line of points at y=0 and another at y=1. You can either bin the x axis as you suggest and take the mean y coordinate of all the data points in the bin to get a discrete function, or you could try to fit the points on the graph to some class of functions. I don't really know what that class of functions that would be - maybe some kind of power law? (I'm thinking of Zipf's law).

You now have the probability distributions to apply Bayes' rule. I don't know what kind of prior for N you would want to use. A uniform distribution (up to some large number) would make no assumptions, but I would guess most sites have a small user base.

I suspect that in order to make this work, when you sample users from a site you will need to do so for a specific set of users. I'm betting that the popularity of user names is going to have a very long tail and so a random sample of users is going to give you a lot of very infrequently used names and therefore a lot of uninformative evidence.

EDIT: I had another thought; in most forums (and on StackOverflow) users have consecutive user ids, so you can use a single site with a large number of users to give you estimates for all smaller N.

StompChicken