views:

233

answers:

4

This is for a new feature on http://cssfingerprint.com (see /about for general info).

The feature looks up the sites you've visited in a database of site demographics, and tries to guess what your demographic stats are based on that.

All my demgraphics are in 0..1 probability format, not ratios or absolute numbers or the like.

Essentially, you have a large number of data points that each tend you towards their own demographics. However, just taking the average is poor, because it means that by adding in a lot of generic data, the number goes down.

For example, suppose you've visited sites S0..S50. All except S0 are 48% female; S0 is 100% male. If I'm guessing your gender, I want to have a value close to 100%, not just the 49% that a straight average would give.

Also, consider that most demographics (i.e. everything other than gender) does not have the average at 50%. For example, the average probability of having kids 0-17 is ~37%. The more a given site's demographics are different from this average (e.g. maybe it's a site for parents, or for child-free people), the more it should count in my guess of your status.

What's the best way to calculate this?

For extra credit: what's the best way to calculate this, that is also cheap & easy to do in mysql?

ETA: I think that something approximating what I want is Φ(AVG(z-score ^ 2, sign preserved)). But I'm not sure if this is a good weighting function.

(Φ is the standard normal distribution function - http://en.wikipedia.org/wiki/Standard_normal_distribution#Definition)

+1  A: 

Quick 'n' dirty: get a male score by multiplying the male probabilities, and a female score by multiplying the female probabilities. Predict the larger. (Actually, don't multiply; sum the log of each probability instead.) I think this is a maximum likelihood estimator if you make the right (highly unrealistic) assumptions.

A: 

The standard formula for calculating the weighted mean is given in this question and this question

I think you could look into these approaches and then work out how you calculate your weights.

In your gender example above you could adopt something along the lines of a set of weights {1, ..., 0 , ..., 1} which is a linear decrease from 0 to 1 for gender values of 0% male to 50% and then a corresponding increase up to 100%. If you want the effect to be skewed in favour of the outlying values then you easily come up with a exponential or trigonometric function that provides a different set of weights. If you wanted to then a normal distribution curve will also do the trick.

chillysapien
I know how to calculate weighted means. ;-) The problem is that I'm not sure what an appropriate weighting function is, that won't presume that the value in question is evenly distributed (as gender approximately happens to be). A good answer will probably be based on z-scores.
Sai Emrys
+2  A: 

A good framework for these kinds of calculations is Bayesian inference. You have a prior distribution of the demographics - eg 50% male, 37% childless, etc. Preferrably, you would have it multivariately: 10% male childless 0-17 Caucasian ..., but you can start with one-at-a-time.
After this prior each site contributes new information about the likelihood of a demographic category, and you get the posterior estimate which informs your final guess. Using some independence assumptions the updating formula is as follows:

posterior odds = (prior odds) * (site likelihood ratio),

where odds = p/(1-p) and the likelihood ratio is a multiplier modifying the odds after visiting the site. There are various formulas for it, but in this case I would just use the above formula for the general population and the site's population to calculate it.

For example, for a site that has 35% of its visitors in the "under 20" agegroup, which represents 20% of the population, the site likelihood ratio would be LR = (0.35/0.65) / (0.2/0.8) = 2.154 so visiting this site would raise the odds of being "under 20" 2.154-fold.

A site that is 100% male would have an infinite LR, but you would probably want to limit it somewhat by, say, using only 99.9% male. A site that is 50% male would have an LR of 1, so it would not contribute any information on gender distribution.

Suppose you start knowing nothing about a person - his or her odds of being "under 20" are 0.2/0.8 = 0.25. Suppose the first site has an LR=2.154 for this outcome - now the odds of being "under 20" becomes 0.25*(2.154) = 0.538 (corresponding to the probability of 35%). If the second site has the same LR, the posterior odds become 1.16, which is already 54%, etc. (probability = odds/(1+odds)). At the end you would pick the category with the highest posterior probability.

There are loads of caveats with these calculations - for example, the assumption of independence likely being wrong, but it can provide a good start.

Aniko
So, `prob(user is stat X) = (population prior prob. X) * product(site prob. X, over all sites user hit)` (assuming all sites are equally weighted)? Hmm. I think to do this efficiently I'm going to need some transform that allows me to use `SUM()` since mysql doesn't have an analogous `PROD()`. :/
Sai Emrys
er, s/prob/prob * (1-prob)/ for the right hand ones
Sai Emrys
@Sai: `PROD(x) = EXP(SUM(LN(x)))`, though the actual formula for this case is much more complex and requires recursion.
Quassnoi
The question is, do you want to do it _right_, or do you want to do it in SQL?
Nick Johnson
You can do all the calculations in the log-odds scale, and then the multiplications turn into additions as Quassnoi noted. In fact, the largest log-odds will correspond to the largest probability, so you don't need to actually calculate the probability (unless you want to report it).
Aniko
@aniko - I do want to report it. @nick - Aww, can't I have both? :P
Sai Emrys
+2  A: 

The naive Bayesian formula for you case looks like this:

SELECT  probability
FROM    (
        SELECT  @apriori := CAST(@apriori * ratio / (@apriori * ratio + (1 - @apriori) * (1 - ratio)) AS DECIMAL(30, 30)) AS probability,
                @step := @step + 1 AS step
        FROM    (
                SELECT  @apriori := 0.5,
                        @step := 0
                ) vars,
                (
                SELECT  0.99 AS ratio
                UNION ALL
                SELECT  0.48
                UNION ALL
                SELECT  0.48
                UNION ALL
                SELECT  0.48
                UNION ALL
                SELECT  0.48
                UNION ALL
                SELECT  0.48
                UNION ALL
                SELECT  0.48
                UNION ALL
                SELECT  0.48
                ) q
        ) q2
ORDER BY
        step DESC
LIMIT 1
Quassnoi