ansaurus

Question

mysql/stats: Weighting an average to accentuate differences from the mean

Answer 1

+1 A:

Quick 'n' dirty: get a male score by multiplying the male probabilities, and a female score by multiplying the female probabilities. Predict the larger. (Actually, don't multiply; sum the log of each probability instead.) I think this is a maximum likelihood estimator if you make the right (highly unrealistic) assumptions.

2010-03-15 16:26:46

Answer 2

A:

The standard formula for calculating the weighted mean is given in this question and this question

I think you could look into these approaches and then work out how you calculate your weights.

In your gender example above you could adopt something along the lines of a set of weights {1, ..., 0 , ..., 1} which is a linear decrease from 0 to 1 for gender values of 0% male to 50% and then a corresponding increase up to 100%. If you want the effect to be skewed in favour of the outlying values then you easily come up with a exponential or trigonometric function that provides a different set of weights. If you wanted to then a normal distribution curve will also do the trick.

chillysapien 2010-03-15 16:43:17

I know how to calculate weighted means. ;-) The problem is that I'm not sure what an appropriate weighting function is, that won't presume that the value in question is evenly distributed (as gender approximately happens to be). A good answer will probably be based on z-scores.

Sai Emrys 2010-03-15 16:47:13

Answer 3

+2 A:

A good framework for these kinds of calculations is Bayesian inference. You have a prior distribution of the demographics - eg 50% male, 37% childless, etc. Preferrably, you would have it multivariately: 10% male childless 0-17 Caucasian ..., but you can start with one-at-a-time.
After this prior each site contributes new information about the likelihood of a demographic category, and you get the posterior estimate which informs your final guess. Using some independence assumptions the updating formula is as follows:

posterior odds = (prior odds) * (site likelihood ratio),

where odds = p/(1-p) and the likelihood ratio is a multiplier modifying the odds after visiting the site. There are various formulas for it, but in this case I would just use the above formula for the general population and the site's population to calculate it.

For example, for a site that has 35% of its visitors in the "under 20" agegroup, which represents 20% of the population, the site likelihood ratio would be LR = (0.35/0.65) / (0.2/0.8) = 2.154 so visiting this site would raise the odds of being "under 20" 2.154-fold.

A site that is 100% male would have an infinite LR, but you would probably want to limit it somewhat by, say, using only 99.9% male. A site that is 50% male would have an LR of 1, so it would not contribute any information on gender distribution.

Suppose you start knowing nothing about a person - his or her odds of being "under 20" are 0.2/0.8 = 0.25. Suppose the first site has an LR=2.154 for this outcome - now the odds of being "under 20" becomes 0.25*(2.154) = 0.538 (corresponding to the probability of 35%). If the second site has the same LR, the posterior odds become 1.16, which is already 54%, etc. (probability = odds/(1+odds)). At the end you would pick the category with the highest posterior probability.

There are loads of caveats with these calculations - for example, the assumption of independence likely being wrong, but it can provide a good start.

Aniko 2010-03-15 17:33:53

So, `prob(user is stat X) = (population prior prob. X) * product(site prob. X, over all sites user hit)` (assuming all sites are equally weighted)? Hmm. I think to do this efficiently I'm going to need some transform that allows me to use `SUM()` since mysql doesn't have an analogous `PROD()`. :/

Sai Emrys 2010-03-15 17:56:20

er, s/prob/prob * (1-prob)/ for the right hand ones

Sai Emrys 2010-03-15 17:57:50

@Sai: `PROD(x) = EXP(SUM(LN(x)))`, though the actual formula for this case is much more complex and requires recursion.

Quassnoi 2010-03-15 18:19:48

The question is, do you want to do it _right_, or do you want to do it in SQL?

Nick Johnson 2010-03-15 18:46:06

You can do all the calculations in the log-odds scale, and then the multiplications turn into additions as Quassnoi noted. In fact, the largest log-odds will correspond to the largest probability, so you don't need to actually calculate the probability (unless you want to report it).

Aniko 2010-03-15 19:01:16

@aniko - I do want to report it. @nick - Aww, can't I have both? :P

Sai Emrys 2010-03-15 21:21:44

Answer 4

+2 A:

The naive Bayesian formula for you case looks like this:

SELECT  probability
FROM    (
        SELECT  @apriori := CAST(@apriori * ratio / (@apriori * ratio + (1 - @apriori) * (1 - ratio)) AS DECIMAL(30, 30)) AS probability,
                @step := @step + 1 AS step
        FROM    (
                SELECT  @apriori := 0.5,
                        @step := 0
                ) vars,
                (
                SELECT  0.99 AS ratio
                UNION ALL
                SELECT  0.48
                UNION ALL
                SELECT  0.48
                UNION ALL
                SELECT  0.48
                UNION ALL
                SELECT  0.48
                UNION ALL
                SELECT  0.48
                UNION ALL
                SELECT  0.48
                UNION ALL
                SELECT  0.48
                ) q
        ) q2
ORDER BY
        step DESC
LIMIT 1

Quassnoi 2010-03-15 18:43:34

ansaurus

tags:

views:

answers:

mysql/stats: Weighting an average to accentuate differences from the mean

related questions