I have a bunch of data coming in (calls to an automated callcenter) about whether or not a person buys a particular product, 1 for buy, 0 for not buy.

I want to use this data to create an estimated probability that a person will buy a particular product, but the problem is that I may need to do it with relatively little historical data about how many people bought/didn't buy that product.

A friend recommended that with Bayesian probability you can "help" your probability estimate by coming up with a "prior probability distribution", essentially this is information about what you expect to see, prior to taking into account the actual data.

So what I'd like to do is create a method that has something like this signature (Java):

double estimateProbability(double[] priorProbabilities, int buyCount, int noBuyCount);

priorProbabilities is an array of probabilities I've seen for previous products, which this method would use to create a prior distribution for this probability. buyCount and noBuyCount are the actual data specific to this product, from which I want to estimate the probability of the user buying, given the data and the prior. This is returned from the method as a double.

I don't need a mathematically perfect solution, just something that will do better than a uniform or flat prior (i.e. probability = buyCount / (buyCount + noBuyCount)). Since I'm far more familiar with source code than mathematical notation, I'd appreciate it if people could use code in their explanations.

A: 

Sounds like what you're trying to do is Association Rule Learning. I don't have time right now to provide you with any code, but I will point you in the direction of WEKA which is a fantastic open source data mining toolkit for Java. You should find plenty of interesting things there that will help you solve your problem.

n3rd
This is interesting, but I don't see how it solves the specific problem I describe :-/
sanity
+1 to counter ignorance/laziness; this is a very good suggestion
Steven A. Lowe
Steven, I've read the linked article on ARL in its entirety. Perhaps you could explain how this suggestion solves the specific problem I outline?
sanity
@sanity: ARL will help if you have something else to correlate with.
Steven A. Lowe
@Steven, I'm not trying to correlate with any additional metadata. I think you (and n3rd) are trying to solve a different problem to the one presented in the question.
sanity
A: 

As I see it, the best you can do is use the uniform distribution, unless you have some clue about the underlying distribution. Or are you talking about finding a relationship between this product and products previously bought by the same person, in the Amazon fashion of "people who buy this product also buy..."?

tekBlues
The clue regarding the distribution is provided in the priorProbabilities parameter to the method. This is a list of purchase probabilities which we found for other products - and it can be used (hopefully) to come up with a prior distribution for this product's purchase probability.
sanity
IMHO, you need to correlate the buy or not buy with some other parameter (for example, age, gender, country, time of year, time of day, other products purchased, etc.). Otherwise the best information you have is the uniform distribution using the accumulated purchase rate.
tekBlues
Really that is all I'm looking for at this point. Normally I would be looking to correlate with metadata like age and gender, but the problem is that there simply isn't enough data for that. My challenge here is to come up with the most accurate probability possible of making a purchase based on a minimal amount of data (perhaps only a few hundred calls, where the typical purchase rate is around 5-10%). Partitioning the data based on age or gender simply isn't possible because there isn't enough data for it.
sanity
@sanity: then you must use the uniform distribution; anything else falls into the domain of magic. If 20 in 100 people have bought, the probability of a new person buying is 1/5, nothing else.
tekBlues
@tekBlues: I don't think so. The actual discovered probabilities for other products form a prior distribution of where we'd expect this probability to lie. For example, if all other probabilities are between 5% and 15%, then a uniform distribution clearly isn't appropriate.
sanity
+1 for blunt truth.
Steven A. Lowe
+2  A: 

Here's the Bayesian computation and one example/test:

def estimateProbability(priorProbs, buyCount, noBuyCount):
  # first, estimate the prob that the actual buy/nobuy counts would be observed
  # given each of the priors (times a constant that's the same in each case and
  # not worth the effort of computing;-)
  condProbs = [p**buyCount * (1.0-p)**noBuyCount for p in priorProbs]
  # the normalization factor for the above-mentioned neglected constant
  # can most easily be computed just once
  normalize = 1.0 / sum(condProbs)
  # so here's the probability for each of the prior (starting from a uniform
  # metaprior)
  priorMeta = [normalize * cp for cp in condProbs]
  # so the result is the sum of prior probs weighed by prior metaprobs
  return sum(pm * pp for pm, pp in zip(priorMeta, priorProbs))

def example(numProspects=4):
  # the a priori prob of buying was either 0.3 or 0.7, how does it change
  # depending on how 4 prospects bought or didn't?
  for bought in range(0, numProspects+1):
    result = estimateProbability([0.3, 0.7], bought, numProspects-bought)
    print('b=%d, p=%.2f' % (bought, result))

example()

output is:

b=0, p=0.31
b=1, p=0.36
b=2, p=0.50
b=3, p=0.64
b=4, p=0.69

which agrees with my by-hand computation for this simple case. Note that the estimated probability of buying will, by construction, always lie between the lowest and the highest of the prior probabilities; if that's not what you want, you can add a little fudge by introducing two "pseudo-products", one that nobody will ever buy (p=0.0) and one that everybody will always buy (p=1.0). This gives more weight to the actual observations, scarce as they may be, and less to the statistics about past products. If we do that here, we get:

b=0, p=0.06
b=1, p=0.36
b=2, p=0.50
b=3, p=0.64
b=4, p=0.94
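
For concreteness, the fudged numbers above correspond to simply appending the two pseudo-products to the list of priors passed in; a minimal sketch reusing estimateProbability from above (the exampleWithFudge name is mine):

def exampleWithFudge(numProspects=4):
  # same as example(), but with the pseudo-products p=0.0 and p=1.0
  # added to the list of prior probabilities
  for bought in range(0, numProspects+1):
    result = estimateProbability([0.0, 0.3, 0.7, 1.0], bought, numProspects-bought)
    print('b=%d, p=%.2f' % (bought, result))

exampleWithFudge()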

Intermediate levels of fudging (to account for the unlikely but not impossible chance that this new product may sell worse than anything previously sold, or better than everything) can easily be envisioned: give the artificial 0.0 and 1.0 probabilities a lower weight than the observed ones, by adding a vector priorWeights to estimateProbability's arguments.
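
One possible way to wire in such a priorWeights vector is sketched below; the estimateProbabilityWeighted name and the exact weighting scheme (a non-uniform meta-prior over the candidate probabilities) are my own guess at what is meant, not something spelled out above:

def estimateProbabilityWeighted(priorProbs, priorWeights, buyCount, noBuyCount):
  # same Bayesian update as estimateProbability, but each candidate prior
  # probability starts from its own meta-prior weight instead of 1/N
  condProbs = [w * p**buyCount * (1.0-p)**noBuyCount
               for p, w in zip(priorProbs, priorWeights)]
  normalize = 1.0 / sum(condProbs)
  priorMeta = [normalize * cp for cp in condProbs]
  return sum(pm * pp for pm, pp in zip(priorMeta, priorProbs))

# e.g. give the artificial 0.0 and 1.0 pseudo-products a tenth of the weight
# of the genuinely observed 0.3 and 0.7:
# estimateProbabilityWeighted([0.0, 0.3, 0.7, 1.0], [0.1, 1.0, 1.0, 0.1], 1, 3)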

This kind of thing is a substantial part of what I do all day, now that I work developing applications in Business Intelligence, but I just can't get enough of it...!-)

Alex Martelli
Thanks Alex, I'm glad someone appreciated the question :-) This definitely looks right, but I won't be able to examine your answer in detail until tomorrow. That being said, I'm happy to accept your answer for now :-)
sanity
By all means, check it out (transcoding to Java as needed, but do consider Jython for quick and dirty tests) and get back to me, either on this question or a new one; I'm at least as keen as you to get this working just right!-) *long live Bayes...!-)*
Alex Martelli
+1  A: 

A really simple way of doing this without any difficult math is to increase buyCount and noBuyCount artificially by adding virtual customers that either bought or didn't buy the product. You can tune how much you believe in each particular prior probability in terms of how many virtual customers you think it is worth.

In pseudocode:

def estimateProbability(priorProbs, buyCount, noBuyCount, faithInPrior=None):
    # each list has one entry per product; faithInPrior defaults to treating
    # every prior as being worth 10 virtual customers
    if faithInPrior is None: faithInPrior = [10 for x in buyCount]
    # add p*f virtual buyers and (1-p)*f virtual non-buyers per product
    adjustedBuyCount = [b + p*f for b, p, f in
                        zip(buyCount, priorProbs, faithInPrior)]
    adjustedNoBuyCount = [n + (1-p)*f for n, p, f in
                          zip(noBuyCount, priorProbs, faithInPrior)]
    return [b / (b + n) for b, n in zip(adjustedBuyCount, adjustedNoBuyCount)]
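
As a quick illustrative check (the numbers are my own, not from the question): with one product whose prior is 0.1, worth the default 10 virtual customers, plus 3 real buys and 37 real no-buys, the adjusted counts are 3 + 0.1*10 = 4 buys and 37 + 0.9*10 = 46 no-buys, giving 4/50 = 0.08:

# hypothetical single-product example
print(estimateProbability([0.1], [3], [37]))   # -> [0.08]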
Jouni K. Seppänen