A user visits my website at time t and may or may not click on a particular link I care about. If they do, I record the fact that they clicked the link, along with the time elapsed between t and the click; call this d.

I need an algorithm that allows me to create a class like this:

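abstract class ClickProbabilityEstimate {
    // Record that the impression with this unique id was shown.
    public abstract void reportImpression(long id);
    // Record that the impression with this id received a click.
    public abstract void reportClick(long id);

    // Estimate the probability that this impression will still be clicked.
    public abstract double estimateClickProbability(long id);
}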

Every impression gets a unique id, and this is used when reporting a click to indicate which impression the click belongs to.

I need an algorithm that returns the probability that an impression will still receive a click, given how much time has passed since the impression was reported and how long previous clicks took to arrive. Clearly, one would expect this probability to decrease over time while there is still no click.

If necessary, we can set an upper bound beyond which we consider the click probability to be 0 (e.g. if it's been an hour since the impression occurred, we can be pretty sure there won't be a click).

The algorithm should be both space and time efficient, and hopefully make as few assumptions as possible, while being elegant. Ease of implementation would also be nice. Any ideas?

+2  A: 

Assuming you keep data on past impressions and clicks, it's easy: let's say that you have an impression, and a time d' has passed since that impression. You can divide your data into three groups:

  1. Impressions which received a click in less than d'
  2. Impressions which received a click after more than d'
  3. Impressions which never received a click

Clearly the current impression is not in group (1), so eliminate that. You want the probability it is in group (2), which is then

P = N2 / (N2 + N3)

where N2 is the number of impressions in group 2, and similarly for N3.
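
To make the formula concrete, here is a worked example with made-up counts (the numbers are purely illustrative, not real data): suppose that for the elapsed time d' in question, the past data splits into N1 = 50, N2 = 30 and N3 = 920 impressions.

```latex
% Before any time has passed, nothing is ruled out yet:
P_0 = \frac{N_1 + N_2}{N_1 + N_2 + N_3} = \frac{80}{1000} = 0.08
% After d' has passed without a click, group 1 is eliminated:
P = \frac{N_2}{N_2 + N_3} = \frac{30}{30 + 920} = \frac{30}{950} \approx 0.032
```

With these counts the estimate drops from 8% to about 3.2%, which matches the expectation that the probability shrinks as time passes without a click.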

As for the actual implementation, my first thought would be to keep an ordered list of the times d for past impressions which did receive clicks, along with a count of the number of impressions which never received a click (this count is N3), and just do a binary search for d' in that list. The position you find will give you N1, the number of impressions in group 1, and then N2 is the length of the list minus N1.
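
A minimal sketch of the sorted-list idea, under some assumptions: the class and method names (`SortedDelayEstimator`, `recordClick`, `recordNoClick`, `estimate`) are hypothetical, it works on durations directly rather than on impression ids, and the caller decides when an impression counts as "never clicked" (for instance, once the upper bound from the question has passed). Past clicks with a delay exactly equal to d' are counted in group 1, which is an arbitrary choice for ties.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; names are illustrative, not from the original answer.
class SortedDelayEstimator {
    // Click delays d of past clicked impressions, kept in ascending order.
    private final List<Long> clickDelays = new ArrayList<>();
    // N3: past impressions that never received a click.
    private long neverClicked = 0;

    // Binary search for the number of stored delays <= elapsed.
    private int countAtMost(long elapsed) {
        int lo = 0, hi = clickDelays.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (clickDelays.get(mid) <= elapsed) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    void recordClick(long delay) {
        // Insert after any equal delays so the list stays sorted.
        clickDelays.add(countAtMost(delay), delay);
    }

    void recordNoClick() {
        neverClicked++;
    }

    // P = N2 / (N2 + N3) for an impression with no click after `elapsed`.
    double estimate(long elapsed) {
        int n1 = countAtMost(elapsed);          // group 1: already ruled out
        int n2 = clickDelays.size() - n1;       // group 2: clicked later than elapsed
        long denom = n2 + neverClicked;
        return denom == 0 ? 0.0 : (double) n2 / denom;
    }
}
```

Lookups are O(log n), but inserting into an `ArrayList` is O(n) because later elements are shifted; whether that matters depends on how many clicks are stored.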

If you don't need perfect granularity, you can store the past times as a histogram instead, i.e. a list that contains, in each element list[n], the number of impressions that received a click after at least n but less than n+1 minutes (or seconds, or whatever time interval you like). In that case you'd probably want to keep the total number of clicks as a separate variable so you can easily compute N2.
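
The histogram variant could look roughly like the sketch below. Again the names (`HistogramDelayEstimator`, `bucketWidth`, and so on) are hypothetical, and a few choices are assumptions rather than part of the answer: delays are non-negative, clicks arriving later than the last bucket are clamped into it, and any elapsed time past the last bucket returns 0, following the upper bound suggested in the question. Clicks that fall in the same bucket as d' are counted in group 2, so the estimate is only as precise as the bucket width.

```java
// Hypothetical helper; names and the clamping policy are assumptions.
class HistogramDelayEstimator {
    // clicksPerBucket[n]: clicks with delay in [n * width, (n + 1) * width).
    private final long[] clicksPerBucket;
    private final long bucketWidth;
    private long totalClicks = 0;   // N1 + N2
    private long neverClicked = 0;  // N3

    HistogramDelayEstimator(int buckets, long bucketWidth) {
        this.clicksPerBucket = new long[buckets];
        this.bucketWidth = bucketWidth;
    }

    void recordClick(long delay) {
        // Clamp very late clicks into the last bucket.
        int b = (int) Math.min(delay / bucketWidth, clicksPerBucket.length - 1);
        clicksPerBucket[b]++;
        totalClicks++;
    }

    void recordNoClick() {
        neverClicked++;
    }

    // P = N2 / (N2 + N3), at bucket granularity.
    double estimate(long elapsed) {
        long idx = elapsed / bucketWidth;
        if (idx >= clicksPerBucket.length) {
            return 0.0; // past the upper bound: treat a click as impossible
        }
        long n1 = 0; // clicks in buckets that lie entirely before `elapsed`
        for (int i = 0; i < idx; i++) {
            n1 += clicksPerBucket[i];
        }
        long n2 = totalClicks - n1;
        long denom = n2 + neverClicked;
        return denom == 0 ? 0.0 : (double) n2 / denom;
    }
}
```

Memory is fixed by the number of buckets regardless of traffic. The loop over earlier buckets is O(buckets) per query; keeping running prefix sums would remove it if that ever becomes a bottleneck.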

(By the way, I just made this up; I don't know whether there are standard algorithms for this sort of thing that may be better.)

David Zaslavsky
+1  A: 

See this article:

Estimating the chances of something that hasn't happened yet.

John D. Cook
A: 

I would suggest hypothesizing an arrival process (clicks per minute) and trying to fit a distribution to it using your existing data. I'll bet the result is negative binomial, which is what you get from a Poisson arrival process with a non-stationary mean when that mean has a gamma distribution. The inverse (minutes per click) gives you the distribution of the interarrival times. I don't know if there's a distribution named for that, but you can create an empirical one.

Hope this helps.

Grembo