views: 284
answers: 6
Hi,
Take for example the list (L):
John, John, John, John, Jon

We presume one item is correct (e.g. John in this case) and want to give a probability that it is correct. First (and good!) attempt: MostFrequentItem(L).Count / L.Count (e.g. 4/5, or an 80% likelihood)
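For concreteness, that first attempt might be sketched like this in Python (`most_frequent_confidence` is a made-up name, not from the post):

```python
# Baseline: confidence = frequency of the most common item / list length.
from collections import Counter

def most_frequent_confidence(items):
    """Return (most_frequent_item, its count divided by the list length)."""
    item, count = Counter(items).most_common(1)[0]
    return item, count / len(items)

print(most_frequent_confidence(["John", "John", "John", "John", "Jon"]))
# → ('John', 0.8)
```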

But consider the cases:
John, John, Jon, Jonny
John, John, Jon, Jon

I want the likelihood of the correct item being John to be higher for the first list! I know I have to count the SecondMostFrequentItem and compare them.

Any ideas? This is really busting my brain!
Thx, Andrew

+2  A: 

Maybe Edit Distance? Just a direction to a solution, though...

Ron Klein
Thx. Edit distance is more to do with string similarity. This problem has to do with pure statistics.
Andrew White
The examples are similar, so I thought it's a combination of stats and similarity
Ron Klein
A: 

I'm not sure why you need to calculate the second most frequent item. In the latter example, couldn't you simply look at (the number of matching entries) / (the total number of entries) and say that it's correct with 4/8 probability? Is that not a sufficient metric? You would then also say that Jon has a 3/8 probability of being correct and Jonny has 1/8?

Why would that be insufficient for your purposes?

JF
I believe what he was getting at is that [John, John, Jon, Jonny] and [John, John, Jon, Jon] are 2 separate lists, and that 'John' should score higher on the first list than the 2nd.
Davy8
Andrew White
+1  A: 

Just off the top of my head: what if you compare the % occurrence vs. the % if all items had an equal number of occurrences?

In your example above
John, John, Jon, Jonny
50% John
25% Jon
25% Jonny
33.3% Normal? (I'm making up a word because I don't know what to call this. 3 items: 100%/3)
John's score = 50% - 33.3% = 16.7%

John, John, Jon, Jon
50% John
50% Jon
50% Normal (2 items, 100%/2)
John's score = 50% - 50% = 0%

If you had [John, John, John, Jon, Jon] then John's score would be 60%-50% = 10% which is lower than the first case, but higher than the 2nd (hopefully that's the desired result, otherwise you'll need to clarify more what the desired results should be)

In your first case [John, John, John, John, Jon] you'd get 80%-50% = 30%
For [John, John, John, John, Jon, Jonny] you'd get 66.6%-33.3% = 33.3%
That may or may not be what you want.

Where the above might factor in more is if you had John*97+Jon+Jonny+Johnny, that would give you 97%-25% = 72%, but John*99+Jon would only give you a score of 99%-50% = 49%

You'd need to figure out how you want to handle the degenerate case of them all being the same, otherwise you'd get a score of 0% for that which is probably not what you want.

EDIT (okay I made lots of edits, but this one isn't just more examples :p)
To normalize the results, take the score as calculated above and divide it by the maximum possible score given the number of distinct values. (Okay, that sounds more complicated than it needs to be; example time.)

Example:
[John, John, Jon, Jonny] 50% - 33.3% = 16.7%. That's the previous score, but with 3 items the upper limit of your score would be 100%-33.3% = 66.6%, so if we take that into account, we have 16.7/66.6 = 25%

[John, John, Jon, Jon] gives (50-50) /50 = 0%
[John, John, John, Jon, Jon] gives (60-50) /50 = 20%
[John, John, John, John, Jon] gives (80-50)/50 = 60%
[John, John, John, John, Jon, Jonny] gives (66.6-33.3)/(66.6)= 50%
[John*97, Jon, Jonny, Johnny] gives (97-25)/75 = 96%
[John*99, Jon] gives (99-50)/50 = 98%
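The normalization above can be sketched in Python like this (`normalized_score` is a made-up name; the formula is the one worked through in the examples: the top item's frequency minus the "uniform" frequency 1/k, divided by the maximum possible gap 1 - 1/k, where k is the number of distinct values):

```python
# Normalized score: how far the top item's frequency exceeds a uniform
# split over the k distinct values, scaled to the maximum possible excess.
from collections import Counter

def normalized_score(items):
    counts = Counter(items)
    k = len(counts)
    if k == 1:                     # degenerate case: everything agrees
        return 1.0
    p_max = counts.most_common(1)[0][1] / len(items)
    uniform = 1 / k
    return (p_max - uniform) / (1 - uniform)

print(normalized_score(["John", "John", "Jon", "Jonny"]))  # ≈ 0.25
print(normalized_score(["John", "John", "Jon", "Jon"]))    # 0.0
print(normalized_score(["John"] * 4 + ["Jon"]))            # ≈ 0.6
```

Note the explicit `k == 1` branch for the degenerate all-identical case mentioned earlier, which would otherwise divide by zero.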

Davy8
Nice thinking! I didn't have the number of unique items in the list. But I could calculate it. Until now I just had the frequency of the 2 most common elements and the list count.
Andrew White
@Andrew I meant to make the edit above sooner, but Stackoverflow went down for maintenance as I was typing it up, but I think that should give you what you need.
Davy8
@Davy, thanks! I like the way this is going iteratively! The results are a bit pessimistic, though. [John, John, Jon, Jon] is a good example: there is definitely a 50% chance it is John (not 0)! Likewise, "John" in [John, John, John, Jon, Jon] has an unbiased 50% chance of being right, but I would bias it higher...
Andrew White
A: 

I think you'd need a kind of scoring system.

Solely identifying the different tokens is not sufficient:

[John, Johen, Jon, Jhon, Johnn]

With your algorithm there is no clear winner here, whereas the most probable name is 'John', the others being just 1 away in the Damerau-Levenshtein distance.

So I would do a 2-steps process:

  1. Count the occurrences of each word
  2. For each word, add a "bonus" for each other word, inversely proportional to their distance

For the bonus, I would propose the following formula:

lhs = 'John'
rhs = 'Johen'

d = distance(lhs,rhs)
D = max( len(lhs), len(rhs) ) # Maximum distance possible

tmp = score[lhs]
score[lhs] += (1-d/D)*score[rhs]
score[rhs] += (1-d/D)*tmp

Note that each pair should be processed only once: do not apply this first for (John, Johen) and then again for (Johen, John).

Example:

# 1. The occurences
John  => 1
Johen => 1
Jon   => 1
Jhon  => 1
Johnn => 1

# 2. After executing it for John
John  => 4.1  = 1 + 0.80 + 0.75 + 0.75 + 0.80
Johen => 1.8  = (1) + 0.80
Jon   => 1.75 = (1) + 0.75
Jhon  => 1.75 = (1) + 0.75
Johnn => 1.8  = (1) + 0.80

# 3. After executing it for Johen (not recounting John :p)
John  => 4.1  = (1 + 0.80 + 0.75 + 0.75 + 0.80)
Johen => 3.8  = (1 + 0.80) + 0.60 + 0.60 + 0.80
Jon   => 2.35 = (1 + 0.75) + 0.60
Jhon  => 2.35 = (1 + 0.75) + 0.60
Johnn => 2.6  = (1 + 0.80) + 0.80

# 4. After executing it for Jon (not recounting John and Johen)
John  => 4.1  = (1 + 0.80 + 0.75 + 0.75 + 0.80)
Johen => 3.8  = (1 + 0.80 + 0.60 + 0.60 + 0.80)
Jon   => 3.7  = (1 + 0.75 + 0.60) + 0.75 + 0.60
Jhon  => 3.1  = (1 + 0.75 + 0.60) + 0.75
Johnn => 3.2  = (1 + 0.80 + 0.80) + 0.60

# 5. After executing it for Jhon(not recounting John, Johen and Jon)
John  => 4.1  = (1 + 0.80 + 0.75 + 0.75 + 0.80)
Johen => 3.8  = (1 + 0.80 + 0.60 + 0.60 + 0.80)
Jon   => 3.7  = (1 + 0.75 + 0.60 + 0.75 + 0.60)
Jhon  => 3.7  = (1 + 0.75 + 0.60 + 0.75) + 0.60
Johnn => 3.8  = (1 + 0.80 + 0.80 + 0.60) + 0.60

I'm not sure it's perfect and I have no idea how to transform this into some kind of percentage... but I think it gives a pretty accurate idea of the most likely item. Perhaps the bonus ought to be lessened (by what factor?). But check this degenerate case:

[John*99, Jon]

# 1. Occurences
John => 99
Jon  => 1

# 2. Applying bonus for John
John => 99.8 = (99) + 0.80
Jon  => 80.2 = (1) + 0.80*99

As you can see, it can't be directly converted into some kind of percentages: 99.8% of the correct result being 'John' seems low here!

Note: Implementing the distance efficiently is hard, kudos to Peter Norvig for his Python solution!
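The two-step process might be sketched like this. For brevity this uses plain Levenshtein distance rather than Damerau-Levenshtein, so a transposition like Jhon costs 2 edits instead of 1 and scores a little lower than in the worked example above; the bonus formula is the one proposed, applied once per unordered pair using the occurrence counts:

```python
# Two-step scoring: count occurrences, then give each word a bonus from
# every other word, inversely proportional to their edit distance.
from collections import Counter
from itertools import combinations

def levenshtein(a, b):
    """Plain Levenshtein distance (no transposition), O(len(a)*len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def bonus_scores(items):
    counts = Counter(items)
    score = dict(counts)                       # step 1: raw occurrences
    for lhs, rhs in combinations(counts, 2):   # each unordered pair once
        d = levenshtein(lhs, rhs)
        D = max(len(lhs), len(rhs))            # maximum possible distance
        score[lhs] += (1 - d / D) * counts[rhs]
        score[rhs] += (1 - d / D) * counts[lhs]
    return score

print(bonus_scores(["John", "Johen", "Jon", "Jhon", "Johnn"]))
# 'John' comes out on top, as in the worked example.
```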

Matthieu M.
Wow, nice idea! The problem is that it is not really string-distance based. My example with the names was just for clarity. It could, for example, be a list of countries: China, China, China, China, US, UK, UAE, Greece. What I am looking for is a probability of the "answer" being China.
Andrew White
+2  A: 

As an extremely simple solution, compared to the more correct but complicated solutions above, you could take counts of every variation, square the counts, and use those to calculate weights. So:

[John, John, Jon, Jonny]

would give John a weight of 4 and the other two a weight of 1, for a probability of 66% that John is correct.

[John, John, Jon, Jon]

would give weights of 4 to both John and Jon, so John's probability is only 50%.
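A sketch of this squared-count weighting (`squared_count_probability` is a hypothetical name):

```python
# Weight each variant by the square of its count, then normalize to
# probabilities. Squaring rewards a clear majority over a split field.
from collections import Counter

def squared_count_probability(items):
    counts = Counter(items)
    weights = {word: c * c for word, c in counts.items()}
    total = sum(weights.values())
    return {word: w / total for word, w in weights.items()}

print(squared_count_probability(["John", "John", "Jon", "Jonny"]))
# John: 4/6 ≈ 0.667, Jon and Jonny: 1/6 each
print(squared_count_probability(["John", "John", "Jon", "Jon"]))
# John and Jon: 0.5 each
```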

tloflin
Is that Count(Max) / (Count(John)² + Count(Jon)² + Count(Jonny)²) = 4 / (2*2 + 1*1 + 1*1) = 4/6?
Andrew White
@Andrew, no, it's count(John)^2 / (count(John)^2 + count(Jon)^2 + count(Jonny)^2). It's the standard count(John)/count(total), just using squared counts instead of normal ones.
tloflin
A: 

First off, I suspect that you are using terms inconsistently. It will help if you use technical terms like "probability" and "likelihood" with strict correctness.

The probability of a thing allows us to reason from a parameter to an outcome. For example, suppose we have an unfair coin which is 60% likely to come up heads. The 60% is a parameter. From that we can reason that the probability of getting two heads in a row is 60% * 60% = 36%.

The likelihood of a thing allows us to reason from an outcome to a parameter. That is, we flip a pair of identical coins a thousand times and discover that we get two heads 36% of the time. We can compute "the likelihood of the probability of heads is 60% given the outcome that 36% of pairs were two heads".

Now, a reasonable question is "how confident can we be that we've deduced the correct parameter given the outcome?" If you flip pairs of coins a million times and get 36% double heads, it seems plausible that we can be very confident that the parameter for one coin is 60%. The likelihood is high. If we flip pairs of coins three times and get double heads 33% of the time, we have very little confidence that the parameter for one coin getting heads is close to 60%. It could be 50% or 40%, and we just got lucky one time in three. The likelihood is low.

All this is a preamble to simply asking you to clarify the question. You have an outcome: a bunch of results. Do you wish to make an estimate of the parameters that produced that outcome? Do you wish to get a confidence interval on that estimate? What exactly are you going for here?
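One concrete way to quantify the confidence described here, sketched as an assumption rather than part of the answer: a normal-approximation (Wald) confidence interval for the observed proportion, which tightens as the sample grows.

```python
# Wald 95% confidence interval for a proportion: p ± z*sqrt(p(1-p)/n).
# Few samples -> wide interval (low confidence in the estimate);
# many samples -> narrow interval around the same estimate.
import math

def wald_interval(successes, trials, z=1.96):   # z = 1.96 ≈ 95% confidence
    p = successes / trials
    half = z * math.sqrt(p * (1 - p) / trials)
    return max(0.0, p - half), min(1.0, p + half)

print(wald_interval(4, 5))        # 4 of 5 say "John": wide interval
print(wald_interval(800, 1000))   # same 80%, far more data: narrow interval
```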

Eric Lippert