tags:

views:

149

answers:

3

I have an application with some probabilities of measured features. I want to select n-best features from vector. I have a vector of real numbers. Vector is normalized, sum of all numbers is 1 (it is probability of some features).

I want to select group of n less than N (assume approx. 8) largest numbers. Numbers has to be close together without gaps and they're also should have large sum (sum of remaining numbers should be several times lower).

Any ideas how to accomplish that?

I tried to use 80% quantile (but it is not sensitive to relative large gaps like [0.2, 0.2, 0.01, 0.01, 0.001, 0.001 ... len ~ 100] ), I tried a some treshold between two successive numbers, but nothing work too good.

I have some partial solution at this moment but I am just wondering if there is some simple solution that I have overlooked.

+2  A: 

It sounds like you're wanting to select the n largest probabilities but the number n is flexible. If n were fixed, say n=10, you could just sort your vector and pull out the top 10 items. But from your example it sounds like you'd like to use a smaller value of n if there's a natural break in the data. Maybe you want to start with the largest probability and go down the list selecting items until the sum of the probabilities you pick crosses some threshold.

Maybe you have an implicit optimization problem where you want to maximize some probability with some penalty for large n. Try stating your problem that way. You might find your own answer, or you might be able to rephrase your question here in a way that helps other people give you a better answer.

John D. Cook
Sorry for that: half of the post is missing. I am going to clarify that.
Jiri
+3  A: 

John's answer is good. Also you might try

  • sort the probabilities
  • find the largest gap between successive probabilities
  • work up from there

From there, it's starting to sound like a pattern-recognition problem.
My favorite method is markov-chain-monte-carlo(MCMC).

Edit: Since you clarified your question, my first thought is, since you only have 8 possible answers, develop a score for each one, based on how much probability it contains and whether or not it splits at a gap, and make a heuristic judgement.

Further edit: This sounds a bit like logistic regression. You want to find a value of P that effectively divides your set into members and non-members. For a given value of P, you can compute a log-likelihood for the ensemble, and choose P that maximizes that.

Mike Dunlavey
I have them sorted already. I have tried to find a gap, but 'size' of the gap varies as count of the numbers grows and also as 'size' of the largest group grows. Example: in the largest group of 3 numbers, they will have value around 0.3, but 8 large numbers will have value around 1/8.
Jiri
Couldn't you just go down the list, pick the largest gap, and use that?
Mike Dunlavey
"Example: in the largest group of 3 numbers, they will have value around 0.3, but 8 large numbers will have value around 1/8." Ok - so _what_ do you want then? I don't think your question can be solved rationally.
Daniel Daranas
I have just tried to implement 'largest gap'. It has passed all my tests and it is much simpler than my own solution. Thanks.
Jiri
You're welcome. (Aside: we have a house guest for the last year, from the Czech Republic, named Jiri, so I even know how to pronounce it [not that I do it correctly].)
Mike Dunlavey
+1  A: 

I'm not really sure if this is what you want, but it seems you want to do the following.

Lets assume that the probabilities are x_1,...,x_N in increasing order. Then you should try to find 1<= i < j <= N such that the function

f(i,j)  =  (x_i + x_(i+1) + ... + x_j)/(x_j - x_i)

is maximized. This can be done naively in quadratic time.

Il-Bhima
Interesting. I've got to think about that.
Mike Dunlavey