views:

21

answers:

2

For a sequence of values, how would one find the sorted middle 50% of the sequence in Boost Accumulators?

For example, let's say that I have the following sequence,
25, 21, 9, 13, 17, 19, 12, 29, 50, 97, 10, 11.

The middle 50% data I would like to have is as follows:
13, 17, 19, 21.

Of course, one can sort the sequence, which now becomes
9, 10, 11, 12, 13, 17, 19, 21, 25, 29, 50, 97.

And then one can collect the middle 50% data.

Now, does the Accumulators framework internally store and sort the sequence? If yes, is it possible to retrieve the value that reside in a particular index?

Reading from here, I think the Accumulators framework does not store the original data and this framework is not appropriate for the problem I am trying to solve.

While writing this, I find it a bit foolish to try to accomplish this using Accumulators. However, I was using it for other purposes and I was expecting a solution in Accumulators.

Now, is it possible to build a data structure that efficiently maintain a current and sorted middle 50% data in a way that the size of the data structure almost never exceeds the half of the size of the sequence?

Thinking about it for a while, I guess it may not be possible to devise such a data structure. At the first thought, I thought that some of the values can be forgotten/discarded permanently assuming they will never appear in the sorted middle 50%. However, this assumption is probably wrong, and some values might reappear in the sorted middle 50% depending on the yet to arrive values in the sequence.

+2  A: 

You're looking for a selection algorithm.

Using a selection algorithm, you can find the 25th and 75th percentile elements of your data set in linear time. Then, you just loop through your array and pick out all of the elements that are between the 25th and 75th percentiles to get the middle 50% you're looking for.

Adam Rosenfield
+1  A: 

As @Adam Rosenfeld pointed out, you're looking for Hoare's selection algorithm, which is a variant of the Quick Sort.

What he didn't point out is that with a bit of care, you can have it put the elements you care about in the right place as part of the selection process.

The selection algorithm partitions the data into those less than and those greater than the selected element. For example, assume you had an array of 100 elements, and you wanted the 25th. It would arrange the elements so the first through 24th elements in the array were smaller than the 25th, then would be the 25th, then the larger elements. Those on each side are only ordered with respect to that one element though, not compared to each other.

You can still take advantage of that to get the middle 50% quickly: first select the 25th percentile. Then specify only the part above the 25th as input, and find the element 2/3rds of the way across that part. That will give you the 25th, the 75th, and (the important part) all of the elements whose values are between those will also be arranged between those elements (though inside of that range, the order will be basically random).

Jerry Coffin