ansaurus

Question

how would I sort a list and get the top K elements? (STL)

Answer 1

A:

Not sure about pre-canned algorithms, but take a look at selection algorithms; if you need the top K elements of a set of N values and N is much larger than K, there are much more efficient methods.

If you can create an indexing class (like @user470379's answer -- basically a class that encapsulates a pointer/index to the "real" data which is read-only), then use a priority queue of maximum size K, and add each unsorted element to the priority queue, popping off the bottom-most element when the queue reaches size K+1. In cases like N = 10⁶, K = 100, this handles cases much more simply + efficiently than a full sort.

Jason S 2010-10-19 19:24:20

Answer 2

+1 A:

The first thing that comes to mind is somewhat hackish, but you could define a struct that stored both the double and its original index, then overload the < operator to sort based on the double:

struct s {
    double d;
    int index;
    bool operator < (const struct &s) const {
        return d < s.d;
    }
};

Then you could retrieve the original indices from the struct.

Fuller example:

vector<double> orig;
vector<s> v;
...
for (int i=0; i < orig.size(); ++i) {
    s s_temp;
    s_temp.d = orig[i];
    s_temp.index = i;
    v.push_back(s);
}
sort(v.begin(), v.end());
//now just retrieve v[i].index

This will leave them sorted from smallest to largest, but you could overload the > operator instead and then pass in greater to the sort function if wanted.

2010-10-19 19:25:11

-1: too slow for the thing,that you want.......

Kiril Kirov 2010-10-19 19:44:39

nth_element as you suggested wouldn't work if the top N needed to be sorted themselves; partial_sort would only be significantly quicker if K was pretty small and n rather large -- n log(K) as opposed to n log(n); but feel free to use partial_sort instead if you need the performance, but you'll have to overload the > operator and sort from biggest to smallest then (partial_sort sorts the top part of the elements, not the bottom part)

2010-10-19 20:04:21

Yes, you're right. But when you don't have any expectations about K and N (don't know how much K will be smaller than N), it's better to use partial sort (I think) - it could be faster in 50% of the cases, but it's still a less-time-lost. And just one remark - you don't need to overload operator> to do the oposite and make it seem complicated, if you need. You could just pass a function, defined by you, to partial_sort. For example: `partial_sort( myvector.begin(), myvector.begin() + 5, myvector.end(), myfunction );` It's similar with nth_element, too.

Kiril Kirov 2010-10-20 08:22:30

Answer 3

A:

OK, how about this?

bool isSmaller (std::pair<double, int> x, std::pair<double, int> y)
{
   return x.first< y.first;
}

int main()
{
   //...
   //you have your vector<double> here, say name is d;
   std::vector<std::pair<double, int> > newVec(d.size());
   for(int i = 0; i < newVec.size(); ++i)
   {
      newVec[i].first = d[i];
      newVec[i].second = i;  //store the initial index
   }
   std::sort(newVec.begin(), newVec.end(), &isSmaller);
   //now you can iterate through first k elements and the second components will be the initial indices
}

Armen Tsirunyan 2010-10-19 19:27:32

Answer 4

A:

Use multimap for vector's (value, index) to handle dups. Use reverse iterators to walk results in descending order.

#include <multimap>
#include <vector>
using namespace std;

multimap<double, size_t> indices;
vector<double> values;

values.push_back(1.0);
values.push_back(2.0);
values.push_back(3.0);
values.push_back(4.0);

size_t i = 0;
for(vector<double>::const_iterator iter = values.begin(); 
        iter != values.end(); ++iter, ++i)
{
    indices.insert(make_pair<double,int>(*iter, i));
}

i = 0;
size_t limit = 2;
for (multimap<double, size_t>::const_reverse_iterator iter = indices.rbegin(); 
    iter != indices.rend() && i < limit; ++iter, ++i)
{
    cout << "Value " << iter->first << " index " << iter->second << endl;
}

Output is

Value 4 index 3

Value 3 index 2

If you just want the vector indices after sort, use this:

#include <algorithm>
#include <vector>
using namespace std;

vector<double> values;

values.push_back(1.0);
values.push_back(2.0);
values.push_back(3.0);
values.push_back(4.0);

sort(values.rbegin(), values.rend());

The top K entries are indexed by 0 to K-1, and appear in descending order. This uses reverse iterators combined with standard sort (using less<double> to achieve descending order when iterated forward. Equivalently:

sort(values.rbegin(), values.rend(), less<double>());

Sample code for the excellent nth_element solution suggested by @Kiril here (K = 125000, N = 500000). I wanted to try this out, so here it is.

vector<double> values;

for (size_t i = 0; i < 500000; ++i)
{
    values.push_back(rand());
}

nth_element(values.begin(), values.begin()+375000, values.end());
sort(values.begin()+375000, values.end());

vector<double> results(values.rbegin(), values.rbegin() + values.size() - 375000);

Steve Townsend 2010-10-19 19:28:20

Answer 5

+3 A:

you could use the nth_element STL algorithm - this will return you the N greatest elements ( this is the fastest way,using stl ) and then use .sort on them,or you could use the partial_sort algorithm,if you want the first K elements to be sorted (:

Using just .sort is awful - it is very slow for the purpose you want.. .sort is great STL algorithm,but for sorting the whole container,not just the first K elements (; it's not an accident the existung of nth_element and partial_sort ;)

Kiril Kirov 2010-10-19 19:31:04

+1 for finding nth_element. (please link to official docs though)

Jason S 2010-10-19 19:35:47

http://cplusplus.com/reference/algorithm/nth_element , there you can find the partial_sort, too

Kiril Kirov 2010-10-19 19:42:59

@Jason: Where do you find "official docs" online to link to?

GMan 2010-10-19 20:13:18

I guess I meant "formal" rather than "official". Look at SGI or HP (the former is online, the latter since Stepanov was working at HP when STL came out)

Jason S 2010-10-19 20:37:26

Would the usage be like the following? `vector< double > dblVec; nth_element( ` The result is in `dblVec[ 0 ]` to `dblLvec[ K - 1 ]`.

ArunSaha 2010-10-19 23:22:51

using default comparison (operator <): `nth_element( myvector.begin(), myvector.begin() + 666, myvector.end() );` Using function for comparing: `nth_element( myvector.begin(), myvector.begin() + 666, myvector.end(), myfunction );`

Kiril Kirov 2010-10-20 08:24:21

Answer 6

A:

So you actually need a structure that maps indices to corresponding doubles.

You could use std::multimap class to perform this mapping. As Jason have noted std::map does not allow duplicate keys.

std::vector<double> v; // assume it is populated already
std::multimap<double, int> m;
for (int i = 0; i < v.size(); ++i)
    m.insert(std::make_pair(v[i], i));
...

After you've done this you could iterate over first ten elements as map preserves sorting of keys to the elements.

Keynslug 2010-10-19 19:32:12

this doesn't work if there are duplicates.

Jason S 2010-10-19 19:34:16

ansaurus

tags:

views:

answers:

how would I sort a list and get the top K elements? (STL)

related questions