I need to take a C++ vector with potentially a lot of elements, erase duplicates, and sort it. Looks like this code will do the trick: (Correction--it won't; next time I'll test before posting. Thanks for the feedback.)

vec.erase(
      std::unique(vec.begin(), vec.end()),
      vec.end());
std::sort(vec.begin(), vec.end());

Is it faster to erase the duplicates first (as coded above) or perform the sort first? If I do perform the sort first, is it guaranteed to remain sorted after std::unique is executed?

Or is there a more efficient way to do all this?

+7  A: 

std::unique only works on consecutive runs of duplicate elements, so you'd better sort first. However, it is stable, so your vector will remain sorted.
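
For example, swapping the two calls from the question gives the intended result (a minimal sketch using the same vec):

std::sort(vec.begin(), vec.end());        // sort first so duplicates become adjacent
vec.erase(
      std::unique(vec.begin(), vec.end()), // shift unique elements to the front
      vec.end());                          // drop the leftover tail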

David Seiler
+3  A: 

You need to sort it before you call unique because unique only removes duplicates that are next to each other.

edit: beaten by 38 seconds...

David Johnstone
+3  A: 

unique only removes consecutive duplicate elements (which is necessary for it to run in linear time), so you should perform the sort first. It will remain sorted after the call to unique.

Peter
+7  A: 

std::unique only removes duplicate elements if they're neighbours: you have to sort the vector first before it will work as you intend.

std::unique is defined to be stable, so the vector will still be sorted after running unique on it.

jskinner
+2  A: 

As already stated, unique requires a sorted container. Additionally, unique doesn't actually remove elements from the container. Instead, it shifts the unique elements toward the front, leaves the elements past the new logical end in an unspecified state, and returns an iterator pointing to that new logical end; you are expected to call erase to actually remove the leftover elements.

Max Lybbert
Does unique require a sorted container, or does it simply only rearrange the input sequence so it contains no adjacent duplicates? I thought the latter.
Roger Pate
@Pate, you're correct. It doesn't require one. It removes adjacent duplicates.
sharth
If you have a container that may have duplicates, and you want a container with no duplicated values anywhere in it, then you must first sort the container, then pass it to unique, and then use erase to actually remove the duplicates. If you simply want to remove adjacent duplicates, you won't have to sort the container, but you will end up with duplicated values: 1 2 2 3 2 4 2 5 2 will be changed to 1 2 3 2 4 2 5 2 if passed to unique without sorting, and to 1 2 3 4 5 if sorted, then passed to unique and erase.
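
A quick sketch demonstrating both outcomes on that sequence (the includes and main wrapper are added here to make it runnable, and aren't from the thread):

#include <algorithm>
#include <vector>

int main()
{
    int raw[] = {1, 2, 2, 3, 2, 4, 2, 5, 2};
    std::vector<int> a(raw, raw + 9), b(raw, raw + 9);

    // unique without sorting: only adjacent duplicates collapse
    a.erase(std::unique(a.begin(), a.end()), a.end());
    // a is now: 1 2 3 2 4 2 5 2

    // sort first, then unique + erase: all duplicates removed
    std::sort(b.begin(), b.end());
    b.erase(std::unique(b.begin(), b.end()), b.end());
    // b is now: 1 2 3 4 5
}
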
Max Lybbert
+13  A: 

I'm not sure what you are using this for, so I can't say this with 100% certainty, but normally when I think "sorted, unique" container, I think of a std::set. It might be a better fit for your use case:

std::set<Foo> foos(vec.begin(), vec.end()); // both sorted & unique already

Otherwise, sorting prior to calling unique (as the other answers pointed out) is the way to go.
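
If the result needs to end up back in the vector, a round trip through the set also works (a sketch; Foo must be comparable with operator< here):

std::set<Foo> foos(vec.begin(), vec.end()); // sorted & unique
vec.assign(foos.begin(), foos.end());       // copy the result back into the vector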

Todd Gardner
Right to the point! std::set is specified to be a sorted unique set. Most implementations use a balanced binary search tree (typically a red-black tree) or something similar.
notnoop
+1 Thought of set as well. Didn't want to duplicate this answer.
Tom
Is std::set guaranteed to be sorted? It makes sense that in practice it would be, but does the standard require it?
MadCoder
Yup, see 23.1.4.9: "The fundamental property of iterators of associative containers is that they iterate through the containers in the non-descending order of keys where non-descending is defined by the comparison that was used to construct them"
Todd Gardner
@MadCoder: It doesn't necessarily "make sense" that a set is implemented in a way that is sorted. There are also sets implemented using hash tables, which are not sorted. In fact, most people prefer using hash tables when available. But the naming convention in C++ just so happens that the sorted associative containers are named simply "set" / "map" (analogous to TreeSet / TreeMap in Java); and the hashed associative containers, which were left out of the standard, are called "hash_set" / "hash_map" (SGI STL) or "unordered_set" / "unordered_map" (TR1) (analogous to HashSet and HashMap in Java)
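
For illustration, the hashed variant mentioned above might look like this (a sketch using C++11's std::unordered_set; note that it discards ordering, so you sort afterwards if you still need a sorted result):

#include <algorithm>
#include <unordered_set>
#include <vector>

void dedupeViaHash(std::vector<int>& vec) // hypothetical helper, not from the thread
{
    std::unordered_set<int> seen(vec.begin(), vec.end()); // unique, but unordered
    vec.assign(seen.begin(), seen.end());
    std::sort(vec.begin(), vec.end()); // restore sorted order if required
}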
newacct
+3  A: 

Efficiency is a complicated concept. There are time vs. space considerations, as well as general measurements (where you only get vague answers such as O(n)) vs. specific ones (e.g. bubble sort can be much faster than quicksort, depending on input characteristics).

If you have relatively few duplicates, then sort followed by unique and erase seems the way to go. If you had relatively many duplicates, creating a set from the vector and letting it do the heavy lifting could easily beat it.

Don't just concentrate on time efficiency either. Sort+unique+erase operates in O(1) space, while the set construction operates in O(n) space. And neither directly lends itself to a map-reduce parallelization (for really huge datasets).

Roger Pate
What would give you map/reduce ability? The only one I can think of is a distributed merge sort, and you can still only use one thread in the final merge.
Zan Lynx
Yes, you must have one controlling node/thread. However, you can divide the problem as many times as required to place upper limits on the number of worker/child threads the controlling/parent thread deals with, and on the size of the dataset each leaf node must process. Not all problems are easily solvable with map-reduce, I simply wanted to point out there are people who deal with similar (on the surface, anyway) optimization issues, where dealing with 10 terabytes of data is called "Tuesday".
Roger Pate
+11  A: 
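
(The original answer body, a benchmark with a graph of runtime against the number of duplicates, is not preserved here. Judging from the comments below, the approaches compared were roughly the following; this is a hedged reconstruction, not the original code.)

#include <algorithm>
#include <set>
#include <vector>

// 1) sort + unique + erase, entirely in place
void sortUniqueErase(std::vector<int>& vec)
{
    std::sort(vec.begin(), vec.end());
    vec.erase(std::unique(vec.begin(), vec.end()), vec.end());
}

// 2) build a set via its range constructor, then copy back
void setConstructor(std::vector<int>& vec)
{
    std::set<int> s(vec.begin(), vec.end());
    vec.assign(s.begin(), s.end());
}

// 3) insert into a set manually, then copy back
void setManual(std::vector<int>& vec)
{
    std::set<int> s;
    for (std::vector<int>::const_iterator it = vec.begin(); it != vec.end(); ++it)
        s.insert(*it);
    vec.assign(s.begin(), s.end());
}
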
Nate Kohl
+1 Nice evidence based rationale
Faisal Vali
Why convert back to a vector?
Logan Capaldo
I'm shocked that the constructor approach is consistently measurably worse than manual insertion. You would think that apart from some tiny constant overhead, it would just do the manual thing. Can anyone explain this?
Ari
Question: could this line in your first approach: vec.erase(unique(vec.begin(), vec.end()), vec.end()); be replaced with: vec.resize(unique(vec.begin(), vec.end()) - vec.begin()); and if so, would it be any faster on a large vector?
Dan
@Dan: Sure, I think resize would work. It didn't seem any faster on a little test data, but a more thorough experiment might prove otherwise.
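
For reference, the resize variant from the comment would be (a sketch, equivalent in effect to the erase call):

std::sort(vec.begin(), vec.end());
vec.resize(std::unique(vec.begin(), vec.end()) - vec.begin()); // trim off the leftover tail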
Nate Kohl
Cool, thanks for the graph. Could you give a sense of what the units are for Number of Duplicates? (i.e., around how large is "large enough"?)
Kyle Ryan
@Kyle: It's pretty large. I used datasets of 1,000,000 integers drawn randomly from the ranges 1-1000, 1-100, and 1-10 for this graph.
Nate Kohl
+2  A: 

Here's a template to do it for you:

template<typename T>
void removeDuplicates(std::vector<T>& vec)
{
    std::sort(vec.begin(), vec.end());
    vec.erase(std::unique(vec.begin(), vec.end()), vec.end());
}

call it like:

removeDuplicates<int>(vectorname);
DShook
+1 Templatize away! - but you can just write removeDuplicates(vec) without explicitly specifying the template arguments
Faisal Vali
Or even better, just have it take templated iterators directly (begin and end), and you can run it on other structures besides a vector.
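
An iterator-pair version along those lines might look like this (a sketch; since erasing requires the container itself, it returns the new logical end and leaves the erase to the caller):

#include <algorithm>

// Sorts [begin, end) and returns an iterator to the new logical end;
// requires random-access iterators (for std::sort).
template<typename Iter>
Iter sortAndUnique(Iter begin, Iter end)
{
    std::sort(begin, end);
    return std::unique(begin, end);
}

// usage: vec.erase(sortAndUnique(vec.begin(), vec.end()), vec.end());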
Kyle Ryan