Profiling my CPU-bound code has suggested that I spend a long time checking whether a container contains completely unique elements. Assuming that I have some large container of unsorted elements (with < and == defined), I have two ideas on how this might be done:

The first using a set:

template <class T>
bool is_unique(vector<T> X) {
  set<T> Y(X.begin(), X.end());
  return X.size() == Y.size();
}

The second looping over the elements:

template <class T>
bool is_unique2(vector<T> X) {
  typename vector<T>::iterator i,j;
  for(i=X.begin();i!=X.end();++i) {
    for(j=i+1;j!=X.end();++j) {
      if(*i == *j) return 0;
    }
  }
  return 1;
}

I've tested them as best I can, and from what I can gather from reading the documentation about the STL, the answer is (as usual) that it depends. I think that in the first case, if all the elements are unique it is very quick, but if there is a large degeneracy the operation seems to take O(N^2) time. For the nested-iterator approach the opposite seems to be true: it is lightning fast if X[0]==X[1], but takes (understandably) O(N^2) time if all the elements are unique.

Is there a better way to do this, perhaps an STL algorithm built for this very purpose? If not, are there any suggestions to eke out a bit more efficiency?

+5  A: 

The standard library has std::unique, but that would require you to make a copy of the entire container (note that in both of your examples you make a copy of the entire vector as well, since you unnecessarily pass the vector by value).

template <typename T>
bool is_unique(std::vector<T> vec)
{
    std::sort(vec.begin(), vec.end());
    return std::unique(vec.begin(), vec.end()) == vec.end();
}

Whether this would be faster than using a std::set would, as you know, depend :-).

James McNellis
`unique` removes only consecutive duplicates, so this would only work if the vector were sorted.
Fred Larson
@Fred: That's why I sort the copy of the vector.
James McNellis
Asymptotically speaking, this has the same space and time requirements as the `is_unique` in the question. They are both O(n) in space and O(n log n) in time. That is to say, their run times are dominated by sorting (explicit sorting in your example, and the sorting internal to `std::set` in the OP's). My suggestion would be to try both and pick whichever happens to be faster in practice.
Tyler McHenry
I must be getting old. Why couldn't I see that? You didn't slip in an edit just under the 5 min. mark, did you?
Fred Larson
@Tyler actually, no, worse, this is O(n^2) runtime and O(n) space. The _worst case_ of `std::sort()` is O(n^2).
wilhelmtell
@WilhelmTell: _Technically_ `sort()` has no worst-case requirement at all; it does have the average-case n log n requirement, and if you really thought you'd run into perverse cases where you're not going to get real-world n log n, you can use `stable_sort()`, which does have an n log n upper bound requirement.
James McNellis
@James correct: 25.3.1.1 paragraph 3. The only guarantee is that the average is N*log N. The cplusplus.com website misled me! http://www.cplusplus.com/reference/algorithm/sort/ says the worst case of sort is O(n^2)
wilhelmtell
@WilhelmTell of Purple-Magenta: That's because `std::sort` can use something like quicksort, which has O(n^2) worst-case complexity, while `std::stable_sort` is usually implemented using a form of merge sort.
Billy ONeal
@WilhelmTell: I've come to the conclusion that cplusplus.com is of pretty poor quality. I've reported a half dozen errors or so over the last few months and none of them have been corrected.
James McNellis
This is stupidly fast for a simple type like `T` = `int`, far faster than any other example (the copying is so cheap). In general this may not be the case for me, but I'll have to keep this in mind for future use!
Hooked
@James Seriously? Making a mistake is not nearly as bad as not correcting the mistake! Do they reply at least?
wilhelmtell
@WilhelmTell: I got no response at all, which is what irritated me. I don't mind reporting bugs and explaining why they need to be corrected, but if they're not going to fix them... eh. That said, I still use cplusplus.com when I need a quick lookup on something :-P
James McNellis
@James truth, same here. Even if Google works by voting with the mouse or the HTML. :s
wilhelmtell
+5  A: 

Is it infeasible to just use a container that provides this "guarantee" from the get-go? Would it be useful to flag a duplicate at the time of insertion rather than at some point in the future? When I've wanted to do something like this, that's the direction I've gone; just using the set as the "primary" container, and maybe building a parallel vector if I needed to maintain the original order, but of course that makes some assumptions about memory and CPU availability...
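A minimal sketch of that direction (the class and member names here are hypothetical): a wrapper that uses a std::set as the primary container to reject duplicates at insertion time, with a parallel std::vector preserving the original order:

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical wrapper: the set guarantees uniqueness at insertion time,
// while the parallel vector preserves insertion order.
template <class T>
class unique_sequence {
    std::set<T> unique_;      // primary container, rejects duplicates
    std::vector<T> ordered_;  // parallel container, keeps original order
public:
    // Returns false (and leaves both containers unchanged) on a duplicate.
    bool push_back(const T& value) {
        if (!unique_.insert(value).second)
            return false;  // already present: flagged at insertion time
        ordered_.push_back(value);
        return true;
    }
    const std::vector<T>& sequence() const { return ordered_; }
    std::size_t size() const { return ordered_.size(); }
};
```

With this, the uniqueness check never needs to run at all: the property holds by construction.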

dash-tom-bang
+5  A: 

For one thing you could combine the advantages of both: stop building the set, if you have already discovered a duplicate:

template <class T>
bool is_unique(const std::vector<T>& vec)
{
    std::set<T> test;
    for (typename std::vector<T>::const_iterator it = vec.begin(); it != vec.end(); ++it) {
        if (!test.insert(*it).second) {
            return false;
        }
    }
    return true;
}

BTW, Potatoswatter makes a good point that in the generic case you might want to avoid copying T, in which case you might use a std::set<const T*, dereference_less> instead.


You could of course potentially do much better if it wasn't generic. E.g if you had a vector of integers of known range, you could just mark in an array (or even bitset) if an element exists.
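For example (a sketch assuming the values are known to lie in [0, max_value); the function name is made up), marking values in a boolean array makes the check O(N) in time:

```cpp
#include <cstddef>
#include <vector>

// Sketch: assumes every element lies in the known range [0, max_value).
bool is_unique_small_ints(const std::vector<int>& x, int max_value) {
    std::vector<bool> seen(max_value, false);  // one flag per possible value
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (seen[x[i]])
            return false;  // this value was already marked
        seen[x[i]] = true;
    }
    return true;
}
```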

UncleBens
But that's still very expensive because `set` uses dynamic allocations. You're essentially building a `set` when you don't need it. So the solution is correct mathematically, but expensive in practice.
wilhelmtell
@WilhelmTell: Yes, but when you start by sorting the vector, you are gonna have to sort it all, which falls under the same worst case scenario as OP's #1. Also, sorting a vector can be expensive, if T is expensive to swap. - This is about finding a middle way between the worst cases of either approach. All in all, it will depend heavily on the types involved and the nature of the data - how often there are or aren't duplicates.
UncleBens
Adding all the elements to a vector and sorting it is generally faster than inserting elements and keeping sorted order...
fbrereto
I'd have to agree with WilhelmTell based off testing a worst-case environment where all the elements are already sorted. In this case it seems to work much _slower_ than the first example, when in theory it should be about the same speed.
Hooked
Or just create another vector of the same length as the original vector, but made up of T* elements. Then sort those via dereferencing.
dash-tom-bang
+1  A: 

You can use std::unique, but it requires the range to be sorted first:

template <class T>
bool is_unique(vector<T> X) {
  std::sort(X.begin(), X.end());
  return std::unique(X.begin(), X.end()) == X.end();
}

std::unique modifies the sequence and returns an iterator to the end of the unique set, so if that's still the end of the vector then it must be unique.

This runs in O(n log n), the same as your set example. I don't think you can theoretically guarantee to do it faster, although using a C++0x std::unordered_set instead of std::set would do it in expected linear time; but that requires that your elements be hashable as well as having operator== defined, which might not be so easy.

Also, if you're not modifying the vector in your examples, you'd improve performance by passing it by const reference, so you don't make an unnecessary copy of it.
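A sketch of that unordered variant (function name made up; it uses C++11's std::unordered_set, and T must be hashable via std::hash<T> or a user-supplied hasher in addition to supporting operator==):

```cpp
#include <unordered_set>
#include <vector>

// Sketch: expected O(N) time and O(N) extra space, assuming a decent hash.
template <class T>
bool is_unique_hashed(const std::vector<T>& x) {
    std::unordered_set<T> seen;
    seen.reserve(x.size());  // one rehash up front instead of many
    for (typename std::vector<T>::const_iterator it = x.begin();
         it != x.end(); ++it) {
        if (!seen.insert(*it).second)
            return false;  // insertion failed, so *it is a duplicate
    }
    return true;
}
```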

Peter
+9  A: 

Your first example should be O(N log N) as set takes log N time for each insertion. I don't think a faster O is possible.

The second example is obviously O(N^2). The coefficient and memory usage are low, so it might be faster (or even the fastest) in some cases.

It depends what T is, but for generic performance, I'd recommend sorting a vector of pointers to the objects.

template< class T >
bool dereference_less( T const *l, T const *r )
 { return *l < *r; } 

template <class T>
bool is_unique(vector<T> const &x) {
    vector< T const * > vp;
    vp.reserve( x.size() );
    for ( size_t i = 0; i < x.size(); ++ i ) vp.push_back( &x[i] );
    sort( vp.begin(), vp.end(), ptr_fun( &dereference_less<T> ) ); // O(N log N)
    return adjacent_find( vp.begin(), vp.end(),
           not2( ptr_fun( &dereference_less<T> ) ) ) // "opposite functor"
        == vp.end(); // true iff no adjacent pair has !(*vp_n < *vp_n+1), i.e. no equal neighbors
}

or in STL style,

template <class I>
bool is_unique(I first, I last) {
    typedef typename iterator_traits<I>::value_type T;
    …

And if you can reorder the original vector, of course,

template <class T>
bool is_unique(vector<T> &x) {
    sort( x.begin(), x.end() ); // O(N log N)
    return adjacent_find( x.begin(), x.end() ) == x.end();
}
Potatoswatter
Putting things in as pointers is not a bad idea. This could be applied to either of the set based algorithms.
clahey
`std::adjacent_find` is a much better idea than `std::unique`.
James McNellis
Hooked
@Hooked: sorry, I really should've tested it. There was a const correctness error (missing a few consts rather than having one extra) and an issue with the functor being incompatible with `not2`… here I switched to the `ptr_fun` functor generator.
Potatoswatter
The "STL-style" prototype is more like the way the functions in `<algorithm>` are written. It takes a pair of iterators rather than a container. Just substitute in those three lines and you should be good to go. (But I didn't test it ;v) .)
Potatoswatter
Faster Big-O can be achieved by using a hash-table based `set` implementation, at the cost of some space overhead. O(1) inserts for n elements == O(n)
tzaman
Thank you for the fix - I've learned a lot from this post!
Hooked
@tzaman: correct, I still forget about hashtables! However the question specifies `operator<` rather than a hash function.
Potatoswatter
+1  A: 

Well, your first one should only take O(N log N), so it's clearly the better worst-case scenario for this application.

However, you should be able to get a better best case if you check as you add things to the set:

template <class T>
bool is_unique3(vector<T> X) {
  set<T> Y;
  typename vector<T>::const_iterator i;
  for(i=X.begin(); i!=X.end(); ++i) {
    if (Y.find(*i) != Y.end()) {
      return false;
    }
    Y.insert(*i);
  }
  return true;
}

This should have O(1) best case, O(N log(N)) worst case, and average case depends on the distribution of the inputs.

clahey
+2  A: 

You must sort the vector if you want to quickly determine if it has only unique elements. Otherwise the best you can do is O(n^2) runtime or O(n log n) runtime with O(n) space. I think it's best to write a function that assumes the input is sorted.

template<class Fwd>
bool is_unique(Fwd first, Fwd last)
{
    return adjacent_find(first, last) == last;
}

then have the client sort the vector, or make a sorted copy of the vector. This will open a door for dynamic programming. That is, if the client sorted the vector in the past, then they have the option to keep and refer to that sorted vector so they can repeat this operation in O(n) time.
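A caller-side sketch of that contract (the helper function here restates the answer's interface with std::adjacent_find spelled out, and the wrapper name is made up; the sorted copy could be cached and reused across later calls):

```cpp
#include <algorithm>
#include <vector>

// The answer's interface: assumes [first, last) is already sorted.
template <class Fwd>
bool is_unique(Fwd first, Fwd last) {
    return std::adjacent_find(first, last) == last;
}

// Caller-side sketch: sort a copy once; each later check is then O(n).
template <class T>
bool check_unique(const std::vector<T>& x) {
    std::vector<T> sorted_copy(x);
    std::sort(sorted_copy.begin(), sorted_copy.end());
    return is_unique(sorted_copy.begin(), sorted_copy.end());
}
```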

wilhelmtell
+1: better than std::unique. Also an implementation in the spirit of STL.
UncleBens
+1  A: 

If the type T you store in your vector is large and copying it is costly, consider creating a vector of pointers or iterators to your vector's elements. Sort it based on the elements pointed to and then check for uniqueness.

You can also use the std::set for that. The template looks like this

template <class Key,class Traits=less<Key>,class Allocator=allocator<Key> > class set

I think you can provide an appropriate Traits parameter and insert raw pointers for speed, or implement a simple wrapper class for pointers with operator< defined.

Don't use the constructor for inserting into the set. Use the insert method. One of its overloads has the signature

pair <iterator, bool> insert(const value_type& _Val);

By checking the result (the second member) you can often detect a duplicate much sooner than if you had inserted all the elements.
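A sketch combining both suggestions (the comparator and function names are hypothetical): a std::set of raw pointers ordered by the pointed-to values, bailing out as soon as insert reports a duplicate:

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical comparator: orders pointers by the values they point to.
template <class T>
struct deref_less {
    bool operator()(const T* l, const T* r) const { return *l < *r; }
};

// Avoids copying T: the set stores pointers into the original vector.
template <class T>
bool is_unique_ptrs(const std::vector<T>& x) {
    std::set<const T*, deref_less<T> > seen;
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (!seen.insert(&x[i]).second)
            return false;  // insert's second member is false on a duplicate
    }
    return true;
}
```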

Maciej Hehl
A: 

Using the current C++ standard containers, you have a good solution in your first example. But if you can use a hash container, you might be able to do better, as the hash set will be n*O(1) instead of n*O(log n) for a standard set. Of course everything will depend on the size of n and your particular library implementation.

Mark Ransom
A hashmap will give you big-theta of 1 and O(n^2).
wilhelmtell
@wilhelmtell: That... does not sound right. Care to share your math? Inserting into a hashmap is supposed to be O(n) amortized. Assuming your hashmap has a way to detect collisions, you should know by the time you inserted the last element whether there's a collision. The only way I can think of to make it O (N^2) is if you assumed the vector checked for collisions on every insert (which I don't think was part of the question) and then only if it threw away the map after each update to the vector.
Jason
I have no clue what that means in my comment. That's O(n), and I blame the cat for that typo.
wilhelmtell
A: 

In the (very) special case of discrete values with a known, not-too-big maximum value N, you should be able to do a bucket sort and simply check that the number of values in each bucket is below 2.

bool is_unique(const vector<int>& X, int N)
{
  vector<int> buckets(N,0);
  vector<int>::const_iterator i;
  for(i = X.begin(); i != X.end(); ++i)
    if(++buckets[*i] > 1)
      return false;
  return true;
}

The complexity of this would be O(n).

Ugo
+1  A: 

If I may add my own 2 cents.

First of all, as @Potatoswatter remarked, unless your elements are cheap to copy (built-in/small PODs) you'll want to use pointers to the original elements rather than copying them.

Second, there are 2 strategies available.

  1. Simply ensure there is no duplicate inserted in the first place. This means, of course, controlling the insertion, which is generally achieved by creating a dedicated class (with the vector as attribute).
  2. Whenever the property is needed, check for duplicates.

I must admit I would lean toward the first. Encapsulation, clear separation of responsibilities and all that.

Anyway, there are a number of ways depending on the requirements. The first question is:

  • do we have to leave the elements of the vector in a particular order, or can we "mess" with them?

If we can mess with them, I would suggest keeping the vector sorted: Loki::AssocVector should get you started. If not, then we need to keep an index on the structure to ensure this property... wait a minute: Boost.MultiIndex to the rescue ?

Thirdly: as you remarked yourself, a simple doubly-nested linear search yields O(N^2) complexity on average, which is no good.

If < is already defined, then sorting is obvious, with its O(N log N) complexity. It might also be worth making T hashable, because a std::tr1::unordered_set could yield a better time (I know, you need a RandomAccessIterator, but if T is Hashable then it's easy to make T* Hashable too ;) )

But in the end the real issue here is that our advice is necessarily generic, because we lack data.

  • What is T? Do you intend the algorithm to be generic?
  • What is the number of elements? 10, 100, 10,000, 1,000,000? Because asymptotic complexity is kind of moot when dealing with a few hundred...
  • And of course: can you ensure uniqueness at insertion time? Can you modify the vector itself?
Matthieu M.