ansaurus

Question

Determining if an unordered vector<T> has all unique elements

Answer 1

+5 A:

The standard library has std::unique, but that would require you to make a copy of the entire container (note that in both of your examples you make a copy of the entire vector as well, since you unnecessarily pass the vector by value).

template <typename T>
bool is_unique(std::vector<T> vec)
{
    std::sort(vec.begin(), vec.end());
    return std::unique(vec.begin(), vec.end()) == vec.end();
}

Whether this would be faster than using a std::set would, as you know, depend :-).

James McNellis 2010-05-04 21:47:08

`unique` removes only consecutive duplicates, so this would only work if the vector were sorted.

Fred Larson 2010-05-04 21:52:05

@Fred: That's why I sort the copy of the vector.

James McNellis 2010-05-04 21:53:00

Asymptotically speaking, this has the same space and time requirements as the `is_unique` in the question. They are both O(n) in space and O(n log n) in time. That is to say, their run times are dominated by sorting (explicit sorting in your example, and the sorting internal to `std::set` in the OP's). My suggestion would be to try both and pick whichever happens to be faster in practice.

Tyler McHenry 2010-05-04 21:53:46

I must be getting old. Why couldn't I see that? You didn't slip in an edit just under the 5 min. mark, did you?

Fred Larson 2010-05-04 21:55:23

@Tyler actually, no, worse, this is O(n2) runtime and O(n) space. The _worst case_ of `std::sort()` is O(n2).

wilhelmtell 2010-05-04 22:04:21

@WilhelmTell: _Technically_ `sort()` has no worst-case requirement at all; it does have the average-case n log n requirement, and if you really thought you'd run into perverse cases where you're not going to get real-world n log n, you can use `stable_sort()`, which does have an n log n upper bound requirement.

James McNellis 2010-05-04 22:07:55

@James correct: 25.3.1.1 paragraph 3. The only guarantee is that the average is N*log N. The cplusplus.com website mislead me! http://www.cplusplus.com/reference/algorithm/sort/ says worst case of sort is O(n^2)

wilhelmtell 2010-05-04 22:12:13

@WilhelmTell of Purple-Magenta: That's because `std::sort` can use something like introsort which has O(n^2) complexity, while `std::stable_sort` is usually implemented using a form of merge sort.

Billy ONeal 2010-05-04 22:53:22

@WilhelmTell: I've come to the conclusion that cplusplus.com is of pretty poor quality. I've reported a half dozen errors or so over the last few months and none of them have been corrected.

James McNellis 2010-05-04 23:18:42

This is stupid fast for a simple type <T>==<int>, far faster then any other example (the copying is so cheap). In general this may not be the case for me, but I'll have to keep this mind for future use!

Hooked 2010-05-04 23:33:51

@James Seriously? Making a mistake is not nearly as bad as not correcting the mistake! Do they reply at least?

wilhelmtell 2010-05-05 00:04:01

@WilhelmTell: I got no response at all, which is what irritated me. I don't mind reporting bugs and explaining why they need to be corrected, but if they're not going to fix them... eh. That said, I still use cplusplus.com when I need a quick lookup on something :-P

James McNellis 2010-05-05 01:23:49

@James truth, same here. Even if Google works by voting with the mouse or the HTML. :s

wilhelmtell 2010-05-05 01:28:57

Answer 2

+5 A:

Is it infeasible to just use a container that provides this "guarantee" from the get-go? Would it be useful to flag a duplicate at the time of insertion rather than at some point in the future? When I've wanted to do something like this, that's the direction I've gone; just using the set as the "primary" container, and maybe building a parallel vector if I needed to maintain the original order, but of course that makes some assumptions about memory and CPU availability...

dash-tom-bang 2010-05-04 21:48:25

Answer 3

+5 A:

For one thing you could combine the advantages of both: stop building the set, if you have already discovered a duplicate:

template <class T>
bool is_unique(const std::vector<T>& vec)
{
    std::set<T> test;
    for (typename std::vector<T>::const_iterator it = vec.begin(); it != vec.end(); ++it) {
        if (!test.insert(*it).second) {
            return false;
        }
    }
    return true;
}

BTW, Potatoswatter makes a good point that in the generic case you might want to avoid copying T, in which case you might use a std::set<const T*, dereference_less> instead.

You could of course potentially do much better if it wasn't generic. E.g if you had a vector of integers of known range, you could just mark in an array (or even bitset) if an element exists.

UncleBens 2010-05-04 21:50:02

But that's still very expensive because `set` uses dynamic allocations. You're essentially building a `set` when you don't need it. So the solution is correct mathematically, but expensive in practice.

wilhelmtell 2010-05-04 22:01:09

@WilhelmTell: Yes, but when you start by sorting the vector, you are gonna have to sort it all, which falls under the same worst case scenario as OP's #1. Also, sorting a vector can be expensive, if T is expensive to swap. - This is about finding a middle way between the worst cases of either approach. All in all, it will depend heavily on the types involved and the nature of the data - how often there are or aren't duplicates.

UncleBens 2010-05-04 22:06:30

Adding all the elements to a vector and sorting it is generally faster than inserting elements and keeping sorted order...

fbrereto 2010-05-04 22:18:53

I'd have to agree with WilhelmTell based off testing a worst case environment where all the elements are already sorted. In this case it seems to work much _slower_ then the first example, when in theory, it should be about the same speed.

Hooked 2010-05-04 22:19:49

Or just create another vector of the same length as the original vector, but made up of T* elements. Then sort those via dereferencing.

dash-tom-bang 2010-05-04 23:00:50

Answer 4

+1 A:

You can use std::unique, but it requires the range to be sorted first:

template <class T>
bool is_unique(vector<T> X) {
  std::sort(X.begin(), X.end());
  return std::unique(X.begin(), X.end()) == X.end();
}

std::unique modifies the sequence and returns an iterator to the end of the unique set, so if that's still the end of the vector then it must be unique.

This runs in nlog(n); the same as your set example. I don't think you can theoretically guarantee to do it faster, although using a C++0x std::unordered_set instead of std::set would do it in expected linear time - but that requires that your elements be hashable as well as having operator == defined, which might not be so easy.

Also, if you're not modifying the vector in your examples, you'd improve performance by passing it by const reference, so you don't make an unnecessary copy of it.

Peter 2010-05-04 21:51:22

Answer 5

+9 A:

Your first example should be O(N log N) as set takes log N time for each insertion. I don't think a faster O is possible.

The second example is obviously O(N^2). The coefficient and memory usage are low, so it might be faster (or even the fastest) in some cases.

It depends what T is, but for generic performance, I'd recommend sorting a vector of pointers to the objects.

template< class T >
bool dereference_less( T const *l, T const *r )
 { return *l < *r; } 

template <class T>
bool is_unique(vector<T> const &x) {
    vector< T const * > vp;
    vp.reserve( x.size() );
    for ( size_t i = 0; i < x.size(); ++ i ) vp.push_back( &x[i] );
    sort( vp.begin(), vp.end(), ptr_fun( &dereference_less<T> ) ); // O(N log N)
    return adjacent_find( vp.begin(), vp.end(),
           not2( ptr_fun( &dereference_less<T> ) ) ) // "opposite functor"
        == vp.end(); // if no adjacent pair (vp_n,vp_n+1) has *vp_n < *vp_n+1
}

or in STL style,

template <class I>
bool is_unique(I first, I last) {
    typedef typename iterator_traits<I>::value_type T;
    …

And if you can reorder the original vector, of course,

template <class T>
bool is_unique(vector<T> &x) {
    sort( x.begin(), x.end() ); // O(N log N)
    return adjacent_find( x.begin(), x.end() ) == x.end();
}

Potatoswatter 2010-05-04 21:51:23

Putting things in as pointers is not a bad idea. This could be applied to either of the set based algorithms.

clahey 2010-05-04 21:57:35

`std::ajacent_find` is a much better idea than `std::unique`.

James McNellis 2010-05-04 22:21:56

Hooked 2010-05-04 22:24:01

Hooked 2010-05-04 23:00:17

@Hooked: sorry, I really should've tested it. There was a const correctness error (missing a few consts rather than having one extra) and an issue with the functor being incompatible with `not2`… here I switched to the `ptr_fun` functor generator.

Potatoswatter 2010-05-04 23:18:56

The "STL-style" prototype is more like the way the functions in `<algorithm>` are written. It takes a pair of iterators rather than a container. Just substitute in those three lines and you should be good to go. (But I didn't test it ;v) .)

Potatoswatter 2010-05-04 23:21:26

Faster Big-O can be achieved by using a hash-table based `set` implementation, at the cost of some space overhead. O(1) inserts for n elements == O(n)

tzaman 2010-05-04 23:25:24

Thank you for the fix - I've learned a lot from this post!

Hooked 2010-05-04 23:29:42

@tzaman: correct, I still forget about hashtables! However the question specifies `operator<` rather than a hash function.

Potatoswatter 2010-05-04 23:37:06

Answer 6

+1 A:

Well, your first one should only take N log(N), so it's clearly the better worse case scenario for this application.

However, you should be able to get a better best case if you check as you add things to the set:

template <class T>
bool is_unique3(vector<T> X) {
  set<T> Y;
  typename vector<T>::const_iterator i;
  for(i=X.begin(); i!=X.end(); ++i) {
    if (Y.find(*i) != Y.end()) {
      return false;
    }
    Y.insert(*i);
  }
  return true;
}

This should have O(1) best case, O(N log(N)) worst case, and average case depends on the distribution of the inputs.

clahey 2010-05-04 21:51:44

Answer 7

+2 A:

You must sort the vector if you want to quickly determine if it has only unique elements. Otherwise the best you can do is O(n^2) runtime or O(n log n) runtime with O(n) space. I think it's best to write a function that assumes the input is sorted.

template<class Fwd>
bool is_unique(In first, In last)
{
    return asjacent_find(first, last) == last;
}

then have the client sort the vector, or a make a sorted copy of the vector. This will open a door for dynamic programming. That is, if the client sorted the vector in the past then they have the option to keep and refer to that sorted vector so they can repeat this operation for O(n) runtime.

wilhelmtell 2010-05-04 21:53:26

+1: better than std::unique. Also an implementation in the spirit of STL.

UncleBens 2010-05-04 22:10:11

Answer 8

+1 A:

If the type T You store in Your vector is large and copying it is costly, consider creating a vector of pointers or iterators to Your vector elements. Sort it based on the element pointed to and then check for uniqueness.

You can also use the std::set for that. The template looks like this

template <class Key,class Traits=less<Key>,class Allocator=allocator<Key> > class set

I think You can provide appropriate Traits parameter and insert raw pointers for speed or implement a simple wrapper class for pointers with < operator.

Don't use the constructor for inserting into the set. Use insert method. The method (one of overloads) has a signature

pair <iterator, bool> insert(const value_type& _Val);

By checking the result (second member) You can often detect the duplicate much quicker, than if You inserted all elements.

Maciej Hehl 2010-05-04 21:56:44

Answer 9

A:

Using the current C++ standard containers, you have a good solution in your first example. But if you can use a hash container, you might be able to do better, as the hash set will be n*O(1) instead of n*O(log n) for a standard set. Of course everything will depend on the size of n and your particular library implementation.

Mark Ransom 2010-05-04 22:01:38

A hashmap will give you big-theta of 1 and O(n^2).

wilhelmtell 2010-05-04 22:07:49

@wilhelmtell: That... does not sound right. Care to share your math? Inserting into a hashmap is supposed to be O(n) amortized. Assuming your hashmap has a way to detect collisions, you should know by the time you inserted the last element whether there's a collision. The only way I can think of to make it O (N^2) is if you assumed the vector checked for collisions on every insert (which I don't think was part of the question) and then only if it threw away the map after each update to the vector.

Jason 2010-05-05 00:48:51

I have no clue what that means in my comment. That's O(n), and I blame the cat for that typo.

wilhelmtell 2010-05-05 01:14:36

Answer 10

A:

In the (very) special case of sorting discrete values with a known, not too big, maximum value N.
You should be able to start a bucket sort and simply check that the number of values in each bucket is below 2.

bool is_unique(const vector<int>& X, int N)
{
  vector<int> buckets(N,0);
  typename vector<int>::const_iterator i;
  for(i = X.begin(); i != X.end(); ++i)
    if(++buckets[*i] > 1)
      return false;
  return true;
}

The complexity of this would be O(n).

Ugo 2010-05-04 22:18:45

Answer 11

+1 A:

If I may add my own 2 cents.

First of all, as @Potatoswatter remarked, unless your elements are cheap to copy (built-in/small PODs) you'll want to use pointers to the original elements rather than copying them.

Second, there are 2 strategies available.

Simply ensure there is no duplicate inserted in the first place. This means, of course, controlling the insertion, which is generally achieved by creating a dedicated class (with the vector as attribute).
Whenever the property is needed, check for duplicates

I must admit I would lean toward the first. Encapsulation, clear separation of responsibilities and all that.

Anyway, there are a number of ways depending on the requirements. The first question is:

do we have to let the elements in the vector in a particular order or can we "mess" with them ?

If we can mess with them, I would suggest keeping the vector sorted: Loki::AssocVector should get you started. If not, then we need to keep an index on the structure to ensure this property... wait a minute: Boost.MultiIndex to the rescue ?

Thirdly: as you remarked yourself a simple linear search doubled yield a O(N²) complexity in average which is no good.

If < is already defined, then sorting is obvious, with its O(N log N) complexity. It might also be worth it to make T Hashable, because a std::tr1::hash_set could yield a better time (I know, you need a RandomAccessIterator, but if T is Hashable then it's easy to have T* Hashable to ;) )

But in the end the real issue here is that our advises are necessary generic because we lack data.

What is T, do you intend the algorithm to be generic ?
What is the number of elements ? 10, 100, 10.000, 1.000.000 ? Because asymptotic complexity is kind of moot when dealing with a few hundreds....
And of course: can you ensure unicity at insertion time ? Can you modify the vector itself ?

Matthieu M. 2010-05-05 06:49:15

ansaurus

tags:

views:

answers:

Determining if an unordered vector<T> has all unique elements

related questions