views:

1181

answers:

7

I have a vector containing few non-adjacent duplicates.

As a simple example, consider:

2 1 6 1 4 6 2 1 1

I am trying to make this vector unique by removing the non-adjacent duplicates and maintaining the order of elements.

Result would be:

2 1 6 4

The solutions I tried are:

  1. Inserting into a std::set but the problem with this approach is that it will disturb the order of elements.
  2. Use the combination of std::sort and std::unique. But again same order problem.
  3. Manual duplicate elimination:

        Define a temporary vector TempVector.
        for (each element in a vector)
        {
            if (the element does not exists in TempVector)
            {
             add to TempVector;
            }
        }
        swap orginial vector with TempVector.
    

My question is:

Is there any STL algorithm which can remove the non-adjacent duplicates from the vector ? what is its complexity?

+4  A: 

I think you would do it like this:

I would use two iterators on the vector :

The first of one reads the data and inserts it a temporary set.

When the read data was not in the set you copy it from the first iterator to the second and increment it.

At the end you keep only the data up to the second iterator.

The complexity is O( n .log( n ) ) as the lookup for duplicated elements uses the set, not the vector.

#include <vector>
#include <set>
#include <iostream>

int main(int argc, char* argv[])
{
    std::vector< int > k ;

    k.push_back( 2 );
    k.push_back( 1 );
    k.push_back( 6 );
    k.push_back( 1 );
    k.push_back( 4 );
    k.push_back( 6 );
    k.push_back( 2 );
    k.push_back( 1 );
    k.push_back( 1 );

{
 std::vector< int >::iterator r , w ;

 std::set< int > tmpset ;

 for( r = k.begin() , w = k.begin() ; r != k.end() ; ++r )
 {
  if( tmpset.insert( *r ).second )
  {
   *w++ = *r ;
  }
 }

 k.erase( w , k.end() );
}


    {
     std::vector< int >::iterator r ;

     for( r = k.begin() ; r != k.end() ; ++r )
     {
      std::cout << *r << std::endl ;
     }
    }
}
fa.
Using `find` and then `insert` is inefficient. `tmpset.insert(*r).second` will be true if the value was inserted, and false if the value was already in the set.
cjm
I forgot about that, I correct it
fa.
It's about time that we're allowed to write `std::vector<int> k( {1,2,3} );`...
xtofl
Other than that (now corrected) thinko, I like this algorithm.
cjm
@xtofl try boost::assign ;)
AraK
This is a reasonable space/speed tradeoff. An alternative may be an unordered_set (hash) to keep track of values seen before, which would be O(N). But would it hurt to spell out `read` and `write` for `r` and `w` ? This code is simple enough that good names would make comments redundant. Also, `r` should be a `const_iterator`; it doesn't modify `k`
MSalters
A: 

My question is:

Is there any STL algorithm which can remove the non-adjacent duplicates from the vector ? what is its complexity?

The STL options are the ones you mentioned: put the items in a std::set, or call std::sort, std::unique and std::erase. Neither of these fulfills your requirement of "removing the non-adjacent duplicates and maintaining the order of elements."

So why doesn't the STL offer some other option? No standard library will offer everything for every user's needs. The STL's design considerations include "be fast enough for nearly all users," "be useful for nearly all users," and "provide exception safety as much as possible" (and "be small enough for the Standards Committee" as the library Stepanov originally wrote was much larger, and Stroustrup axed out something like 2/3 of it).

The simplest solution I can think of would look like this:

// Note:  an STL-like method would be templatized and use iterators instead of
// hardcoding std::vector<int>
std::vector<int> stable_unique(std::vector<int> input)
{
    std::vector<int> result;
    result.reserve(input.size());
    for (std::vector<int>::iterator itor = input.begin();
                                    itor != input.end();
                                    ++itor)
        if (std::find(result.begin(), result.end(), *itor) == result.end())
            result.push_back(*itor);
        return result;
}

This solution should be O(M*N) where M is the number of unique elements and N is the total number of elements.

Max Lybbert
MSalters
+2  A: 

There's no STL algorithm doing what you want preserving the sequence's original order.

You could create a std::set of iterators or indexes into the vector, with a comparison predicate that uses the referenced data rather than the iterators/indexes to sort stuff. Then you delete everything from the vector that isn't referenced in the set. (Of course, you could just as well use another std::vector of iterators/indexes, std::sort and std::unique that, and use this as a reference as to what to keep.)

sbi
+1  A: 

As far as i know there is none in stl. Look up reference.

Yelonek
+4  A: 

You can remove some of the loops in fa's answer using remove_copy_if:

class NotSeen : public std::unary_function <int, bool>
{
public:
  NotSeen (std::set<int> & seen) : m_seen (seen) { }

  bool operator ()(int i) const  {
    return (m_seen.insert (i).second);
  }

private:
  std::set<int> & m_seen;
};

void removeDups (std::vector<int> const & iv, std::vector<int> & ov) {
  std::set<int> seen;
  std::remove_copy_if (iv.begin ()
      , iv.end ()
      , std::back_inserter (ov)
      , NotSeen (seen));
}

This has no affect on the complexity of the algorithm (ie. as written it's also O(n log n)). You can improve upon this using unordered_set, or if the range of your values is small enough you could simply use an array or a bitarray.

Richard Corden
Is U_STD a macro from before you knew for sure what the std:: namespace was going to be called?
Steve Jessop
@onebyone: Almost! It's a macro that we really don't need to use anymore and goes back to when we used the older g++ compilers ( <= 2.95.3). Force of habit has me writing it everywhere still!
Richard Corden
A: 

Without using a temporary set it's possible to do this with (possibly) some loss of performance:

template<class Iterator>
Iterator Unique(Iterator first, Iterator last)
{
    while (first != last)
    {
        Iterator next(first);
        last = std::remove(++next, last, *first);
        first = next;
    }

    return last;
}

used as in:

vec.erase( Unique( vec.begin(), vec.end() ), vec.end() );

For smaller data sets, the implementation simplicity and lack of extra allocation required may offset the theoretical higher complexity of using an additional set. Measurement with a representative input is the only way to be sure, though.

Charles Bailey
Excellent solution. I would suggest the name Unique is not a revealing as I'd like. How about RemoveNonUnique?
Andy Balaam
+1  A: 

There is a nice article by John Torjo which deals with this very question in a systematic way. The result he comes up with seems more generic and more efficient than any of the solutions suggested here so far:

http://www.builderau.com.au/program/java/soa/C-Removing-duplicates-from-a-range/0,339024620,320271583,00.htm

http://articles.techrepublic.com.com/5100-10878_11-1052159.html

Unfortunately, the complete code of John's solution seems to be no longer available, and John did not respond to may email. Therefore, I wrote my own code which is based on similar grounds like his, but intentionally differs in some details. Feel free to contact me (vschoech think-cell com) and discuss the details if you wish.

To make the code compile for you, I added some of my own library stuff which I use regularly. Also, instead of going with plain stl, I use boost a lot to create more generic, more efficient, and more readable code.

Have fun!

#include <vector>
#include <functional>

#include <boost/bind.hpp>
#include <boost/range.hpp>
#include <boost/iterator/counting_iterator.hpp>

/////////////////////////////////////////////////////////////////////////////////////////////
// library stuff

template< class Rng, class Func >
Func for_each( Rng& rng, Func f ) {
    return std::for_each( boost::begin(rng), boost::end(rng), f );
};

template< class Rng, class Pred >
Rng& sort( Rng& rng, Pred pred ) {
    std::sort( boost::begin( rng ), boost::end( rng ), pred );
    return rng; // to allow function chaining, similar to operator+= et al.
}

template< class T >
boost::iterator_range< boost::counting_iterator<T> > make_counting_range( T const& tBegin, T const& tEnd ) {
    return boost::iterator_range< boost::counting_iterator<T> >( tBegin, tEnd );
}

template< class Func >
class compare_less_impl {
private:
    Func m_func;
public:
    typedef bool result_type;
    compare_less_impl( Func func ) 
    :   m_func( func )
    {}
    template< class T1, class T2 > bool operator()( T1 const& tLeft, T2 const& tRight ) const {
        return m_func( tLeft ) < m_func( tRight );
    }
};

template< class Func >
compare_less_impl<Func> compare_less( Func func ) {
    return compare_less_impl<Func>( func );
}


/////////////////////////////////////////////////////////////////////////////////////////////
// stable_unique

template<class forward_iterator, class predicate_type>
forward_iterator stable_unique(forward_iterator itBegin, forward_iterator itEnd, predicate_type predLess) {
    typedef std::iterator_traits<forward_iterator>::difference_type index_type;
    struct SIteratorIndex {
        SIteratorIndex(forward_iterator itValue, index_type idx) : m_itValue(itValue), m_idx(idx) {}
        std::iterator_traits<forward_iterator>::reference Value() const {return *m_itValue;}
        index_type m_idx;
    private:
        forward_iterator m_itValue;
    };

    // {1} create array of values (represented by iterators) and indices
    std::vector<SIteratorIndex> vecitidx;
    vecitidx.reserve( std::distance(itBegin, itEnd) );
    struct FPushBackIteratorIndex {
        FPushBackIteratorIndex(std::vector<SIteratorIndex>& vecitidx) : m_vecitidx(vecitidx) {}
        void operator()(forward_iterator itValue) const {
            m_vecitidx.push_back( SIteratorIndex(itValue, m_vecitidx.size()) );
        }
    private:
        std::vector<SIteratorIndex>& m_vecitidx;
    };
    for_each( make_counting_range(itBegin, itEnd), FPushBackIteratorIndex(vecitidx) );

    // {2} sort by underlying value
    struct FStableCompareByValue {
        FStableCompareByValue(predicate_type predLess) : m_predLess(predLess) {}
        bool operator()(SIteratorIndex const& itidxA, SIteratorIndex const& itidxB) {
            return m_predLess(itidxA.Value(), itidxB.Value())
                // stable sort order, index is secondary criterion
                || !m_predLess(itidxB.Value(), itidxA.Value()) && itidxA.m_idx < itidxB.m_idx;
        }
    private:
        predicate_type m_predLess;
    };
    sort( vecitidx, FStableCompareByValue(predLess) );

    // {3} apply std::unique to the sorted vector, removing duplicate values
    vecitidx.erase(
        std::unique( vecitidx.begin(), vecitidx.end(),
            !boost::bind( predLess,
                // redundand boost::mem_fn required to compile
                boost::bind(boost::mem_fn(&SIteratorIndex::Value), _1),
                boost::bind(boost::mem_fn(&SIteratorIndex::Value), _2)
            )
        ),
        vecitidx.end()
    );

    // {4} re-sort by index to match original order
    sort( vecitidx, compare_less(boost::mem_fn(&SIteratorIndex::m_idx)) );

    // {5} keep only those values in the original range that were not removed by std::unique
    std::vector<SIteratorIndex>::iterator ititidx = vecitidx.begin();
    forward_iterator itSrc = itBegin;
    index_type idx = 0;
    for(;;) {
        if( ititidx==vecitidx.end() ) {
            // {6} return end of unique range
            return itSrc;
        }
        if( idx!=ititidx->m_idx ) {
            // original range must be modified
            break;
        }
        ++ititidx;
        ++idx;
        ++itSrc;
    }
    forward_iterator itDst = itSrc;
    do {
        ++idx;
        ++itSrc;
        // while there are still items in vecitidx, there must also be corresponding items in the original range
        if( idx==ititidx->m_idx ) {
            std::swap( *itDst, *itSrc ); // C++0x move
            ++ititidx;
            ++itDst;
        }
    } while( ititidx!=vecitidx.end() );

    // {6} return end of unique range
    return itDst;
}

template<class forward_iterator>
forward_iterator stable_unique(forward_iterator itBegin, forward_iterator itEnd) {
    return stable_unique( itBegin, itEnd, std::less< std::iterator_traits<forward_iterator>::value_type >() );
}

void stable_unique_test() {
    std::vector<int> vecn;
    vecn.push_back(1);
    vecn.push_back(17);
    vecn.push_back(-100);
    vecn.push_back(17);
    vecn.push_back(1);
    vecn.push_back(17);
    vecn.push_back(53);
    vecn.erase( stable_unique(vecn.begin(), vecn.end()), vecn.end() );
    // result: 1, 17, -100, 53
}
vschoech
Interesting but... sorting means O(N*log N) while O(N) solution based on Hash Set / Bloom Filters exist.
Matthieu M.