Given an STL vector, output only the duplicates in sorted order, e.g.,

INPUT : { 4, 4, 1, 2, 3, 2, 3 }
OUTPUT: { 2, 3, 4 }

The algorithm is trivial, but the goal is to make it as efficient as std::unique(). My naive implementation below modifies the container in-place:

void not_unique(vector<int>* pv)
{
    if (!pv)
        return;

    // Sort (in-place) so we can find duplicates in linear time
    sort(pv->begin(), pv->end());

    vector<int>::iterator it_start = pv->begin();
    while (it_start != pv->end())
    {
        size_t nKeep = 0;

        // Find the next different element
        vector<int>::iterator it_stop = it_start + 1;
        while (it_stop != pv->end() && *it_start == *it_stop)
        {
            nKeep = 1; // This gets set redundantly
            ++it_stop;
        }

        // If the element is a duplicate, keep only the first one (nKeep=1).
        // Otherwise, the element is not duplicated so erase it (nKeep=0).
        it_start = pv->erase(it_start + nKeep, it_stop);
    }
}

If you can make this more efficient, elegant, or general, please let me know. For example, by using a custom sorting algorithm, or by copying elements in the second loop to eliminate the erase() call.

+2  A: 

My suggestion would be a modified insertion sort, so that you can sort & filter dupes at the same time.

James Curran
That would be ideal.
Marc Eaddy
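A minimal sketch of that idea (my illustration, not James Curran's code): maintain a separately built sorted vector, insertion-sort style, and record any value that is already present as a duplicate. It is quadratic in the worst case because of the vector inserts, so it illustrates the idea rather than guaranteeing a speed win.

#include <algorithm>
#include <cstddef>
#include <set>
#include <vector>

// Sort-and-filter in one pass over the unsorted input: "seen" is kept sorted
// (insertion-sort style), "dups" collects values inserted more than once.
std::vector<int> duplicates_via_insertion(const std::vector<int>& v)
{
    std::vector<int> seen;
    std::set<int> dups;

    for (std::size_t i = 0; i < v.size(); ++i)
    {
        std::vector<int>::iterator pos =
            std::lower_bound(seen.begin(), seen.end(), v[i]);
        if (pos != seen.end() && *pos == v[i])
            dups.insert(v[i]);        // already present: it is a duplicate
        else
            seen.insert(pos, v[i]);   // insert at its sorted position
    }
    return std::vector<int>(dups.begin(), dups.end());   // sorted duplicates
}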
+1  A: 

I think that from a big-O standpoint, you have it implemented about as well as it gets. The dominant cost is the sort, which is O(N log N). One possibility, though, would be to build up a new vector holding the duplicate entries rather than erasing the non-duplicates from the existing vector. However, this would only be better if the number of distinct duplicated values is small relative to the total number of entries.

Consider the extreme example. If the original array consisted of 1,000 entries with only one duplicate, then the output would be a vector with just one value. It might be a bit more efficient to create the new vector with one entry rather than deleting the other 999 entries from the original vector. I suspect, however, that in real world testing, the savings of that change could be difficult to measure.
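A minimal sketch of that copy-out variant (my illustration, not Mark's code): sort in place as before, then append one copy of each value that repeats to a separate output vector instead of erasing from the original.

#include <algorithm>
#include <vector>

void not_unique_copy(std::vector<int>* pv, std::vector<int>* pout)
{
    if (!pv || !pout)
        return;

    std::sort(pv->begin(), pv->end());

    for (std::vector<int>::iterator it = pv->begin(); it != pv->end(); )
    {
        // upper_bound finds the end of the run of values equal to *it
        std::vector<int>::iterator run_end = std::upper_bound(it, pv->end(), *it);
        if (run_end - it > 1)        // the value occurs more than once
            pout->push_back(*it);    // keep a single copy of it
        it = run_end;                // jump to the next distinct value
    }
}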

Edit: I was just thinking about this in terms of an "interview" question; in other words, this is not a terribly useful answer. But it would be possible to solve this in O(N) (linear time) as opposed to O(N log N), trading storage space for CPU. Create two "bit" arrays, initially cleared. Loop through your vector of integer values and look up each value in the first bit array. If the bit is not set, set it (to 1). If it is already set, set the corresponding bit in the second array (indicating a duplicate). After all vector entries are processed, scan through the second array and output the integers whose bits are set, i.e., the duplicates. The reason for using bit arrays is space efficiency: for 4-byte integers, the raw space required is 2 * 2^32 / 8 bytes (1 GiB). This could actually be turned into a usable algorithm by making the arrays sparse. The very pseudo pseudo-code would be something like this:

bitarray1[infinite_size];
bitarray2[infinite_size];

clear/zero bitarrays

// NOTE - do not need to sort the input
foreach value in original vector {
    if ( bitarray1[value] ) 
        // duplicate
        bitarray2[value] = 1
    bitarray1[value] = 1
}

// At this point, bitarray2 contains a 1 for all duplicate values.
// Scan it and create the new vector with the answer
for i = 0 to maxvalue
    if ( bitarray2[i] )
        print/save/keep i
Mark Wilkins
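A runnable C++ approximation of that idea (my sketch, not Mark's code), using std::vector<bool> for the bit arrays and assuming the values are non-negative and bounded by a known maximum; a sparse or hashed structure would be needed to cover the full 32-bit range he describes.

#include <cstddef>
#include <vector>

std::vector<int> duplicates_bitset(const std::vector<int>& v, int max_value)
{
    std::vector<bool> seen(max_value + 1, false);   // "bitarray1"
    std::vector<bool> dup(max_value + 1, false);    // "bitarray2"

    // NOTE - no need to sort the input
    for (std::size_t i = 0; i < v.size(); ++i)
    {
        if (seen[v[i]])
            dup[v[i]] = true;   // second (or later) occurrence
        seen[v[i]] = true;
    }

    std::vector<int> result;    // values come out already sorted by construction
    for (int value = 0; value <= max_value; ++value)
        if (dup[value])
            result.push_back(value);
    return result;
}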
+1  A: 

Calling "erase(it_start + keep, it_stop);" from within the while loop is going to result in copying the remaining elements over and over again.

I'd suggest swapping one copy of each duplicated element to the front of the vector, then erasing the remaining elements all at once.

// Counts the length of the run of equal values starting at curr.
// Precondition: curr != end.
int num_repeats(vector<int>::const_iterator curr, vector<int>::const_iterator end) {
  int same = *curr;
  int count = 0;
  while (curr != end && same == *curr) {
    ++curr;
    ++count;
  }
  return count;
}

// Sorts the vector, swaps one copy of each duplicated value to the front,
// then erases everything after the kept duplicates in a single call.
void dups(vector<int> *v) {
  sort(v->begin(), v->end());
  vector<int>::iterator current = v->begin();
  vector<int>::iterator end_of_dups = v->begin();
  while (current != v->end()) {
    int n = num_repeats(current, v->end());
    if (n > 1) {
      swap(*end_of_dups, *current);
      end_of_dups++;
    }
    current += n;
  }
  v->erase(end_of_dups, v->end());
}
Stephen
+5  A: 

I miserably failed with my first attempt, assuming that std::unique moves all the duplicates to the end of the range (it doesn't). Oops. Here's another attempt:

Here is an implementation of not_unique. It removes any element that appears only once in the sorted range and all but one copy of any element that appears more than once. The resulting range is therefore the unique list of elements that appear more than once.

The function has linear complexity and makes a single pass over the range (std::unique also has linear complexity). The iterator type must meet the requirements of a forward iterator. The end of the resulting range is returned.

template <typename It>
It not_unique(It first, It last)
{
    if (first == last) { return last; }

    It new_last = first;
    for (It current = first, next = ++first; next != last; ++current, ++next)
    {
        if (*current == *next)
        {
            // Keep exactly one copy of this duplicated value.
            if (current == new_last)
            {
                ++new_last;             // it is already in place
            }
            else
            {
                *new_last++ = *current; // copy it down into the result range
            }

            // Skip over the remainder of this run of equal values.
            while (next != last && *current == *next)
            {
                ++current;
                ++next;
            }
            if (next == last)
                return new_last;
        }
    }
    return new_last;
}
James McNellis
+1 I hope you don't mind, but I gave it the whole shebang in my answer. (That said, in what I was writing I had previous/current, not current/next, so I kept that. But otherwise you wrote the inner part.)
GMan
When the range is supposed to be sorted, I usually prefer to add an `is_sorted` check at the beginning (just in case...). It can be written quite easily using `adjacent_find` and the reversed binary predicate.
Matthieu M.
@Matthieu: It's a precondition of calling the function that the range is sorted (`std::unique` has the same precondition). I'd agree that a debug assertion would be useful for catching logic errors, though. @GMan: I don't mind at all. Looks nice.
James McNellis
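For reference, a possible shape of the `is_sorted` check Matthieu describes (my sketch; written C++98-style since `std::is_sorted` only arrived in C++11): the range is sorted if `adjacent_find` with the reversed predicate finds no out-of-order pair.

#include <algorithm>
#include <functional>
#include <iterator>

template <typename ForwardIterator>
bool is_sorted_range(ForwardIterator first, ForwardIterator last)
{
    // An out-of-order pair is one where the left element is greater than the right.
    return std::adjacent_find(first, last,
               std::greater<typename std::iterator_traits<ForwardIterator>::value_type>())
           == last;
}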
This doesn't work for { 4, 4, 1, 2, 3, 4, 2, 4, 3 }. Output should be { 2, 3, 4 } but is instead { 2, 3, 4, 4, 4 }.
Marc Eaddy
@Marc: Good catch. I shouldn't write code in the middle of the night :-). The comparison `*current != *new_last` was not valid, because `*new_last` will never be a valid part of the result range. This could be trivially corrected by comparing `*current != *(new_last - 1)`, but then we would require random access iterators. I've updated the algorithm to fix it so that it does work for forward iterators, but now it's a bit convoluted :-O. I'll likely have some time to take a look at it tonight and clean it up.
James McNellis
Please take a look at my solution, which I adapted from your original one, and let me know if it works.
Marc Eaddy
@Marc: The only downside of your adaptation is that it requires random-access iterators, so it's not particularly generic. I have to say I like Jan's answer better than my own. It's much more straightforward.
James McNellis
+3  A: 

This is in the style of the standard library. Credit for algorithm goes to James! (If you +1 me, you better +1 him, or else). All I did was make it standard library style:

#include <algorithm>
#include <functional>
#include <iostream>
#include <iterator>
#include <vector>

// other stuff (not for you)
template <typename T>
void print(const char* pMsg, const T& pContainer)
{
    std::cout << pMsg << "\n    ";
    std::copy(pContainer.begin(), pContainer.end(),
        std::ostream_iterator<typename T::value_type>(std::cout, " "));
    std::cout << std::endl;
}

template <typename T, size_t N>
T* endof(T (&pArray)[N])
{
    return &pArray[0] + N;
}

// not_unique functions (for you)
template <typename ForwardIterator, typename BinaryPredicate>
ForwardIterator not_unique(ForwardIterator pFirst, ForwardIterator pLast,
                           BinaryPredicate pPred)
{
    // correctly handle case where an empty range was given:
    if (pFirst == pLast) 
    { 
        return pLast; 
    }

    ForwardIterator result = pFirst;
    ForwardIterator previous = pFirst;

    for (++pFirst; pFirst != pLast; ++pFirst, ++previous)
    {
        // if equal to previous
        if (pPred(*pFirst, *previous))
        {
            if (previous == result)
            {
                // if we just bumped bump again
                ++result;
            }
            else if (!pPred(*previous, *result))
            {
                // if it needs to be copied, copy it
                *result = *previous;

                // bump
                ++result;
            }
        }
    }

    return result;
}

template <typename ForwardIterator>
ForwardIterator not_unique(ForwardIterator pFirst, ForwardIterator pLast)
{
    return not_unique(pFirst, pLast,
                std::equal_to<typename ForwardIterator::value_type>());
}


//test
int main()
{
    typedef std::vector<int> vec;

    int data[] = {1, 4, 7, 7, 2, 2, 2, 3, 9, 9, 5, 4, 2, 8};
    vec v(data, endof(data));

    // precondition
    std::sort(v.begin(), v.end());
    print("before", v);

    // duplicatify (it's a word now)
    vec::iterator iter = not_unique(v.begin(), v.end());
    print("after", v);

    // remove extra
    v.erase(iter, v.end());
    print("erased", v);
}
GMan
+1 for taking the trouble to making it STL compliant.
Matthieu M.
The only thing that bugs me about James's algorithm is that we do not check whether the range is actually sorted. However, by requiring that the binary predicate be the one used by the `sort` operation (instead of an equality predicate), we could achieve it.
Matthieu M.
@Matthieu: Thanks. And meh, it's a precondition. Just like in `unique`.
GMan
I added the guard clause to handle the case where `first == last` so that the semantics match those of `unique` for empty ranges. Other than that, it looks really good.
James McNellis
This doesn't work for { 4, 4, 1, 2, 3, 4, 2, 4, 3 }. Output should be { 2, 3, 4 } but is instead { 2, 3, 4, 4, 4 }.
Marc Eaddy
+3  A: 

You can even use mismatch, for extra points!
Btw: nice exercise.

/** Moves duplicates to front, returning end of duplicates range.
 *  Use a sorted range as input. */
template<class TIter>
TIter Duplicates(TIter begin, TIter end) {
    TIter dup = begin;
    for (TIter it = begin; it != end; ++it) {
        TIter next = it;
        ++next;
        TIter const miss = std::mismatch(next, end, it).second;
        if (miss != it) {
            *dup++ = *miss;
            it = miss;
        }
    }
    return dup;
}
Jan
+1  A: 

Another one:

template <typename T>
void keep_duplicates(vector<T>& v) 
{
    set<T> 
        u(v.begin(), v.end()), // unique 
        d; // duplicates
    for (size_t i = 0; i < v.size(); i++)
        if (u.find(v[i]) != u.end())
            u.erase(v[i]);
        else
            d.insert(v[i]);

    v = vector<T>(d.begin(), d.end());
}

cheers, AR

Alain Rist
Good solution, but not memory- or time-efficient when n is large (10B). In the case where all elements are unique, u is an identical copy of v (think of all the dynamic allocations!). Creating u is O(n log n) and the for loop is O(n log n).
Marc Eaddy
A: 

This fixes the bugs in James McNellis's original version. I also provide in-place and out-of-place versions.

// In-place version.  Uses less memory and works for more container
// types but is slower.
template <typename It>
It not_unique_inplace(It first, It last)
{
    if (first == last)
        return last;

    It new_last = first;
    for (It current = first, next = first + 1; next != last; ++current, ++next)
    {
        if (*current == *next && 
            (new_last == first || *current != *(new_last-1)))
            *new_last++ = *current;
    }
    return new_last;
}

// Out-of-place version. Fastest.
template <typename It, typename Container>
void not_unique(It first, It last, Container pout)
{
    if (first == last || !pout)
        return;

    for (It current = first, next = first + 1; next != last; ++current, ++next)
    {
        if (*current == *next && 
            (pout->empty() || *current != pout->back()))
            pout->push_back(*current);
    }
}
Marc Eaddy
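A quick usage sketch for the two versions above (my addition, using the question's sample data; both expect the input to be sorted first):

#include <algorithm>
#include <vector>

int main()
{
    int data[] = { 4, 4, 1, 2, 3, 2, 3 };
    std::vector<int> v(data, data + 7);
    std::sort(v.begin(), v.end());

    std::vector<int> dups;
    not_unique(v.begin(), v.end(), &dups);      // out-of-place: dups == { 2, 3, 4 }

    v.erase(not_unique_inplace(v.begin(), v.end()), v.end());
    // in-place: v == { 2, 3, 4 }
}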
A: 

What is meant by "as efficient as std::unique"? Efficient in terms of runtime, development time, memory usage, or what?

As others pointed out, std::unique requires sorted input, which you haven't provided, so it's not a fair test to begin with.

Personally I would just have a std::map do all of my work for me. It has a lot of properties we can use for maximal elegance/brevity. It keeps its elements sorted already, and operator[] will insert a zero value if the key doesn't already exist. By leveraging those properties, we can get this done in two or three lines of code, and still achieve reasonable runtime complexity.

Basically, my algorithm is this: For each element in the vector, increment by one the map entry keyed by the value of that element. Afterwards, simply walk the map, outputting any key whose value is more than 1. Couldn't be simpler.

#include <iostream>
#include <vector>
#include <map>

void
output_sorted_duplicates(std::vector<int>* v)
{
   std::map<int, int> m;  

   // count how many of each element there are, putting results into map
   // map keys are elements in the vector, 
   // map values are the frequency of that element
   for (std::vector<int>::iterator vb = v->begin(); vb != v->end(); ++vb)
      ++m[*vb];

   // output keys whose values are 2 or more
   // the keys are already sorted by the map
   for (std::map<int, int>::iterator mb = m.begin(); mb != m.end(); ++mb)
      if ( (*mb).second >= 2 ) 
         std::cout << (*mb).first << " "; 
   std::cout << std::endl;
}

int main(void) 
{ 
   int initializer[] = { 4, 4, 1, 2, 3, 2, 3 };
   std::vector<int> data(&initializer[0], &initializer[0] + 7);
   output_sorted_duplicates(&data);
}

janks@phoenix:/tmp$ g++ test.cc && ./a.out
2 3 4

So, we visit each element in your vector once, and then each element in my map once, where the number of elements in my map is at worst no bigger than your vector. The drawback to my solution is that it uses a lot more storage space than the solutions that rearrange your vector in-place. The advantages, however, are clear. It's incredibly short and simple, it's obviously correct without the need for much testing or code review, and it has reasonable performance properties.

Making my function a template, and making it operate on STL-style ranges instead of just vectors of ints, is left as an exercise.

janks
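One possible shape for that templated, range-based version (my sketch, not code from the thread): the value type comes from iterator_traits and the duplicates are written through an output iterator.

#include <cstddef>
#include <iterator>
#include <map>

template <typename InputIterator, typename OutputIterator>
OutputIterator copy_sorted_duplicates(InputIterator first, InputIterator last,
                                      OutputIterator out)
{
    typedef typename std::iterator_traits<InputIterator>::value_type value_type;

    // Count how many times each value occurs.
    std::map<value_type, std::size_t> counts;
    for (; first != last; ++first)
        ++counts[*first];

    // Emit each key seen at least twice; map keys are already sorted.
    typename std::map<value_type, std::size_t>::const_iterator it;
    for (it = counts.begin(); it != counts.end(); ++it)
        if (it->second >= 2)
            *out++ = it->first;
    return out;
}

// e.g. copy_sorted_duplicates(data.begin(), data.end(), std::back_inserter(result));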
+2  A: 

Shorter and more STL-ish than previous entries. Assumes sorted input.

#include <algorithm>
#include <functional>
#include <iterator>

template< class I, class P >
I remove_unique( I first, I last, P pred = P() ) {
    I dest = first;
    while (
        ( first = std::adjacent_find( first, last, pred ) )
            != last ) {
        * dest = * first;
        ++ first;
        ++ dest;
        if ( ( first = std::adjacent_find( first, last, std::not2( pred ) ) )
            == last ) break;
        ++ first;
    }
    return dest;
}

template< class I >
I remove_unique( I first, I last ) {
    return remove_unique( first, last,
        std::equal_to< typename std::iterator_traits<I>::value_type >() );
}
Potatoswatter
+1; _Very_ nice. I wasn't familiar with `adjacent_find`. It's a shame this question has become community wiki.
James McNellis
@James: Thanks. I missed Jan's `mismatch` entry before; I think that might be more elegant. Mine would be better if I used `boost::bind` with `std::equal_to` and an alternating flag instead of alternating `not2`. But perhaps slower.
Potatoswatter