1. Sort the array O(n log n) at worst - mergesort/heapsort/binary tree sort etc
2. Compare neighbors and pull the matches out O(n)
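For illustration, a minimal sketch of those two steps, assuming the values are plain ints held in a std::vector (both the element type and the sample data are made up):

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
    std::vector<int> values = {3, 1, 4, 1, 5, 9, 2, 6, 5, 3};

    std::sort(values.begin(), values.end());              // step 1: O(n log n)

    for (std::size_t i = 1; i < values.size(); ++i)       // step 2: O(n) neighbour scan
    {
        // Report a value the first time it is seen to repeat.
        if (values[i] == values[i - 1] && (i < 2 || values[i] != values[i - 2]))
            std::cout << values[i] << " is a duplicate\n";
    }
}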
Yes, you can do much better.
Sort them (O(n) for simple integers with a counting/radix sort, O(n log n) in general); duplicates are then guaranteed to be adjacent, so finding them is a quick O(n) pass.
Use a hash table, also O(n). For each item, check whether it's already in the hash table; if so, it's a duplicate; if not, put it in the hash table.
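A minimal sketch of that check, with std::unordered_set playing the role of the hash table (the int element type and the data are just for illustration):

#include <iostream>
#include <unordered_set>
#include <vector>

int main()
{
    std::vector<int> values = {3, 1, 4, 1, 5, 9, 2, 6, 5};
    std::unordered_set<int> seen;

    for (int v : values)
    {
        // insert() returns {iterator, bool}; the bool is false when v was already present.
        if (!seen.insert(v).second)
            std::cout << v << " is a duplicate\n";
    }
}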
The method you're using seems to do O(N^2) comparisons:
for i = 0; i < length; ++i // will do length times
for j = i+1; j < length; ++j // will do length-i-1 times
compare
So for length 5 you do 4+3+2+1 = 10 compares; for 6 you do 15, and so on: (N^2)/2 - N/2, to be exact. N*log(N) is smaller for any reasonably large value of N.
How big is N in your case?
As far as reducing hash collisions goes, the best way is to get a better hash function :-D. Assuming that is not possible, if you can make a variant (e.g., a different modulus), you may be able to do a nested hash.
Have you tried sorting? For example using an algorithm like quick sort? If the performance is good enough for you then that would be an easy approach.
Keep a hash-table-based structure mapping value to count; if your C++ implementation doesn't offer std::hash_map (not really part of the C++ standard so far!-), use Boost or grab a version off the web. One pass over the collection (i.e., O(N)) lets you build the value->count mapping; one more pass over the hash table (<= O(N), clearly) identifies values with a count > 1 and emits them appropriately. Overall O(N), which is not the case for your suggestion.
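A hedged sketch of that two-pass idea, with std::unordered_map standing in for the non-standard hash_map (the int element type and sample data are assumptions):

#include <iostream>
#include <unordered_map>
#include <vector>

int main()
{
    std::vector<int> values = {7, 2, 7, 3, 2, 2};

    std::unordered_map<int, int> counts;
    for (int v : values)              // pass 1: build the value -> count mapping, O(N) expected
        ++counts[v];

    for (const auto& kv : counts)     // pass 2: walk the table and emit duplicates
        if (kv.second > 1)
            std::cout << kv.first << " occurs " << kv.second << " times\n";
}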
The simplest is probably to just sort the list, and then to iterate over it looking for dups.
If you know something about the data, more efficient algorithms are possible.
For example, if you knew the list was large, and only contained integers from 1..n where n is fairly small, you could use a pair of boolean arrays (or a bitmap), and do something like this:
bool[] once = new bool[MAXIMUM_VALUE];
bool[] many = new bool[MAXIMUM_VALUE];
for (int i = 0; i < list.Length; i++)
{
    if (once[ list[i] ])          // seen before: mark it as a duplicate
        many[ list[i] ] = true;
    else                          // first sighting of this value
        once[ list[i] ] = true;
}
Now many[] records which values were seen more than once.
If it is known to be a list of integers, and if it is known that they are all between A and B (e.g. A=0, B=9), make an array of B-A+1 elements, i.e. one container (or counter) per possible value.
In the very specific case of a list of plain integers, I propose that you merely count them, since you cannot tell equal integers apart anyway:
for (int i = 0; i < countOfNumbers; i++)
    counter[number[i] - A]++;   // counter[] has B-A+1 entries, initialised to zero
If they are distinguishable, create an array of lists, and add them to the respective list.
If they are not numbers, use a std::map or std::hash_map, mapping keys to lists of values.
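As a sketch of the key-to-list-of-values variant, assuming a made-up Item type whose key field decides equality (std::map used here; a hash map would work the same way):

#include <iostream>
#include <map>
#include <string>
#include <vector>

// Hypothetical element type: items that compare equal by key but carry extra data.
struct Item
{
    std::string key;
    int payload;
};

int main()
{
    std::vector<Item> items = { {"a", 1}, {"b", 2}, {"a", 3} };

    // Map each key to the list of items sharing it.
    std::map<std::string, std::vector<Item> > groups;
    for (const Item& it : items)
        groups[it.key].push_back(it);

    for (const auto& g : groups)
        if (g.second.size() > 1)
            std::cout << g.first << " appears " << g.second.size() << " times\n";
}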
Most people mentioning hash/unordered map solutions are assuming O(1) insertion and query time; however, it could be O(N) worst case. Additionally, there is the cost of hashing each object to consider.
Personally I would insert objects into a binary tree (O(log n) insertion for each) and keep a counter at each node. This yields O(n log n) construction time and an O(n) traversal to identify all duplicates.
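A minimal sketch of that counted-tree idea, using std::map (typically a red-black tree) so that each insertion costs O(log n); the int element type and sample data are assumptions:

#include <iostream>
#include <map>
#include <vector>

int main()
{
    std::vector<int> values = {5, 3, 5, 8, 3, 5};

    // One tree node per distinct value, holding a count.
    std::map<int, int> counter;
    for (int v : values)
        ++counter[v];                 // O(log n) per insertion, O(n log n) overall

    for (const auto& node : counter)  // O(n) in-order traversal
        if (node.second > 1)
            std::cout << node.first << " occurs " << node.second << " times\n";
}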
You could use a map from a representative element to a list/vector/deque of other elements. This requires relatively few comparisons during insertion into the container, and it means you can iterate through the resulting groups without having to perform any comparisons.
This example always inserts the first representative element into the mapped deque storage as well, since that makes the subsequent iteration through the group logically simple; but if this duplication proves a problem, it would be easy to perform the push_back only if (!ins_pair.second).
typedef std::map<Type, std::deque<Type> > Storage;
void Insert(Storage& s, const Type& t)
{
std::pair<Storage::iterator, bool> ins_pair( s.insert(std::make_pair(t, std::deque<Type>())) );
ins_pair.first->second.push_back(t);
}
Iterating through the groups is then (relatively) simple and cheap:
void Iterate(const Storage& s)
{
for (Storage::const_iterator i = s.begin(); i != s.end(); ++i)
{
for (std::deque<Type>::const_iterator j = i->second.begin(); j != i->second.end(); ++j)
{
// do something with *j
}
}
}
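For concreteness, a hypothetical driver that ties Insert and the group filtering together, with Type fixed to int and made-up sample data (it repeats the Insert definition above only so the sketch stands alone):

#include <cstddef>
#include <deque>
#include <iostream>
#include <map>
#include <utility>

typedef int Type;   // assumption: fix Type to int for this sketch
typedef std::map<Type, std::deque<Type> > Storage;

// Same Insert as above, repeated so the sketch is self-contained.
void Insert(Storage& s, const Type& t)
{
    std::pair<Storage::iterator, bool> ins_pair( s.insert(std::make_pair(t, std::deque<Type>())) );
    ins_pair.first->second.push_back(t);
}

int main()
{
    const Type data[] = {4, 7, 4, 1, 7, 7};   // made-up sample input
    Storage s;
    for (std::size_t i = 0; i != sizeof(data) / sizeof(data[0]); ++i)
        Insert(s, data[i]);

    // Emit only the values that occur more than once.
    for (Storage::const_iterator i = s.begin(); i != s.end(); ++i)
        if (i->second.size() > 1)
            std::cout << i->first << " x " << i->second.size() << "\n";
}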
I performed some experiments for comparison and object counts. In a test with 100000 objects in random order forming 50000 groups (i.e. an average of 2 objects per group), the above method cost the following number of comparisons and copies:
1630674 comparisons, 443290 copies
(I tried bringing the number of copies down, but only really managed to do so at the expense of comparisons, which seem to be the higher-cost operation in your scenario.)
Using a multimap, and retaining the previous element in the final iteration to detect the group transitions cost this:
1756208 comparisons, 100000 copies
Using a single list and popping the front element and performing a linear search for other group members cost:
1885879088 comparisons, 100000 copies
Yes, that's ~1.9b comparisons compared to ~1.6m for my best method. To get the list method to perform anywhere near an optimal number of comparisons it would have to be sorted and this is going to cost a similar number of comparisons as building an inherently ordered container would in the first place.
Edit
I took your posted code and ran the implied algorithm (I had to make some assumptions about the code, as there were some assumed definitions) over the same test data set as I used before, and I counted:
1885879088 comparisons, 420939 copies
i.e. exactly the same number of comparisons as my dumb list algorithm, but with more copies. I think that means we are using essentially similar algorithms for this case. I can't see any evidence of an alternative sort order, but it looks like you want a list of the groups which contain more than one equivalent element. This can be achieved simply in my Iterate function by adding an if (i->second.size() > 1) clause.
I still can't see any evidence that building a sorted container such as this map of deques isn't a good (even if not optimal) strategy.
If I've understood the question correctly, then this is the simplest solution I can think of:
std::vector<T> input;
typedef std::vector<T>::iterator iter;
std::vector<std::pair<iter, iter> > output;

std::sort(input.begin(), input.end()); // O(n lg n) comparisons

for (iter cur = input.begin(); cur != input.end();)
{
    // Find the whole run of elements equal to *cur: O(lg n) comparisons.
    std::pair<iter, iter> range = std::equal_range(cur, input.end(), *cur);
    cur = range.second;
    output.push_back(range);
}
Total running time: O(n lg n). You have one sorting pass, O(n lg n), and then a second pass where O(lg n) comparisons are performed for each group (so this is done at most n times, also yielding O(n lg n)).
Note that this depends on the input being a vector. Only random access iterators have logarithmic complexity in the second pass. Bidirectional iterators would be linear.
This doesn't rely on hashing (as requested), and it preserves all the original elements (rather than just returning one element for each group, along with a count of how often it occurred).
Of course, a number of smaller constant-factor optimizations are possible. Calling output.reserve(input.size()) would be a good idea, to avoid reallocation. input.end() is also evaluated much more often than necessary and could easily be cached.
Depending on how big the groups are assumed to be, equal_range may not be the most efficient choice. I assume it does a binary search to get logarithmic complexity, but if each group is only a couple of elements, a simple linear scan would have been faster. In any case, the initial sorting dominates the cost.