1. Sort the array O(n log n) at worst - mergesort/heapsort/binary tree sort etc
2. Compare neighbors and pull the matches out O(n)
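For illustration, a minimal sketch of those two steps, assuming the values are plain ints held in a std::vector (both the element type and the sample data are made up):

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
    std::vector<int> values = {3, 1, 4, 1, 5, 9, 2, 6, 5, 3};

    std::sort(values.begin(), values.end());              // step 1: O(n log n)

    for (std::size_t i = 1; i < values.size(); ++i)       // step 2: O(n) neighbour scan
    {
        // Report a value the first time it is seen to repeat.
        if (values[i] == values[i - 1] && (i < 2 || values[i] != values[i - 2]))
            std::cout << values[i] << " is a duplicate\n";
    }
}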
Yes, you can do much better.
Sort them (O(n) for simple integers with a counting/radix sort, O(n log n) in general); duplicates are then guaranteed to be adjacent, so finding them is a quick O(n) pass.
Use a hash table, also O(n). For each item, check whether it's already in the hash table; if so, it's a duplicate; if not, put it in the hash table.
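A minimal sketch of that check, with std::unordered_set playing the role of the hash table (the int element type and the data are just for illustration):

#include <iostream>
#include <unordered_set>
#include <vector>

int main()
{
    std::vector<int> values = {3, 1, 4, 1, 5, 9, 2, 6, 5};
    std::unordered_set<int> seen;

    for (int v : values)
    {
        // insert() returns {iterator, bool}; the bool is false when v was already present.
        if (!seen.insert(v).second)
            std::cout << v << " is a duplicate\n";
    }
}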
The method you're using seems to do O(N^2) comparisons:
for i = 0; i < length; ++i // will do length times
for j = i+1; j < length; ++j // will do length-i-1 times
compare
So for length 5 you do 4+3+2+1 = 10 compares; for 6 you do 15, and so on: (N^2)/2 - N/2, to be exact. N*log(N) is smaller for any reasonably large value of N.
How big is N in your case?
As far as reducing hash collisions goes, the best way is to get a better hash function :-D. Assuming that is not possible, if you can make a variant (e.g., a different modulus), you may be able to do a nested hash.
Have you tried sorting? For example using an algorithm like quick sort? If the performance is good enough for you then that would be an easy approach.
Keep a hash-table-based structure mapping value to count; if your C++ implementation doesn't offer std::hash_map (not really part of the C++ standard so far!-), use Boost or grab a version off the web. One pass over the collection (i.e., O(N)) lets you build the value->count mapping; one more pass over the hash table (<= O(N), clearly) identifies values with a count > 1 and emits them appropriately. Overall O(N), which is not the case for your suggestion.
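A hedged sketch of that two-pass idea, with std::unordered_map standing in for the non-standard hash_map (the int element type and sample data are assumptions):

#include <iostream>
#include <unordered_map>
#include <vector>

int main()
{
    std::vector<int> values = {7, 2, 7, 3, 2, 2};

    std::unordered_map<int, int> counts;
    for (int v : values)              // pass 1: build the value -> count mapping, O(N) expected
        ++counts[v];

    for (const auto& kv : counts)     // pass 2: walk the table and emit duplicates
        if (kv.second > 1)
            std::cout << kv.first << " occurs " << kv.second << " times\n";
}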
The simplest is probably to just sort the list, and then to iterate over it looking for dups.
If you know something about the data, more efficient algorithms are possible.
For example, if you knew the list was large, and only contained integers from 1..n where n is fairly small, you could use a pair of boolean arrays (or a bitmap), and do something like this:
bool[] once = new bool[MAXIMUM_VALUE];
bool[] many = new bool[MAXIMUM_VALUE];
for (int i = 0; i < list.Length; i++)
{
    if (once[ list[i] ])          // seen before: mark it as a duplicate
        many[ list[i] ] = true;
    else                          // first sighting of this value
        once[ list[i] ] = true;
}
Now many[] records which values were seen more than once.
If it is known to be a list of integers, and if it is known that they are all between A and B (e.g. A=0, B=9), make an array of B-A+1 elements, i.e. one container (or counter) per possible value.
In the very specific case of a list of plain integers, I propose that you merely count them, since you cannot tell equal integers apart anyway:
for (int i = 0; i < countOfNumbers; i++)
    counter[number[i] - A]++;   // counter[] has B-A+1 entries, initialised to zero
If they are distinguishable, create an array of lists, and add them to the respective list.
If they are not numbers, use a std::map or std::hash_map, mapping keys to lists of values.
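As a sketch of the key-to-list-of-values variant, assuming a made-up Item type whose key field decides equality (std::map used here; a hash map would work the same way):

#include <iostream>
#include <map>
#include <string>
#include <vector>

// Hypothetical element type: items that compare equal by key but carry extra data.
struct Item
{
    std::string key;
    int payload;
};

int main()
{
    std::vector<Item> items = { {"a", 1}, {"b", 2}, {"a", 3} };

    // Map each key to the list of items sharing it.
    std::map<std::string, std::vector<Item> > groups;
    for (const Item& it : items)
        groups[it.key].push_back(it);

    for (const auto& g : groups)
        if (g.second.size() > 1)
            std::cout << g.first << " appears " << g.second.size() << " times\n";
}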
Most people mentioning hash/unordered map solutions are assuming O(1) insertion and query time; however, it could be O(N) worst case. Additionally, there is the cost of hashing each object to consider.
Personally I would insert objects into a binary tree (O(log n) insertion for each) and keep a counter at each node. This yields O(n log n) construction time and an O(n) traversal to identify all duplicates.
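A minimal sketch of that counted-tree idea, using std::map (typically a red-black tree) so that each insertion costs O(log n); the int element type and sample data are assumptions:

#include <iostream>
#include <map>
#include <vector>

int main()
{
    std::vector<int> values = {5, 3, 5, 8, 3, 5};

    // One tree node per distinct value, holding a count.
    std::map<int, int> counter;
    for (int v : values)
        ++counter[v];                 // O(log n) per insertion, O(n log n) overall

    for (const auto& node : counter)  // O(n) in-order traversal
        if (node.second > 1)
            std::cout << node.first << " occurs " << node.second << " times\n";
}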
You could use a map from a representative element to a list/vector/deque of other elements. This requires relatively few comparisons during insertion into the container, and it means you can iterate through the resulting groups without having to perform any comparisons.
This example always inserts the first representative element into the mapped deque storage as well, since that makes the subsequent iteration through the group logically simple; but if this duplication proves a problem, it would be easy to perform the push_back only if (!ins_pair.second).
typedef std::map<Type, std::deque<Type> > Storage;
void Insert(Storage& s, const Type& t)
{
std::pair<Storage::iterator, bool> ins_pair( s.insert(std::make_pair(t, std::deque<Type>())) );
ins_pair.first->second.push_back(t);
}
Iterating through the groups is then (relatively) simple and cheap:
void Iterate(const Storage& s)
{
for (Storage::const_iterator i = s.begin(); i != s.end(); ++i)
{
for (std::deque<Type>::const_iterator j = i->second.begin(); j != i->second.end(); ++j)
{
// do something with *j
}
}
}
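For concreteness, a hypothetical driver that ties Insert and the group filtering together, with Type fixed to int and made-up sample data (it repeats the Insert definition above only so the sketch stands alone):

#include <cstddef>
#include <deque>
#include <iostream>
#include <map>
#include <utility>

typedef int Type;   // assumption: fix Type to int for this sketch
typedef std::map<Type, std::deque<Type> > Storage;

// Same Insert as above, repeated so the sketch is self-contained.
void Insert(Storage& s, const Type& t)
{
    std::pair<Storage::iterator, bool> ins_pair( s.insert(std::make_pair(t, std::deque<Type>())) );
    ins_pair.first->second.push_back(t);
}

int main()
{
    const Type data[] = {4, 7, 4, 1, 7, 7};   // made-up sample input
    Storage s;
    for (std::size_t i = 0; i != sizeof(data) / sizeof(data[0]); ++i)
        Insert(s, data[i]);

    // Emit only the values that occur more than once.
    for (Storage::const_iterator i = s.begin(); i != s.end(); ++i)
        if (i->second.size() > 1)
            std::cout << i->first << " x " << i->second.size() << "\n";
}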
I performed some experiments for comparison and object counts. In a test with 100000 objects in random order forming 50000 groups (i.e. an average of 2 objects per group), the above method cost the following number of comparisons and copies:
1630674 comparisons, 443290 copies
(I tried bringing the number of copies down, but only really managed to do so at the expense of comparisons, which seem to be the higher-cost operation in your scenario.)
Using a multimap, and retaining the previous element in the final iteration to detect the group transitions cost this:
1756208 comparisons, 100000 copies
Using a single list and popping the front element and performing a linear search for other group members cost:
1885879088 comparisons, 100000 copies
Yes, that's ~1.9b comparisons compared to ~1.6m for my best method. To get the list method to perform anywhere near an optimal number of comparisons it would have to be sorted and this is going to cost a similar number of comparisons as building an inherently ordered container would in the first place.
Edit
I took your posted code and ran the implied algorithm (I had to make some assumptions about the code, as there were some assumed definitions) over the same test data set as I used before, and I counted:
1885879088 comparisons, 420939 copies
i.e. exactly the same number of comparisons as my dumb list algorithm, but with more copies. I think that means we are using essentially similar algorithms for this case. I can't see any evidence of an alternative sort order, but it looks like you want a list of the groups which contain more than one equivalent element. This can be achieved simply in my Iterate function by adding an if (i->second.size() > 1) clause.
I still can't see any evidence that building a sorted container such as this map of deques isn't a good (even if not optimal) strategy.
If I've understood the question correctly, then this is the simplest solution I can think of:
std::vector<T> input;
typedef std::vector<T>::iterator iter;
std::vector<std::pair<iter, iter> > output;

std::sort(input.begin(), input.end()); // O(n lg n) comparisons

for (iter cur = input.begin(); cur != input.end();)
{
    // Find the whole run of elements equal to *cur: O(lg n) comparisons.
    std::pair<iter, iter> range = std::equal_range(cur, input.end(), *cur);
    cur = range.second;
    output.push_back(range);
}
Total running time: O(n lg n). You have one sorting pass, O(n lg n), and then a second pass where O(lg n) comparisons are performed for each group (so this is done at most n times, also yielding O(n lg n)).
Note that this depends on the input being a vector. Only random access iterators have logarithmic complexity in the second pass. Bidirectional iterators would be linear.
This doesn't rely on hashing (as requested), and it preserves all the original elements (rather than just returning one element for each group, along with a count of how often it occurred).
Of course, a number of smaller constant-factor optimizations are possible. Calling output.reserve(input.size()) would be a good idea, to avoid reallocation. input.end() is also evaluated much more often than necessary and could easily be cached.
Depending on how big the groups are assumed to be, equal_range may not be the most efficient choice. I assume it does a binary search to get logarithmic complexity, but if each group is only a couple of elements, a simple linear scan would have been faster. In any case, the initial sorting dominates the cost.