views:

1605

answers:

10

Suppose we have a vector/array in C++ and we wish to count which of these N elements has maximum repetitive occurrences and output the highest count. Which algorithm is best suited for this job.

example:

int a = { 2, 456, 34, 3456, 2, 435, 2, 456, 2}

the output is 4 because 2 occurs 4 times. That is the maximum number of times 2 occurs.

+17  A: 

Sort the array and then do a quick pass to count each number. The algorithm has O(N*logN) complexity.

Alternatively, create a hash table, using the number as the key. Store in the hashtable a counter for each element you've keyed. You'll be able to count all elements in one pass; however, the complexity of the algorithm now depends on the complexity of your hasing function.

Franci Penov
Shouldn't it be O(N+N*logN)?
Torsten Marek
yeah thats what i was thinking
Midhat
Err, yeah. It's 3am here, and I have a three week old baby, if that counts as an excuse. :-)
Franci Penov
O(N+N*logN) = O(N*logN)
Alexander Kojevnikov
There's no need for an excuse - after all, SO is a collaborative effort :)
Torsten Marek
Oh, and of course congrats to the new parents then!
Torsten Marek
Ok, but why sort?
Sklivvz
Because once you sort, you don't need external counter for each number, you can keep only one counter for the current number and one for the highet count number so far.
Franci Penov
Ok, I assume that the array is small enough, after all it's in memory to start with.
Sklivvz
The only assumption you can make is that the array is small enough to fit in the available memory. You don't necessarily have enough memory to fit one more array of the same size. Even if you have enough virtual memory, you still might end up paging and lose any speed gains from your algorithm.
Franci Penov
Not to mention that you might not even have virtual memory and something in the order of 64K physical memory. Yep, some people still write code for such devices. At least I am done with that part of my life. :-)
Franci Penov
If the array is large enough, you can read iteratively from disk, whereas I am not sure that you can sort from disk? (open question)
Sklivvz
If the array is big enough to not fit in the memory, that's a whole another story. :-) I don't argue with the assumption it fits in the memory. However, you can't assume anything about the available memory after you've read the array. You might not have enough for counters for all elements.
Franci Penov
Why a hash set and not a tree set? These are numbers, more likely to get O(logN) on the seek.
Uri
Well, because it was late in the night and didn't think my answer all the way through. :-)More seriously - a hash table can have O(1) complexity (and huge memory footprint with it, of course) depending on the hash function.
Franci Penov
And a hash, by definition, doesn't guarantee uniqueness, so your second approach won't work.
Matt Cruikshank
Hashing is generally considered O(1), assuming the words/numbers to be hashed do not scale with the problem size. As far as collisions are concerned; in a good hashing function, these too will be amortized to O(1) time. Trees will have O(log(n)) r/w. The answer is good as is.
ejgottl
+1  A: 

a bit of pseudo-code:

//split string into array firts
strsplit(numbers) //PHP function name to split a string into it's components
i=0
while( i < count(array))
 {
   if(isset(list[array[i]]))
    {
      list[array[i]]['count'] = list + 1
    }
   else
    {
      list[i]['count'] = 1
      list[i]['number']
    }
   i=i+1
 }
usort(list) //usort is a php function that sorts an array by its value not its key, Im assuming that you have something in c++ that does this
print list[0]['number'] //Should contain the most used number
Unkwntech
ya, fixed... I spend to much time in a linux shell....
Unkwntech
:) I wonder what "sudo code" would do, if it existed...
Torsten Marek
it does exist and does something like this "rm -rf /".
Unkwntech
+9  A: 

Optimized for space:

Quicksort (for example) then iterate over the items, keeping track of largest count only. At best O(N log N).

Optimized for speed:

Iterate over all elements, keeping track of the separate counts. This algorithm will always be O(n).

Sklivvz
If you sort, you only have to keep the length of the longest sequence of one number. If you don't sort, you have to keep the counts of all numbers in an associative container.
Torsten Marek
If you keep track of the count of each element, worst case will require N counters. You've pretty much doubled the memory you need. Granted for a 4GB memory machine that's not going to be that big of a problem. However, for a 64K mem shared with the OS one you might want to sort.
Franci Penov
@Franci Penov: the whole point is - the question says "best", and the answer depends on the sense of "best"
Sklivvz
Yep, I agree. That's why I offered two alternative solutions - sorting or a hashtable of counters. :-) Just wanted to point the memory consumption drawback of the second algorithm. Memory is also important, not only speed.
Franci Penov
Isn't the bigger problem with the "optimizied for speed" version that you need an array of size equal to the largest possible number to keep O(n)? Otherwise you need a tree for O(n*log n) or hash for O(who knows)?
Mark Santesson
+2  A: 

If you have the RAM and your values are not too large, use counting sort.

Thorsten79
+1  A: 

A possible C++ implementation that makes use of STL could be:

#include <iostream>
#include <algorithm>
#include <map>

// functor
struct maxoccur
{
    int _M_val;
    int _M_rep;

    maxoccur()
    : _M_val(0),
      _M_rep(0)
    {}

    void operator()(const std::pair<int,int> &e)
    {
        std::cout << "pair: " << e.first << " " << e.second << std::endl;
        if ( _M_rep < e.second ) {
            _M_val = e.first;
            _M_rep = e.second;
        }
    }
};

int
main(int argc, char *argv[])
{
    int a[] = {2,456,34,3456,2,435,2,456,2};
    std::map<int,int> m; 

    // load the map
    for(unsigned int i=0; i< sizeof(a)/sizeof(a[0]); i++) 
        m [a[i]]++;

    // find the max occurence...
    maxoccur ret = std::for_each(m.begin(), m.end(), maxoccur());
    std::cout << "value:" << ret._M_val << " max repetition:" << ret._M_rep <<  std::endl;

    return 0;
}
Nicola Bonelli
A: 

If the range of elements is large compared with the number of elements, I would, as others have said, just sort and scan. This is time n*log n and no additional space (maybe log n additional).

THe problem with the counting sort is that, if the range of values is large, it can take more time to initialize the count array than to sort.

+1  A: 

The hash algorithm (build count[i] = #occurrences(i) in basically linear time) is very practical, but is theoretically not strictly O(n) because there could be hash collisions during the process.

An interesting special case of this question is the majority algorithm, where you want to find an element which is present in at least n/2 of the array entries, if any such element exists.

Here is a quick explanation, and a more detailed explanation of how to do this in linear time, without any sort of hash trickiness.

Tyler
A: 

Please change the title from "How to count which number occurs most in a set of numbers ?". A 'set' cannot contain duplicates. Use 'list' or 'array' like in the text of your question instead.

A: 

Here's my complete, tested, version, using a std::tr1::unordered_map.

I make this approximately O(n). Firstly it iterates through the n input values to insert/update the counts in the unordered_map, then it does a partial_sort_copy which is O(n). 2*O(n) ~= O(n).

#include <unordered_map>
#include <vector>
#include <algorithm>
#include <iostream>

namespace {
// Only used in most_frequent but can't be a local class because of the member template
struct second_greater {
    // Need to compare two (slightly) different types of pairs
    template <typename PairA, typename PairB>
    bool operator() (const PairA& a, const PairB& b) const
        { return a.second > b.second; }
};
}

template <typename Iter>
std::pair<typename std::iterator_traits<Iter>::value_type, unsigned int>
most_frequent(Iter begin, Iter end)
{
    typedef typename std::iterator_traits<Iter>::value_type value_type;
    typedef std::pair<value_type, unsigned int> result_type;

    std::tr1::unordered_map<value_type, unsigned int> counts;

    for(; begin != end; ++begin)
        // This is safe because new entries in the map are defined to be initialized to 0 for
        // built-in numeric types - no need to initialize them first
        ++ counts[*begin];

    // Only need the top one at this point (could easily expand to top-n)
    std::vector<result_type> top(1);

    std::partial_sort_copy(counts.begin(), counts.end(),
                           top.begin(), top.end(), second_greater());

    return top.front();
}

int main(int argc, char* argv[])
{
    int a[] = { 2, 456, 34, 3456, 2, 435, 2, 456, 2 };

    std::pair<int, unsigned int> m = most_frequent(a, a + (sizeof(a) / sizeof(a[0])));

    std::cout << "most common = " << m.first << " (" << m.second << " instances)" << std::endl;
    assert(m.first == 2);
    assert(m.second == 4);

    return 0;
}
Alastair
A: