ansaurus

Question

How to find out which numbers occur the most in a given group of numbers?

Answer 1

+17 A:

Sort the array and then do a quick pass to count each number. The algorithm has O(N*logN) complexity.

Alternatively, create a hash table, using the number as the key. Store in the hashtable a counter for each element you've keyed. You'll be able to count all elements in one pass; however, the complexity of the algorithm now depends on the complexity of your hasing function.

Franci Penov 2008-09-28 09:44:55

Shouldn't it be O(N+N*logN)?

Torsten Marek 2008-09-28 09:48:26

yeah thats what i was thinking

Midhat 2008-09-28 09:56:03

Err, yeah. It's 3am here, and I have a three week old baby, if that counts as an excuse. :-)

Franci Penov 2008-09-28 10:03:15

O(N+N*logN) = O(N*logN)

Alexander Kojevnikov 2008-09-28 10:03:20

There's no need for an excuse - after all, SO is a collaborative effort :)

Torsten Marek 2008-09-28 10:04:42

Oh, and of course congrats to the new parents then!

Torsten Marek 2008-09-28 10:09:09

Ok, but why sort?

Sklivvz 2008-09-28 10:19:53

Because once you sort, you don't need external counter for each number, you can keep only one counter for the current number and one for the highet count number so far.

Franci Penov 2008-09-28 10:29:26

Ok, I assume that the array is small enough, after all it's in memory to start with.

Sklivvz 2008-09-28 10:36:08

The only assumption you can make is that the array is small enough to fit in the available memory. You don't necessarily have enough memory to fit one more array of the same size. Even if you have enough virtual memory, you still might end up paging and lose any speed gains from your algorithm.

Franci Penov 2008-09-28 10:42:59

Not to mention that you might not even have virtual memory and something in the order of 64K physical memory. Yep, some people still write code for such devices. At least I am done with that part of my life. :-)

Franci Penov 2008-09-28 10:46:10

If the array is large enough, you can read iteratively from disk, whereas I am not sure that you can sort from disk? (open question)

Sklivvz 2008-09-28 10:49:56

If the array is big enough to not fit in the memory, that's a whole another story. :-) I don't argue with the assumption it fits in the memory. However, you can't assume anything about the available memory after you've read the array. You might not have enough for counters for all elements.

Franci Penov 2008-09-28 11:00:04

Why a hash set and not a tree set? These are numbers, more likely to get O(logN) on the seek.

Uri 2008-09-28 17:06:12

Well, because it was late in the night and didn't think my answer all the way through. :-)More seriously - a hash table can have O(1) complexity (and huge memory footprint with it, of course) depending on the hash function.

Franci Penov 2008-09-28 21:00:08

And a hash, by definition, doesn't guarantee uniqueness, so your second approach won't work.

Matt Cruikshank 2008-10-20 03:48:33

Hashing is generally considered O(1), assuming the words/numbers to be hashed do not scale with the problem size. As far as collisions are concerned; in a good hashing function, these too will be amortized to O(1) time. Trees will have O(log(n)) r/w. The answer is good as is.

ejgottl 2008-10-20 04:04:32

Answer 2

+1 A:

a bit of pseudo-code:

//split string into array firts
strsplit(numbers) //PHP function name to split a string into it's components
i=0
while( i < count(array))
 {
   if(isset(list[array[i]]))
    {
      list[array[i]]['count'] = list + 1
    }
   else
    {
      list[i]['count'] = 1
      list[i]['number']
    }
   i=i+1
 }
usort(list) //usort is a php function that sorts an array by its value not its key, Im assuming that you have something in c++ that does this
print list[0]['number'] //Should contain the most used number

Unkwntech 2008-09-28 09:46:02

ya, fixed... I spend to much time in a linux shell....

Unkwntech 2008-09-28 10:01:28

:) I wonder what "sudo code" would do, if it existed...

Torsten Marek 2008-09-28 10:06:23

it does exist and does something like this "rm -rf /".

Unkwntech 2008-09-28 20:53:57

Answer 3

+9 A:

Optimized for space:

Quicksort (for example) then iterate over the items, keeping track of largest count only. At best O(N log N).

Optimized for speed:

Iterate over all elements, keeping track of the separate counts. This algorithm will always be O(n).

Sklivvz 2008-09-28 10:17:17

If you sort, you only have to keep the length of the longest sequence of one number. If you don't sort, you have to keep the counts of all numbers in an associative container.

Torsten Marek 2008-09-28 10:19:29

If you keep track of the count of each element, worst case will require N counters. You've pretty much doubled the memory you need. Granted for a 4GB memory machine that's not going to be that big of a problem. However, for a 64K mem shared with the OS one you might want to sort.

Franci Penov 2008-09-28 10:35:59

@Franci Penov: the whole point is - the question says "best", and the answer depends on the sense of "best"

Sklivvz 2008-09-28 10:51:46

Yep, I agree. That's why I offered two alternative solutions - sorting or a hashtable of counters. :-) Just wanted to point the memory consumption drawback of the second algorithm. Memory is also important, not only speed.

Franci Penov 2008-09-28 11:03:35

Isn't the bigger problem with the "optimizied for speed" version that you need an array of size equal to the largest possible number to keep O(n)? Otherwise you need a tree for O(n*log n) or hash for O(who knows)?

Mark Santesson 2008-09-28 13:38:30

Answer 4

+2 A:

If you have the RAM and your values are not too large, use counting sort.

Thorsten79 2008-09-28 10:47:35

Answer 5

+1 A:

A possible C++ implementation that makes use of STL could be:

#include <iostream>
#include <algorithm>
#include <map>

// functor
struct maxoccur
{
    int _M_val;
    int _M_rep;

    maxoccur()
    : _M_val(0),
      _M_rep(0)
    {}

    void operator()(const std::pair<int,int> &e)
    {
        std::cout << "pair: " << e.first << " " << e.second << std::endl;
        if ( _M_rep < e.second ) {
            _M_val = e.first;
            _M_rep = e.second;
        }
    }
};

int
main(int argc, char *argv[])
{
    int a[] = {2,456,34,3456,2,435,2,456,2};
    std::map<int,int> m; 

    // load the map
    for(unsigned int i=0; i< sizeof(a)/sizeof(a[0]); i++) 
        m [a[i]]++;

    // find the max occurence...
    maxoccur ret = std::for_each(m.begin(), m.end(), maxoccur());
    std::cout << "value:" << ret._M_val << " max repetition:" << ret._M_rep <<  std::endl;

    return 0;
}

Nicola Bonelli 2008-09-28 10:49:35

Answer 6

A:

If the range of elements is large compared with the number of elements, I would, as others have said, just sort and scan. This is time n*log n and no additional space (maybe log n additional).

THe problem with the counting sort is that, if the range of values is large, it can take more time to initialize the count array than to sort.

2008-09-28 16:47:37

Answer 7

+1 A:

The hash algorithm (build count[i] = #occurrences(i) in basically linear time) is very practical, but is theoretically not strictly O(n) because there could be hash collisions during the process.

An interesting special case of this question is the majority algorithm, where you want to find an element which is present in at least n/2 of the array entries, if any such element exists.

Here is a quick explanation, and a more detailed explanation of how to do this in linear time, without any sort of hash trickiness.

Tyler 2008-09-28 20:26:52

Answer 8

A:

Please change the title from "How to count which number occurs most in a set of numbers ?". A 'set' cannot contain duplicates. Use 'list' or 'array' like in the text of your question instead.

2008-10-20 03:31:22

Answer 9

A:

Here's my complete, tested, version, using a std::tr1::unordered_map.

I make this approximately O(n). Firstly it iterates through the n input values to insert/update the counts in the unordered_map, then it does a partial_sort_copy which is O(n). 2*O(n) ~= O(n).

#include <unordered_map>
#include <vector>
#include <algorithm>
#include <iostream>

namespace {
// Only used in most_frequent but can't be a local class because of the member template
struct second_greater {
    // Need to compare two (slightly) different types of pairs
    template <typename PairA, typename PairB>
    bool operator() (const PairA& a, const PairB& b) const
        { return a.second > b.second; }
};
}

template <typename Iter>
std::pair<typename std::iterator_traits<Iter>::value_type, unsigned int>
most_frequent(Iter begin, Iter end)
{
    typedef typename std::iterator_traits<Iter>::value_type value_type;
    typedef std::pair<value_type, unsigned int> result_type;

    std::tr1::unordered_map<value_type, unsigned int> counts;

    for(; begin != end; ++begin)
        // This is safe because new entries in the map are defined to be initialized to 0 for
        // built-in numeric types - no need to initialize them first
        ++ counts[*begin];

    // Only need the top one at this point (could easily expand to top-n)
    std::vector<result_type> top(1);

    std::partial_sort_copy(counts.begin(), counts.end(),
                           top.begin(), top.end(), second_greater());

    return top.front();
}

int main(int argc, char* argv[])
{
    int a[] = { 2, 456, 34, 3456, 2, 435, 2, 456, 2 };

    std::pair<int, unsigned int> m = most_frequent(a, a + (sizeof(a) / sizeof(a[0])));

    std::cout << "most common = " << m.first << " (" << m.second << " instances)" << std::endl;
    assert(m.first == 2);
    assert(m.second == 4);

    return 0;
}

Alastair 2008-11-18 05:19:43

Answer 10

A:

2009-03-23 04:29:24

ansaurus

tags:

views:

answers:

How to find out which numbers occur the most in a given group of numbers?

related questions