views:

494

answers:

12

I need a help in making an algorithm for solving one problem: There is a row with numbers which appear different times in the row, and i need to find the number that appears the most and how many times it's in the row, ex:

1-1-5-1-3-7-2-1-8-9-1-2

That would be 1 and it appears 5 times.

The algorithm should be fast (that's my problem). Any ideas ?

+10  A: 

You could keep hash table and store a count of every element in that structure, like this

h[1] = 5
h[5] = 1
...
ArsenMkrt
"Hash table" in C++ is implemented as `std::map` class.
Pavel Shved
Actually, a "Hash table" is different than a `std::map`. In a **hash table** the key is *hashed*, or converted, to a unique value then used as an index into the container. The `std::map` is a container of [key, value] pairs. Some implementations may augment the STL by providing a hash table.
Thomas Matthews
For now there is `std::tr1::unordered_map`.
Jason
+8  A: 

What you're looking for is called the mode. You can sort the array, then look for the longest repeating sequence.

Bill the Lizard
This is `O(n log n)` time and `O(1)` space. By giving up on the space you can do this in `O(n)`. Obviously this is best possible.
Jason
@Jason: For the example set of numbers given, the O(n) solution would be fine in terms of space. We don't really know the range of values that might be in the set, though, so it could be quite large.
Bill the Lizard
No you just use, say, a dictionary and then the range doesn't matter.
Jason
@Jason, when you say 'just dictinary', you don't think that complexity of calculating a hash value of an *arbitrarly large* number should be taken into account as well, do you?
Pavel Shved
@Pavel Shved: Unless the OP explicitly states otherwise I am assuming that we are dealing with a primitive numerical type (say, `int`). Therefore, the complexity of computing the hash is `O(1)`.
Jason
Jason: Obviously deciding "best" (and "better") depends on *all* the requirements, and the OP has barely listed any time or space requirements. Additionally, while lookup and insertion in a hash table is O(1), that's not the full story---in general big-oh notation hides important considerations, even while being useful itself.
Roger Pate
This method is preferred when all the items are read in as a block. For performance, the algorithm may read in all the numbers first, then sort. Dictionaries or associative containers may not be the ultimate solution.
Thomas Matthews
+6  A: 

You can't get it any faster than in linear time, as you need to at least look at each number once.

If you know that the numbers are in a certain range, you can use an additional array to sum up the occurrences of each number, otherwise you'd need a hashtable, which is slightly slower.

Both of these need additional space though and you need to loop through the counts again in the end to get the result.

Unless you really have a huge amount of numbers and absolutely require O(n) runtime, you could simply sort your array of numbers. Then you can walk once through the numbers and simply keep the count of the current number and the number with the maximum of occurences in two variables. So you save yourself a lot of space, tradeing it off with a little bit of time.

Frank
"you need to at least look at each number once." -- this statement is false.
Pavel Shved
In some input cases, you don't need to look at every input number to find the one that appears the most, but you *still* need to look at every input number to find "how many times [the most frequent one] is in the row". Frank is correct.
Roger Pate
Seems obvious to me. If I don't look at x, and x is a copy of the winner, then how can I correctly report the number of times x occurs.
GregS
@Roger, @GregS: Yes, sorry, I missed "how many times" part in the question.
Pavel Shved
+1  A: 

Here is a simple one, that is O(n log n):

Sort the vector @ O(n log n)
Create vars: int MOST, VAL, CURRENT
for ELEMENT in LIST:
    CURRENT += 1
    if CURRENT >= MOST:
        MOST = CURRENT
        VAL = ELEMENT
return (VAL, MOST)
Gregg Lind
Frank beat me too this one! O(n log n) timing, but doesn't need much additional space. And this assumes of course that you have the whole list (i.e., it's not a stream).
Gregg Lind
almost looks like python :-)
Kugel
It *is* valid Python, except for the first two lines. :)
Roger Pate
You'd have to initialize CURRENT as well!
Gregg Lind
+1  A: 

There are few methods:

Universal method is "sort it and find longest subsequence" which is O(nlog n). The fastest sort algorithm is quicksort ( average, the worst is O( n^2 ) ). Also you can use heapsort but it is quite slower in average case but asymptotic complexity is O( n log n ) also in the worst case.

If you have some information about numbers then you can use some tricks. If numbers are from the limited range then you can use part of algorithm for counting sort. It is O( n ).

If this isn't your case, there are some other sort algorithms which can do it in linear time but no one is universal.

Gaim
quicksort is hardly the fastest algorithm.
Kugel
Yes, in the average case. See the statistics
Gaim
Algorithms with the linear complexity aren't usable in general case and from algorithms based on comparison is QS the fastest in average case.
Gaim
A: 
Hamish Grubijan
where do you use compare function?
Kugel
If you are going to code this in python 3.1, look up `collections.Counter` which is exactly suited for this problem. http://docs.python.org/dev/py3k/library/collections.html
hughdbrown
Thanks, it was unnecessary - I took it out.
Hamish Grubijan
Well, I wanted to use some common data structures ...
Hamish Grubijan
+3  A: 

There is an algorithm that solves your problem in linear time (linear in the number of items in the input). The idea is to use a hash table to associate to each value in the input a count indicating the number of times that value has been seen. You will have to profile against your expected input and see if this meets your needs.

Please note that this uses O(n) extra space. If this is not acceptable, you might want to consider sorting the input as others have proposed. That solution will be O(n log n) in time and O(1) in space.

Here's an implementation in C++ using std::tr1::unordered_map:

#include <iostream>
#include <unordered_map>

using namespace std;
using namespace std::tr1;

typedef std::tr1::unordered_map<int, int> map;

int main() {
    map m;

    int a[12] = {1, 1, 5, 1, 3, 7, 2, 1, 8, 9, 1, 2};
    for(int i = 0; i < 12; i++) {
        int key = a[i];
        map::iterator it = m.find(key);
        if(it == m.end()) {
            m.insert(map::value_type(key, 1));
        }
        else {
            it->second++;
        }
    }
    int count = 0;
    int value;
    for(map::iterator it = m.begin(); it != m.end(); it++) {
        if(it->second > count) {
            count = it->second;
            value = it->first;
        }
    }

    cout << "Value: " << value << endl;
    cout << "Count: " << count << endl;
}

The algorithm works using the input integers as keys in a hashtable to a count of the number of times each integer appears. Thus the key (pun intended) to the algorithm is building this hash table:

int key = a[i];
map::iterator it = m.find(key);
if(it == m.end()) {
    m.insert(map::value_type(key, 1));
}
else {
    it->second++;
}

So here we are looking at the ith element in our input list. Then what we do is we look to see if we've already seen it. If we haven't, we add a new value to our hash table containing this new integer, and an initial count of one indicating this is our first time seeing it. Otherwise, we increment the counter associated to this value.

Once we have built this table, it's simply a matter of running through the values to find one that appears the most:

int count = 0;
int value;
for(map::iterator it = m.begin(); it != m.end(); it++) {
    if(it->second > count) {
        count = it->second;
        value = it->first;
    }
}

Currently there is no logic to handle the case of two distinct values appearing the same number of times and that number of times being the largest amongst all the values. You can handle that yourself depending on your needs.

Jason
You probably don't want to pay the price of the final iteration over all elements in the map of value->count.You can keep track of the value with the highest count (and that count) as you go along.
A: 

The best time complexity you can get here is O(n). You have to look through all elements, because the last element may be the one which determines the mode.

The solution depends on whether time or space is more important.

If space is more important, then you can sort the list then find the longest sequence of consecutive elements.

If time is more important, you can iterate through the list, keeping a count of the number of occurences of each element (e.g. hashing element -> count). While doing this, keep track of the element with max count, switching if necessary.

If you also happen know that the mode is the majority element (i.e. there are more than n/2 elements in the array with this value), then you can get O(n) speed and O(1) space efficiency.

A: 

Python 2.6

>>> from collections import defaultdict
>>> lst = [1,1,5,1,3,7,2,1,8,9,1,2]
>>> d = defaultdict(int)
>>> for i in lst:
...     d[i] += 1
...
>>> max_occurring = max((v, k) for k, v in d.items())
>>> print "%d occurs %d times" % (max_occurring[1], max_occurring[0])
1 occurs 5 times
hughdbrown
A: 

The solution below gives you the count of each number. It is a better approach than using map in terms of time and space. If you need to get the number that appeared most number of times, then this is not better than previous ones.

EDIT: This approach is useful for unsigned numbers only and the numbers starting from 1.

    std::string row = "1,1,5,1,3,7,2,1,8,9,1,2";
    const unsigned size = row.size();
    int* arr = new int[size];
    memset(arr, 0, size*sizeof(int));
    for (int i = 0; i < size; i++)
    {
        if (row[i] != ',')
        {
            int val = row[i] - '0';
            arr[val - 1]++;
        }
    }

    for (int i = 0; i < size; i++)
        std::cout << i + 1 << "-->" << arr[i] << std::endl;
Jagannath
A: 

Generic C++ solution:

#include <algorithm>
#include <iterator>
#include <map>
#include <utility>

template<class T, class U>
struct less_second
{
    bool operator()(const std::pair<T, U>& x, const std::pair<T, U>& y)
    {
        return x.second < y.second;
    }
};

template<class Iterator>
std::pair<typename std::iterator_traits<Iterator>::value_type, int>
most_frequent(Iterator begin, Iterator end)
{
    typedef typename std::iterator_traits<Iterator>::value_type vt;
    std::map<vt, int> frequency;
    for (; begin != end; ++begin) ++frequency[*begin];
    return *std::max_element(frequency.begin(), frequency.end(),
                             less_second<vt, int>());
}

#include <iostream>

int main()
{
    int array[] = {1, 1, 5, 1, 3, 7, 2, 1, 8, 9, 1, 2};
    std::pair<int, int> result = most_frequent(array, array + 12);
    std::cout << result.first << " appears " << result.second << " times.\n";
}

Haskell solution:

import qualified Data.Map as Map
import Data.List (maximumBy)
import Data.Function (on)

count = foldl step Map.empty where
    step frequency x = Map.alter next x frequency
    next  Nothing    = Just 1
    next (Just n)    = Just (n+1)

most_frequent = maximumBy (compare `on` snd) . Map.toList . count

example = most_frequent [1, 1, 5, 1, 3, 7, 2, 1, 8, 9, 1, 2]

Shorter Haskell solution, with help from stack overflow:

import qualified Data.Map as Map
import Data.List (maximumBy)
import Data.Function (on)

most_frequent = maximumBy (compare `on` snd) . Map.toList .
                Map.fromListWith (+) . flip zip (repeat 1)

example = most_frequent [1, 1, 5, 1, 3, 7, 2, 1, 8, 9, 1, 2]
Fred
A: 

Since this is homework I think it's OK to supply a solution in a different language.

In Smalltalk something like the following would be a good starting point:

SequenceableCollection>>mode
  | aBag maxCount mode |

  aBag := Bag new
            addAll: self;
            yourself.
  aBag valuesAndCountsDo: [ :val :count |
    (maxCount isNil or: [ count > maxCount ])
      ifTrue: [ mode := val.
                maxCount := count ]].

  ^mode
Bob Jarvis