Hi,

I need a fast and simple algorithm that finds the duplicate numbers in an array and also tells me how many times each duplicate occurs.

E.g.: if the array is {2,3,4,5,2,4,6,2,4,7,3,8,2},

I should be able to tell that there are four 2's, two 3's, and three 4's.

Thanks in advance.

+5  A: 

hash tables help :p

UsAaR33
Or in the case of the example array, a bucket sort. :)
Pascal Cuoq
A: 

Depends what language!

http://php.net/manual/en/function.array-intersect.php

Rimian
I need it in C.
Raviteja
+2  A: 

This can be solved elegantly using Linq:

using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    public static void Main(string[] args)
    {
        List<int> list = new List<int> { 2, 3, 4, 5, 2, 4, 6, 2, 4, 7, 3, 8, 2 };

        // Group equal values together, then project each group to (value, count).
        var grouping = list
            .GroupBy(x => x)
            .Select(x => new { Item = x.Key, Count = x.Count() });

        foreach (var item in grouping)
            Console.WriteLine("Item {0} has count {1}", item.Item, item.Count);
    }
}

Internally it probably uses hashing to partition the list, but the code hides those details: we only tell it what to calculate, and the compiler / runtime is free to choose how to calculate it and to optimize as it sees fit. Thanks to Linq, this same code will run efficiently whether the list is in memory or in a database. In real code you should use this, but I guess you want to know how it works internally.

A more imperative approach that demonstrates the actual algorithm is as follows:

    List<int> list = new List<int> { 2, 3, 4, 5, 2, 4, 6, 2, 4, 7, 3, 8, 2 };

    Dictionary<int, int> counts = new Dictionary<int, int>();

    // One pass over the list: start the count at 1 the first time an item
    // is seen, and increment it on every later occurrence.
    foreach (int item in list)
    {
        if (!counts.ContainsKey(item))
        {
            counts[item] = 1;
        }
        else
        {
            counts[item]++;
        }
    }

    foreach (KeyValuePair<int, int> item in counts)
        Console.WriteLine("Item {0} has count {1}", item.Key, item.Value);

Here you can see that we iterate over the list only once, keeping a count for each item we see on the way. This would be a bad idea if the items were in a database though, so for real code, prefer to use the Linq method.

Mark Byers
He says that he needs the algorithm in C
the_drow
The question now says C as the language.
cletus
OK thanks. C doesn't have Linq, so you must use the second method.
Mark Byers
I will leave translating it to C as an exercise for the reader. :)
Mark Byers
A: 

Using hash tables / associative arrays / dictionaries (all the same thing but the terminology changes between programming environments) is the way to go.

As an example in python:

numberList = [1, 2, 3, 2, 1, ...]
countDict = {}
for value in numberList:
    countDict[value] = countDict.get(value, 0) + 1

# Now countDict maps each value to its count

Similar constructions exist in most programming languages.

gab
+2  A: 

Make a hash table where the key is the array item and the value is a counter of how many times that item has occurred in the array. This is an efficient way to do it, but probably not the fastest.

Something like this (in pseudocode); you will find plenty of hash map implementations for C by googling.

 hash_map = create_new_hash_map()
 for item in array {
   if hash_map.contains_key(item){
      counter = hash_map.get(item)
   } else {
      counter = 0
   }
   counter = counter + 1
   hash_map.put(item, counter)
 }
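
For illustration, a minimal sketch of such a hash map in C; the fixed table size, the modulo hash, and the names entry and lookup are arbitrary choices made up for this example, not part of any particular library:

#include <stdio.h>
#include <stdlib.h>

#define TABLE_SIZE 101              /* arbitrary prime, plenty for a demo */

struct entry {
    int key;                        /* the array value */
    int count;                      /* how many times it has been seen */
    struct entry *next;             /* collisions are chained */
};

static struct entry *table[TABLE_SIZE];

/* Find the entry for key, creating it with count 0 if it is new. */
static struct entry *lookup(int key)
{
    unsigned h = (unsigned)key % TABLE_SIZE;
    struct entry *e;

    for (e = table[h]; e != NULL; e = e->next)
        if (e->key == key)
            return e;

    e = malloc(sizeof *e);
    if (e == NULL)
        exit(EXIT_FAILURE);         /* out of memory */
    e->key = key;
    e->count = 0;
    e->next = table[h];
    table[h] = e;
    return e;
}

int main(void)
{
    int array[] = { 2, 3, 4, 5, 2, 4, 6, 2, 4, 7, 3, 8, 2 };
    size_t n = sizeof array / sizeof *array;
    size_t i;

    for (i = 0; i < n; i++)
        lookup(array[i])->count++;

    for (i = 0; i < TABLE_SIZE; i++) {
        struct entry *e;
        for (e = table[i]; e != NULL; e = e->next)
            if (e->count > 1)
                printf("%d occurs %d times\n", e->key, e->count);
    }
    return 0;
}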
Juha Syrjälä
+2  A: 

If you know the lower and upper bounds, and they are not too far apart, this would be a good place to use a Radix Sort. Since this smells of homework, I'm leaving it to the OP to read up on the algorithm and implement it.

Stephen C
+1  A: 

If you don't want to use a hash table or something like that, just sort the array and then count the number of occurrences; something like the code below should work:

 Arrays.sort(array);
 int lastOne = array[0];
 int count = 0;
 for (int i = 0; i < array.length; i++)
 {
  if (array[i] == lastOne)
   count++;
  else
  {
   System.out.println(lastOne + " has " + count + " occurrences");
   lastOne = array[i];
   count = 1;
  }
 }
 System.out.println(lastOne + " has " + count + " occurrences"); // final run
erdemoo
+1 for a simple idea: in the case where you can modify the array, it does not need much extra memory, unlike the hashmap approach when elements don't repeat frequently.
RnR
+2  A: 

Hi

The more you tell us about the input arrays, the faster we can make the algorithm. For example, for your example of single-digit numbers, creating an array of 10 counters (indexed 0:9) and incrementing the counter whose index matches each number you read is likely to be faster than hashing. (I say likely to be faster because I haven't done any measurements and won't.)
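
A minimal sketch of that counting idea, assuming every value is known to lie in 0..9 (the array contents are just the example from the question):

#include <stdio.h>

int main(void)
{
    int array[] = { 2, 3, 4, 5, 2, 4, 6, 2, 4, 7, 3, 8, 2 };
    int n = (int)(sizeof array / sizeof *array);
    int counts[10] = { 0 };         /* one counter per possible value 0..9 */
    int i;

    for (i = 0; i < n; i++)
        counts[array[i]]++;         /* each value indexes its own counter */

    for (i = 0; i < 10; i++)
        if (counts[i] > 1)
            printf("%d occurs %d times\n", i, counts[i]);
    return 0;
}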

I agree with most respondents that hashing is probably the right approach for the most general case, but it's always worth thinking about whether yours is a special case.

Regards

Mark

High Performance Mark
A: 

If the range of the numbers is known and small, you could use an array to keep track of how many times you've seen each one (this is a bucket sort in essence). If the range is big, you can sort the array and then count duplicates, since equal values will end up next to each other.

rmn
+1  A: 

here's a C version that reads its numbers from the command line; it runs in time linear in the number of inputs (beware, the number of parameters on the command line is limited...) but should give you an idea on how to proceed:

#include <stdio.h>
#include <stdlib.h> /* for atoi() */

int main ( int argc, char **argv ) {
    int dups[10] = { 0 };
    int i;

    for ( i = 1 ; i < argc ; i++ ) 
     dups[atoi(argv[i])]++;

    for ( i = 0 ; i < 10 ; i++ )
     printf("%d: %d\n", i, dups[i]);

    return 0;
}

example usage:

    $ gcc -o dups dups.c

    $ ./dups 0 0 3 4 5
0: 2
1: 0
2: 0
3: 1
4: 1
5: 1
6: 0
7: 0
8: 0
9: 0

caveats:

  • if you also plan to count the number of 10s, 11s, and so on, the dups[] array must be bigger

  • left as an exercise: reading from an array of integers and determining their positions

lorenzog
please don't post inherently unsafe example code as it will be read by people without much programming experience; the least you should do is parametrize the max value (eg `#define MAX_VALUE 10`) and check that the input is `>= 0` and `< MAX_VALUE` to avoid buffer overflows; for example code, a simple `assert()` would be enough; using `strtoul()` to properly validate the user input would be a bonus
Christoph
@Christoph, my intent was to have him do the checks himself, as well as think about how to count more than 10 distinct values. The exercise smells too much of homework to give a full-featured solution.
lorenzog
A: 

The code first sorts the array and then moves the unique elements to the front, keeping track of how many times each element occurred. It's slower than using a bucket sort, but more convenient.

#include <stdio.h>
#include <stdlib.h>

static int cmpi(const void *p1, const void *p2)
{
    int i1 = *(const int *)p1;
    int i2 = *(const int *)p2;
    return (i1 > i2) - (i1 < i2);
}

size_t make_unique(int values[], size_t count, size_t *occ_nums)
{
    if(!count) return 0;

    qsort(values, count, sizeof *values, cmpi);

    size_t top = 0;
    int prev_value = values[0];
    if(occ_nums) occ_nums[0] = 1;

    size_t i = 1;
    for(; i < count; ++i)
    {
        if(values[i] != prev_value)
        {
            ++top;
            values[top] = prev_value = values[i];
            if(occ_nums) occ_nums[top] = 1;
        }
        else if(occ_nums) ++occ_nums[top]; /* guard against a NULL occ_nums */
    }

    return top + 1;
}

int main(void)
{
    int values[] = { 2, 3, 4, 5, 2, 4, 6, 2, 4, 7, 3, 8, 2 };

    size_t occ_nums[sizeof values / sizeof *values];
    size_t unique_count = make_unique(
        values, sizeof values / sizeof *values, occ_nums);

    size_t i = 0;
    for(; i < unique_count; ++i)
    {
        printf("number %i occurred %u time%s\n",
            values[i], (unsigned)occ_nums[i], occ_nums[i] > 1 ? "s": "");
    }
}
Christoph
A: 

option 1: hash it. option 2: sort it and then count consecutive runs.

Southern Hospitality
A: 

There is an "algorithm" that I use all the time to find duplicate lines in a file in Unix:

sort file | uniq -d

If you implement the same strategy in C, then it is very difficult to beat it with a fancier strategy such as hash tables. Call a sorting algorithm, and then call your own function to detect (and count) duplicates in the sorted list; in the shell pipeline, uniq -c prints each line together with its count, which is what the OP asked for. The sorting algorithm takes O(n*log(n)) time and the uniq-style pass takes linear time. (Southern Hospitality makes a similar point, but I want to emphasize that what he calls "option 2" seems both simpler and faster than the more popular hash tables suggestion.)

Greg Kuperberg