ansaurus

Question

Removing Duplicates in an array in C

Answer 1

A:

Traverse through the items in the array - O(n) operation
For each item, add it to another sorted-array
Before adding it to the sorted array, check if the entry already exists - O(log n) operation

Finally, O(n log n) operation

MasterGaurav 2010-05-13 11:08:40

According to the OP, the new array needs to maintain the original sorting, though.

Nick Meyer 2010-05-13 11:13:36

You can replace the sorted array of steps 2. and 3. by a hashset, and you get amortized O(n) for the whole operation. This assumes that you have a hash function over the elements to de-dupe, but we were already assuming that we had a total order, so...

Pascal Cuoq 2010-05-13 11:15:08

@Nick perhaps MasterGaurav didn't explain it well enough, but ey is clearly thinking of an algorithm that preserves the order from the original array (duplicated elements are represented in the result array in the position of their first occurrence in the original array)

Pascal Cuoq 2010-05-13 11:17:12

@Pascal Doesn't seem clear from the current wording. I'm a little fuzzy on how you can check for an existing element in `O(log n)` time without having the array sorted. Unless we're talking about a third, temporary array?

Nick Meyer 2010-05-13 11:20:13

@Nick I'm pretty sure it's a third structure (perhaps not an array at all). Think of it as a hashset, then it won't be confused with either the source or the result array :)

Pascal Cuoq 2010-05-13 11:23:03

@Pascal, in that case, the description is missing a step -- once everything is in the hash table, you still need to loop over the input array once more to copy from input to output. The lack of that step in the description is I think what was causing the confusion.

Nick Meyer 2010-05-13 12:48:12

@Nick: with a hash table, you only have to iterate over the source array once. For each element in the source array, search the hash table. If it's not in the table, add it to the destination array and the hash table. Done. The trick here is creating the hash table and a suitable hashing function, which frankly may be overkill if we're talking about 100 elements max.

John Bode 2010-05-13 14:23:08

Answer 2

A:

i think that in C you can create a second array. then you copy the element from the original array only if this element is not already in the send array. this also preserve the order of the element.

if you read the element one by one you can discard the element before insert in the original array, this could speedup the process.

Alex 2010-05-13 11:10:56

Can you explain how do you speed up here? Search costs remain the name.. best O(log n) for sorted

MasterGaurav 2010-05-13 11:20:22

guys guys :). The problem here is getting the job done actually not the complexity. So try to focus on that pls.

LuckySlevin 2010-05-13 11:21:36

Answer 3

A:

For those who are willing to answer the question I wanted to add some specific information.

char** finduni(char *words[100],int limit)
{
//
//Methods here
//
}

is the my function. The array whose duplicates should be removed and stored in a different array is words[100]. So, the process will be done on this. I firstly thought about getting all the elements of words into another array and sort that array but that doesn't work after some tests. Just a reminder for solvers :).

LuckySlevin 2010-05-13 11:17:08

For future reference, you can edit the original question and put this information there. Visitors are more likely to catch that there. I'll edit it for you to start.

Nick Meyer 2010-05-13 11:21:49

Answer 4

A:

As Thomas suggested in a comment, if each element of the array is guaranteed to be from a limited set of values (such as a char) you can achieve this in O(n) time.

Keep an array of 256 bool (or int if your compiler doesn't support bool) or however many different discrete values could possibly be in the array. Initialize all the values to false.
Scan the input array one-by-one.
For each element, if the corresponding value in the bool array is false, add it to the output array and set the bool array value to true. Otherwise, do nothing.

Nick Meyer 2010-05-13 11:17:20

the problem is array isn't a char array it is a string array.

LuckySlevin 2010-05-13 11:20:10

Yep, I see you've added that now.

Nick Meyer 2010-05-13 11:22:51

Answer 5

+2 A:

Well, here is a version for char types. Note it doesn't scale.

#include "stdio.h"
#include "string.h"

void removeDuplicates(unsigned char *string)
{
   unsigned char allCharacters [256] = { 0 };
   int lookAt;
   int writeTo = 0;
   for(lookAt = 0; lookAt < strlen(string); lookAt++)
   {
      if(allCharacters[ string[lookAt] ] == 0)
      {
         allCharacters[ string[lookAt] ] = 1;  // mark it seen
         string[writeTo++] = string[lookAt];     // copy it
      }
   }
   string[writeTo] = '\0';
}

int main()
{
   char word[] = "abbbcdefbbbghasdddaiouasdf";
   removeDuplicates(word);
   printf("Word is now [%s]\n", word);
   return 0;
}

The following is the output:

Word is now [abcdefghsiou]

Is that something like what you want? You can modify the method if there are spaces between the letters, but if you use int, float, double or char * as the types, this method won't scale at all.

EDIT

I posted and then saw your clarification, where it's an array of char *. I'll update the method.

I hope this isn't too much code. I adapted this QuickSort algorithm and basically added index memory to it. The algorithm is O(n log n), as the 3 steps below are additive and that is the worst case complexity of 2 of them.

Sort the array of strings, but every swap should be reflected in the index array as well. After this stage, the i'th element of originalIndices holds the original index of the i'th element of the sorted array.
Remove duplicate elements in the sorted array by setting them to NULL, and setting the index value to elements, which is the highest any can be.
Sort the array of original indices, and make sure every swap is reflected in the array of strings. This gives us back the original array of strings, except the duplicates are at the end and they are all NULL.
For good measure, I return the new count of elements.

Code:

#include "stdio.h"
#include "string.h"
#include "stdlib.h"

void sortArrayAndSetCriteria(char **arr, int elements, int *originalIndices)
{
   #define  MAX_LEVELS  1000
   char *piv;
   int  beg[MAX_LEVELS], end[MAX_LEVELS], i=0, L, R;
   int idx, cidx;
   for(idx = 0; idx < elements; idx++)
      originalIndices[idx] = idx;
   beg[0] = 0;
   end[0] = elements;
   while (i>=0)
   {
      L = beg[i];
      R = end[i] - 1;
      if (L<R)
      {
         piv = arr[L];
         cidx = originalIndices[L];
         if (i==MAX_LEVELS-1)
            return;
         while (L < R)
         {
            while (strcmp(arr[R], piv) >= 0 && L < R) R--;
            if (L < R)
            {
               arr[L] = arr[R];
               originalIndices[L++] = originalIndices[R];
            }
            while (strcmp(arr[L], piv) <= 0 && L < R) L++;
            if (L < R)
            {
               arr[R] = arr[L];
               originalIndices[R--] = originalIndices[L];
            }
         }
         arr[L] = piv;
         originalIndices[L] = cidx;
         beg[i + 1] = L + 1;
         end[i + 1] = end[i];
         end[i++] = L;
      }
      else
      {
         i--;
      }
   }
}

int removeDuplicatesFromBoth(char **arr, int elements, int *originalIndices)
{
   // now remove duplicates
   int i = 1, newLimit = 1;
   char *curr = arr[0];
   while (i < elements)
   {
      if(strcmp(curr, arr[i]) == 0)
      {
         arr[i] = NULL;   // free this if it was malloc'd
         originalIndices[i] = elements;  // place it at the end
      }
      else
      {
         curr = arr[i];
         newLimit++;
      }
      i++;
   }
   return newLimit;
}

void sortArrayBasedOnCriteria(char **arr, int elements, int *originalIndices)
{
   #define  MAX_LEVELS  1000
   int piv;
   int beg[MAX_LEVELS], end[MAX_LEVELS], i=0, L, R;
   int idx;
   char *cidx;
   beg[0] = 0;
   end[0] = elements;
   while (i>=0)
   {
      L = beg[i];
      R = end[i] - 1;
      if (L<R)
      {
         piv = originalIndices[L];
         cidx = arr[L];
         if (i==MAX_LEVELS-1)
            return;
         while (L < R)
         {
            while (originalIndices[R] >= piv && L < R) R--;
            if (L < R)
            {
               arr[L] = arr[R];
               originalIndices[L++] = originalIndices[R];
            }
            while (originalIndices[L] <= piv && L < R) L++;
            if (L < R)
            {
               arr[R] = arr[L];
               originalIndices[R--] = originalIndices[L];
            }
         }
         arr[L] = cidx;
         originalIndices[L] = piv;
         beg[i + 1] = L + 1;
         end[i + 1] = end[i];
         end[i++] = L;
      }
      else
      {
         i--;
      }
   }
}

int removeDuplicateStrings(char *words[], int limit)
{
   int *indices = (int *)malloc(limit * sizeof(int));
   int newLimit;
   sortArrayAndSetCriteria(words, limit, indices);
   newLimit = removeDuplicatesFromBoth(words, limit, indices);
   sortArrayBasedOnCriteria(words, limit, indices);
   free(indices);
   return newLimit;
}

int main()
{
   char *words[] = { "abc", "def", "bad", "hello", "captain", "def", "abc", "goodbye" };
   int newLimit = removeDuplicateStrings(words, 8);
   int i = 0;
   for(i = 0; i < newLimit; i++) printf(" Word @ %d = %s\n", i, words[i]);
   return 0;
}

Phil 2010-05-13 11:24:08

read the first post dude.

LuckySlevin 2010-05-13 11:25:25

Thanks so much. Actually it would be enough too that you gave me the idea and i could code the idea. Thanks for your efforts. I will try this in no time.

LuckySlevin 2010-05-13 13:05:04

No problem. It's been a while since I used C so there may be corner cases in there, and it's not the most DRY code out there. Hope it helps!

Phil 2010-05-13 13:32:37

Answer 6

A:

You know how to do it for char type, right? You can do same thing with strings, but instead of using array of bools (which is technically an implementation of "set" object), you'll have to simulate the "set"(or array of bools) with a linear array of strings you already encountered. I.e. you have an array of strings you already saw, for each new string you check if it is in array of "seen" strings, if it is, then you ignore it (not unique), if it is not in array, you add it to both array of seen strings and output. If you have a small number of different strings (below 1000), you could ignore performance optimizations, and simply compare each new string with everything you already saw before.

With large number of strings (few thousands), however, you'll need to optimize things a bit:

1) Every time you add a new string to an array of strings you already saw, sort the array with insertion sort algorithm. Don't use quickSort, because insertion sort tends to be faster when data is almost sorted.

2) When checking if string is in array, use binary search.

If number of different strings is reasonable (i.e. you don't have billions of unique strings), this approach should be fast enough.

SigTerm 2010-05-13 13:06:06

ansaurus

tags:

views:

answers:

Removing Duplicates in an array in C

related questions