I got this problem from an interview with Microsoft.

Given an array of random integers, write an algorithm in C that removes duplicate numbers and returns the unique numbers in the original array.

E.g. Input: {4, 8, 4, 1, 1, 2, 9} Output: {4, 8, 1, 2, 9, ?, ?}

One caveat is that the expected algorithm should not require the array to be sorted first, and when an element is removed, the following elements must be shifted forward as well. The values of the elements at the tail of the array, left over after the shifting, are negligible.

Update: The result must be returned in the original array, and helper data structures (e.g. a hashtable) should not be used. However, I guess order preservation is not necessary.

Update2: For those wondering why anyone would pose such impractical constraints: this was an interview question, and the constraints were introduced during the discussion to see how I could come up with different ideas.

+11  A: 

If you are looking for the best O-notation, then sorting the array with an O(n log n) sort and then doing an O(n) traversal may be the best route. Without sorting, you are looking at O(n^2).

Edit: if you are just doing integers, then you can also do a radix sort to get O(n).
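
For concreteness, here is a minimal C sketch of that route (a hypothetical dedup_sorted helper; it assumes the standard library's qsort is allowed and that order need not be preserved):

#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);          /* avoids overflow of x - y */
}

/* Sorts, then compacts unique values to the front; returns the new length. */
int dedup_sorted(int *array, int length)
{
    if (length == 0)
        return 0;
    qsort(array, length, sizeof *array, cmp_int);

    int write = 1;                     /* the first element is always kept */
    for (int read = 1; read < length; read++)
        if (array[read] != array[write - 1])
            array[write++] = array[read];
    return write;                      /* tail values are negligible */
}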

carl
Jeff B's answer is merely O(n). Hash-sets and hash-dictionaries are the bee's knees.
ChrisW
ChrisW: hash sets/dictionaries are only O(1) if you assume no collisions. (I'm not saying I wouldn't use them for this problem -- I probably would -- it's just a fallacy to claim that they're truly O(1).)
Laurence Gonsalves
Actually, since you know the size of the array beforehand, you can guarantee O(1). Then you can trade off collisions vs. how much additional memory you use.
Vitali
You might want to rethink that downvote - newly posted conditions to the problem make Jeff B's solution invalid.
Mark Ransom
You might want to elaborate on "traversal", since a naive erasure method might result in O(n^2) for large numbers of duplicates.
Mark Ransom
Jeff's is still fine, you can do the comparison and motion in place. Take a look at my (sadly ignored) fully correct solution below.
Andy Ross
A: 

An array should obviously be "traversed" right-to-left to avoid unnecessary copying of values back and forth.

If you have unlimited memory, you can allocate a bit array of 2^(8 * sizeof(type-of-element-in-array)) bits (one bit for every possible value, i.e. 2^(8 * sizeof(type)) / 8 bytes) to have each bit signify whether you've already encountered the corresponding value.

If you don't, I can't think of anything better than traversing the array and comparing each value with the values that follow it, then removing the duplicates when they are found. This is somewhere near O(n^2) (or O((n^2-n)/2)).

IBM has an article on a closely related subject.

Anton Gogolev
Indeed - an O(n) pass to find the largest element wouldn't increase the overall O() cost.
Douglas Leeder
+5  A: 

Well, its basic implementation is quite simple: go through all the elements, check whether each one has duplicates among the remaining ones, and shift the rest over them.

It's terribly inefficient, and you could speed it up with a helper array for the output or with sorting/binary trees, but that doesn't seem to be allowed.

Dario
OTOH, the additional code required to implement a sorting tree might be less (memory) efficient than the simple solution, and is probably less efficient at run-time for small (say fewer than 100 elements) arrays.
TMN
+6  A: 

You could do this in a single traversal if you are willing to sacrifice memory. You can simply tally whether you have seen an integer in a hash/associative array. If you have already seen a number, remove it as you go; or, better yet, move the numbers you have not seen into a new array, avoiding any shifting in the original array.

In Perl:

foreach $i (@myary) {
    if(!defined $seen{$i}) {
        $seen{$i} = 1;
        push @newary, $i;
    }
}
Jeff B
It's not clear if the answer has to be in the original array.
Douglas Leeder
To do this without requiring a new array, you could simply replace each duplicate with an element popped off the end of the array and redo the current loop iteration, as the problem does not specify that order matters. This requires some extra bounds checking, but is very doable.
Jeff B
This was a good idea, until the question was edited. Your hashtable idea is apparently against the rules.
WCWedin
I don't get why this answer gets voted the most. It's written in Perl and uses vital features not available in C, which is what the question asks for.
LiraNuna
The question asked for C code, not Perl. Using Perl gets you hashtables and "push" for free. If I could do it in Scala you would just call input.removeDuplicates, but I doubt that would have been acceptable to the interviewers :)
Peter Recore
+2  A: 

Let's see:

  • O(N) pass to find min/max
  • allocate a bit-array for "found"
  • O(N) pass swapping duplicates to the end (sketched below)
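
A hypothetical C sketch of those three steps (it compacts the unique values to the front, which is equivalent to swapping duplicates to the end, and assumes one bit per value in the min..max range fits in memory):

#include <stdlib.h>

int dedup_bitarray(int *array, int length)
{
    if (length == 0)
        return 0;

    int min = array[0], max = array[0];
    for (int i = 1; i < length; i++) {            /* O(N) min/max pass */
        if (array[i] < min) min = array[i];
        if (array[i] > max) max = array[i];
    }

    size_t range = (size_t)((long long)max - min) + 1;
    unsigned char *found = calloc((range + 7) / 8, 1);
    if (!found)
        return -1;                                /* out of memory */

    int write = 0;
    for (int read = 0; read < length; read++) {   /* O(N) dedup pass */
        size_t bit = (size_t)((long long)array[read] - min);
        if (!(found[bit / 8] & (1u << (bit % 8)))) {
            found[bit / 8] |= (unsigned char)(1u << (bit % 8));
            array[write++] = array[read];         /* first occurrence */
        }
    }
    free(found);
    return write;    /* duplicates now occupy only the negligible tail */
}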
Douglas Leeder
Given that they're only integers, for simplicity you could assume 32bit and not bother looking for min/max: 2^32 bits is "only" 512MB, so finding the bounds is just a memory-use and O(1) time optimisation (granted, a hefty optimisation in the case of the example given). And if they're 64bit, it's irrelevant since you don't know that the min and max won't be further apart than the number of bits of memory you have.
Steve Jessop
Theory aside, wouldn't allocating 512MB take more time than finding the min/max?
LiraNuna
Depends how much data there is, and what the min/max are. If you're looking at more than 512MB of input, then quite possibly it's faster to avoid that extra O(N) pass. Of course if you're looking at that much input, then it's less likely you have 512MB to spare. In cases where the min/max are close to 0/INT_MAX, then the optimisation doesn't help either. I'm just saying that although the first step obviously helps for small numbers, it can't avoid the fact that this algorithm uses UINT_MAX bits in the worst case, so you need to plan for that limitation.
Steve Jessop
You may well be right - in any case clarification of the question means that using a bit-array is out. I'll leave this answer in case someone comes along later without the constraints and wants to view all possible answers.
Douglas Leeder
+2  A: 

If you are allowed to use C++, a call to std::sort followed by a call to std::unique will give you the answer. The time complexity is O(N log N) for the sort and O(N) for the unique traversal.

And if C++ is off the table there isn't anything that keeps these same algorithms from being written in C.

fbrereto
"One caveat is that the expected algorithm should not required the array to be sorted first."
sbi
It doesn't say you can't sort the array once you get it... Without using O(N) external memory, sorting is the only way to do it in O(N log N) or better.
Greg Rogers
For the purposes of the problem, standard library utils should not be used. As for sorting, the more I think of it, the more unsure I am whether it is OK or not.
ejel
I think answers refering to C++ and C++ standard functions are useful, even if they don't answer the original question, as they provide a more rounded answer to people who find this question later.
Douglas Leeder
A: 

It'd be cool if you had a good DataStructure that could quickly tell if it contains an integer. Perhaps a tree of some sort.

DataStructure elementsSeen = new DataStructure();
int elementsRemoved = 0;
for (int i = 0; i < array.Length; i++) {
    if (elementsSeen.Contains(array[i]))
        elementsRemoved++;
    else
        array[i - elementsRemoved] = array[i];
    elementsSeen.Add(array[i]);   // remember every value we pass
}
array.Length = array.Length - elementsRemoved;
Mike Blandford
A: 

This can be done in a single pass, in O(N) time in the number of integers in the input list, and O(N) storage in the number of unique integers.

Walk through the list from front to back, with two pointers "dst" and "src" initialized to the first item. Start with an empty hash table of "integers seen". If the integer at src is not present in the hash, write it to the slot at dst and increment dst. Add the integer at src to the hash, then increment src. Repeat until src passes the end of the input list.
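
Since this answer predates the no-hash-table update, here is what the dst/src walk might look like in C with a simple open-addressing hash set (hypothetical helper names; assumes O(N) extra memory is acceptable):

#include <stdbool.h>
#include <stdlib.h>

/* Returns true if value was newly inserted, false if already present. */
static bool seen_insert(int *table, bool *used, size_t cap, int value)
{
    size_t h = (size_t)((unsigned)value * 2654435761u) % cap;
    while (used[h]) {
        if (table[h] == value)
            return false;             /* already seen */
        h = (h + 1) % cap;            /* linear probing */
    }
    used[h] = true;
    table[h] = value;
    return true;
}

int dedup_hash(int *array, int length)
{
    size_t cap = (size_t)length * 2 + 1;     /* load factor <= 0.5 */
    int  *table = malloc(cap * sizeof *table);
    bool *used  = calloc(cap, sizeof *used);
    if (!table || !used) { free(table); free(used); return -1; }

    int dst = 0;
    for (int src = 0; src < length; src++)   /* single pass */
        if (seen_insert(table, used, cap, array[src]))
            array[dst++] = array[src];       /* keep first occurrence */

    free(table);
    free(used);
    return dst;                              /* number of unique values */
}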

Andy Ross
In the modification to the original question, hash tables aren't allowed. Your two pointer approach is a nice way to compact the output once you've identified the duplicates, though.
Mark Ransom
A: 

You can sort the array using radix sort, which is O(n), and then do a simple O(n) traversal. So the final time complexity is O(n).

Update: The size of int in C is limited, therefore the length of the value representation is constant.
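
For reference, a sketch of a byte-wise LSD radix sort in C (a hypothetical helper; it assumes unsigned 32-bit values and needs an O(N) scratch buffer, and signed ints would need the top byte handled specially). A dedup traversal like the ones above then runs over the sorted result:

#include <stdlib.h>
#include <string.h>

void radix_sort_u32(unsigned *a, size_t n)
{
    unsigned *tmp = malloc(n * sizeof *tmp);
    if (!tmp)
        return;

    for (int shift = 0; shift < 32; shift += 8) {
        size_t count[256] = {0};
        for (size_t i = 0; i < n; i++)        /* histogram this byte */
            count[(a[i] >> shift) & 0xFF]++;

        size_t pos = 0;                       /* prefix sums -> offsets */
        for (int b = 0; b < 256; b++) {
            size_t c = count[b];
            count[b] = pos;
            pos += c;
        }

        for (size_t i = 0; i < n; i++)        /* stable scatter */
            tmp[count[(a[i] >> shift) & 0xFF]++] = a[i];
        memcpy(a, tmp, n * sizeof *a);
    }
    free(tmp);
}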

Viliam
Radix sort isn't really O(N), you got fooled by someone. The number of buckets is proportional to the length of the input numbers. The length of a number scales as the logarithm of the maximum representable number. So if M is the maximum number of possible unique sortable items (a practical upper bound for any "N" used in a sorting analysis -- sorting large numbers of duplicates using a comparison sort is inappropriate), radix sort is O(M), and thus O(N).
Andy Ross
Sorry, should read O(M log M) and O(N log N) on the last line. Comments aren't editable?
Andy Ross
@Andy: However, you can copy and paste them into a new comment and delete the old one.
sbi
@Andy: Read the question. It's an array of integers in C. Since int in C has a limited size, the maximum representable number is constant.
Viliam
Again, that's a poor analysis. You're substituting the algorithmic complexity of "radix sort on 32 bit integers" for "radix sort" in general and rolling what would otherwise be a log(N) dependency into a "constant factor" by cheating. This is cooking the books, basically. There is no algorithmic benefit to radix sort, period.
Andy Ross
+7  A: 

How about:

void rmdup(int *array, int length)
{
    int *current, *end = array + length - 1;

    for ( current = array + 1; array < end; array++, current = array + 1 )
    {
        while ( current < end )
        {
            if ( *current == *array )
            {
                *current = *end--;
            }
            else
            {
                current++;
            }
        }
    }
}

Should be O(n^2) or less.

mocj
This is the simple solution and is more than likely what the interview question is looking for.
Kirk Broadhurst
They might even be checking to see that you don't suffer from indulging in premature optimization unless they've given you runtime constraints too! :-)
Trevor Tippins
This solution does not need pre-sorting and is what I was looking for.
ejel
Lol, though it's definitely faster to sort the array and work on the sorted one. Sorting should be provided by an API and is IMHO no premature optimization.
ziggystar
+17  A: 

A solution suggested by my girlfriend is a variation of merge sort. The only modification is that during the merge step, duplicated values are simply disregarded. This solution would also be O(n log n). In this approach, the sorting and the duplicate removal are combined. However, I'm not sure if that makes any difference.
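
In case it helps later readers, here is a rough C sketch of that idea (hypothetical helpers; it uses an O(N) scratch buffer, so it is a standard merge rather than an in-place one, and each call returns the deduplicated length, which is the bookkeeping mentioned in the comments below):

#include <string.h>

/* Merges two sorted, individually deduplicated runs, dropping
   duplicates on the fly; returns the merged length. */
static int merge_unique(const int *a, int la, const int *b, int lb, int *out)
{
    int i = 0, j = 0, n = 0;
    while (i < la || j < lb) {
        int v;
        if (j >= lb || (i < la && a[i] <= b[j]))
            v = a[i++];
        else
            v = b[j++];
        if (n == 0 || out[n - 1] != v)   /* disregard duplicated values */
            out[n++] = v;
    }
    return n;
}

/* Sorts array[0..length) and returns the number of unique elements,
   which end up sorted at the front. scratch must hold length ints. */
int msort_unique(int *array, int length, int *scratch)
{
    if (length < 2)
        return length;
    int half = length / 2;
    int l1 = msort_unique(array, half, scratch);
    int l2 = msort_unique(array + half, length - half, scratch);
    int n = merge_unique(array, l1, array + half, l2, scratch);
    memcpy(array, scratch, n * sizeof *array);
    return n;
}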

ejel
Great suggestion, but you'll need some bookkeeping to keep track of the end of each merge output. I actually did this once, and yes eliminating the duplicates as you merge makes it much faster.
Mark Ransom
P.S. Marry that girl!
Mark Ransom
It's not clear whether O(N/2) extra space counts as the "helper data structure" banned in the question - I don't know whether the restriction is intended to stipulate O(1) extra space, or just to stipulate that the answer should not depend on a big ol' data structure implementation. Maybe a standard merge is fine. But if not, top tip: do not attempt to write an in-place merge sort in an interview, unless you *really* know what you're doing.
Steve Jessop
The rep you earned from this answer doesn't belong to you. :P
GMan
+5  A: 

A more efficient implementation:

int i, j;

/* new length of the modified array */
int NewLength = 1;

for (i = 1; i < Length; i++) {
    for (j = 0; j < NewLength; j++) {
        if (array[i] == array[j])
            break;
    }

    /* if none of the values in array[0..NewLength-1] matches array[i],
       copy the current value to the next new position in the array */
    if (j == NewLength)
        array[NewLength++] = array[i];
}

In this implementation there is no need to sort the array. Also, if a duplicate element is found, there is no need to shift all the elements after it by one position.

The output of this code is array[] with size NewLength.

Here we start from the 2nd element of the array and compare it with all the elements before it, holding an extra index variable 'NewLength' for modifying the input array. NewLength is initialized to 1, since the first element is trivially unique.

The element array[1] will be compared with array[0]. If they are different, the value at array[NewLength] will be overwritten with array[1] and NewLength incremented. If they are the same, NewLength will not be modified.

So if we have the array [1 2 1 3 1], then:

In the first pass of the 'j' loop, array[1] (2) will be compared with array[0]; they differ, so 2 is written to array[NewLength] (here array[1], which already holds 2) and NewLength becomes 2. The unique prefix is [1 2].

In the second pass of the 'j' loop, array[2] (1) will be compared with array[0] and array[1]. Since array[2] (1) and array[0] are the same, the loop breaks here; the unique prefix stays [1 2] and NewLength stays 2.

and so on

Byju
Nice one. I have a suggestion to improve: the second nested loop can be changed to for(j=0; j < NewLength; j++) and the last if check can be changed to if (j == NewLength).
Vadakkumpadath
That was a great suggestion. I have updated the code based on your comment.
Byju
A: 

Some of the answers written here are pretty trivial (O(n^2), or sorting and traversing in O(N log N)), and I'm assuming that is not what was expected in an interview with Microsoft. Obviously anything above O(n) wasn't what they were looking for. The update states that there shouldn't be any helper data structures, so any answer that uses one (a hash table, tree, bit array or whatever) shouldn't be a valid solution.

If you can allocate additional memory, then Jeff B's answer is probably the easiest way to do it. I have a good answer for questions like these, but MAXINT needs to be bounded by the size of the array. (Example: an array of size 100 may contain any number between 1 and 100. Remove the dups as in the original question.)

The answer to this in O(n) time and O(1) memory is:

// FLAG ALL DUPS IN THE ORIGINAL ARRAY
int maxNumInArray = findMaxNumInArray(arr);   // assumed helper
int dup = findMinNumInArray(arr) - 1;         // a value that cannot occur
for (int i = 0; i < arrLength; ++i) {
    // recover the original value even if this slot was already marked
    int seekIndex = arr[i] % (maxNumInArray + 1);
    if (arr[seekIndex] > maxNumInArray)
        arr[i] = dup; // value seen before: invalidate this slot
    else
        arr[seekIndex] += maxNumInArray + 1; // mark the value as seen
}

// REMOVE EMPTY SPACES
int i = 0;
int j = arrLength - 1;
while (i < j) {
    while (i < j && arr[i] != dup)   // find the next hole from the front
        ++i;
    while (i < j && arr[j] == dup)   // find the next value from the back
        --j;
    if (i < j)
        swap(arr[i], arr[j]);        // assumed swap helper
}

If you don't know the bounds, my answer isn't useful, but you can try and play with it. Oh, and this specific variation won't work with negative numbers, but it's not a problem to fix that.

Itsik
A: 

In Java I would solve it like this. I don't know how to write this in C.

   int length = array.length;
   for (int i = 0; i < length; i++) 
   {
      for (int j = i + 1; j < length; j++) 
      {
         if (array[i] == array[j]) 
         {
            int k, l;
            for (k = j + 1, l = j; k < length; k++, l++) 
            {
               if (array[k] != array[i]) 
               {
                  array[l] = array[k];
               }
               else
               {
                  l--;
               }
            }
            length = l;
         }
      }
   }
Dominik
If you overwrite the duplicates you find with the value at the end of the array you can avoid the shifting of the whole array in your inner for() loop. That will bring you to O(n^2) from O(n^3). My C implementation is floating around here somewhere...
mocj
I thought, shifting was part of the requirement, but you are right of course.
Dominik
@mocj: I like your solution, it looks very elegant. But I think it doesn't work if the last two elements are equal, because you stop checking for equality one before the last. (Commenting here because I have too few reputation points to comment anywhere else :( )
Dominik
You're right, except that the original problem states that the values at the end of the array are negligible. Since you aren't returning the length of the modified array, the distinction between the last value and the second-to-last is unimportant when the two values are equal. Where does the caller interpret the end of the returned array to be?
mocj
A: 

Insert all the elements into a binary tree that disregards duplicates - O(n log n). Then extract all of them back into the array by doing a traversal - O(n). I am assuming that you don't need order preservation.
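
A rough C sketch of this approach (hypothetical helpers; it needs O(N) extra memory for the tree nodes, which the updated constraints may rule out, an unbalanced tree degrades the bound, and the result comes back sorted):

#include <stdlib.h>

struct node { int value; struct node *left, *right; };

/* Inserts value unless it is already present (duplicates are ignored). */
static struct node *insert(struct node *root, int value)
{
    if (!root) {
        struct node *n = malloc(sizeof *n);   /* error handling omitted */
        n->value = value;
        n->left = n->right = NULL;
        return n;
    }
    if (value < root->value)
        root->left = insert(root->left, value);
    else if (value > root->value)
        root->right = insert(root->right, value);
    return root;
}

/* In-order traversal: writes values back sorted and frees the tree. */
static void write_back(struct node *root, int *array, int *pos)
{
    if (!root)
        return;
    write_back(root->left, array, pos);
    array[(*pos)++] = root->value;
    write_back(root->right, array, pos);
    free(root);
}

int dedup_bst(int *array, int length)
{
    struct node *root = NULL;
    for (int i = 0; i < length; i++)
        root = insert(root, array[i]);
    int pos = 0;
    write_back(root, array, &pos);
    return pos;   /* number of unique elements */
}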

Ashwin
+1  A: 

I've posted this once before on SO, but I'll reproduce it here because it's pretty cool. It uses hashing, building something like a hash set in place. It's guaranteed to be O(1) in auxiliary space (the recursion is a tail call), and is typically O(N) in time complexity. The algorithm is as follows:

  1. Take the first element of the array, this will be the sentinel.
  2. Reorder the rest of the array, as much as possible, such that each element is in the position corresponding to its hash. As this step is completed, duplicates will be discovered. Set them equal to sentinel.
  3. Move all elements for which the index is equal to the hash to the beginning of the array.
  4. Move all elements that are equal to sentinel, except the first element of the array, to the end of the array.
  5. What's left between the properly hashed elements and the duplicate elements will be the elements that couldn't be placed in the index corresponding to their hash because of a collision. Recurse to deal with these elements.

This can be shown to be O(N) provided there is no pathological scenario in the hashing: even if there are no duplicates, approximately 2/3 of the elements are eliminated at each recursion. Each level of recursion is O(n), where small n is the number of elements left. The only problem is that, in practice, it's slower than a quicksort when there are few duplicates, i.e. lots of collisions. However, when there are huge numbers of duplicates, it's amazingly fast.

Edit: In current implementations of D, hash_t is 32 bits. Everything about this algorithm assumes that there will be very few, if any, hash collisions in full 32-bit space. Collisions may, however, occur frequently in the modulus space. However, this assumption will in all likelihood be true for any reasonably sized data set. If the key is less than or equal to 32 bits, it can be its own hash, meaning that a collision in full 32-bit space is impossible. If it is larger, you simply can't fit enough of them into 32-bit memory address space for it to be a problem. I assume hash_t will be increased to 64 bits in 64-bit implementations of D, where datasets can be larger. Furthermore, if this ever did prove to be a problem, one could change the hash function at each level of recursion.

Here's an implementation in the D programming language:

void uniqueInPlace(T)(ref T[] dataIn) {
    uniqueInPlaceImpl(dataIn, 0);
}

void uniqueInPlaceImpl(T)(ref T[] dataIn, size_t start) {
    if(dataIn.length - start < 2)
        return;

    invariant T sentinel = dataIn[start];
    T[] data = dataIn[start + 1..$];

    static hash_t getHash(T elem) {
        static if(is(T == uint) || is(T == int)) {
            return cast(hash_t) elem;
        } else static if(__traits(compiles, elem.toHash)) {
            return elem.toHash;
        } else {
            static auto ti = typeid(typeof(elem));
            return ti.getHash(&elem);
        }
    }

    for(size_t index = 0; index < data.length;) {
        if(data[index] == sentinel) {
            index++;
            continue;
        }

        auto hash = getHash(data[index]) % data.length;
        if(index == hash) {
            index++;
            continue;
        }

        if(data[index] == data[hash]) {
            data[index] = sentinel;
            index++;
            continue;
        }

        if(data[hash] == sentinel) {
            swap(data[hash], data[index]);
            index++;
            continue;
        }

        auto hashHash = getHash(data[hash]) % data.length;
        if(hashHash != hash) {
            swap(data[index], data[hash]);
            if(hash < index)
                index++;
        } else {
            index++;
        }
    }


    size_t swapPos = 0;
    foreach(i; 0..data.length) {
        if(data[i] != sentinel && i == getHash(data[i]) % data.length) {
            swap(data[i], data[swapPos++]);
        }
    }

    size_t sentinelPos = data.length;
    for(size_t i = swapPos; i < sentinelPos;) {
        if(data[i] == sentinel) {
            swap(data[i], data[--sentinelPos]);
        } else {
            i++;
        }
    }

    dataIn = dataIn[0..sentinelPos + start + 1];
    uniqueInPlaceImpl(dataIn, start + swapPos + 1);
}
dsimcha
A: 
void f(std::vector<int>& data) {
    std::sort(data.begin(), data.end());
    data.erase(std::unique(data.begin(), data.end()), data.end());
}
Olexiy
that looks like C++, not C
Peter Recore
Yep, a bit :( Misread the question.
Olexiy
+1  A: 

This can be done in one pass with an O(N log N) algorithm and no extra storage.

Proceed from element a[1] to a[N]. At each stage i, all of the elements to the left of a[i] comprise a sorted heap of elements a[0] through a[j]. Meanwhile, a second index j, initially 0, keeps track of the size of the heap.

Examine a[i] and insert it into the heap, which now occupies elements a[0] to a[j+1]. As the element is inserted, if a duplicate element a[k] with the same value is encountered, do not insert a[i] into the heap (i.e., discard it); otherwise insert it into the heap, which grows by one element and now comprises a[0] to a[j+1], and increment j.

Continue in this manner, incrementing i until all of the array elements have been examined and inserted into the heap, which ends up occupying a[0] to a[j]. j is the index of the last element of the heap, and the heap contains only unique element values.

int algorithm(int[] a, int n)
{
    int   i, j;  

    for (j = 0, i = 1;  i < n;  i++)
    {
        // Insert a[i] into the heap a[0...j]
        if (heapInsert(a, j, a[i]))
            j++;
    }
    return j;
}  

bool heapInsert(int a[], int n, int val)
{
    // Insert val into heap a[0...n]
    ...code omitted for brevity...
    if (duplicate element a[k] == val)
        return false;
    a[k] = val;
    return true;
}

Looking at the example, this is not exactly what was asked for, since the expected output preserves the original element order while a heap does not. But if that requirement is relaxed, the algorithm above should do the trick.

Loadmaster
+1  A: 

1. Using O(1) extra space, in O(n log n) time

This is possible, for instance:

  • first do an in-place O(n log n) sort
  • then walk through the list once, writing the first instance of every element back to the beginning of the list

I believe ejel's partner is correct that the best way to do this would be an in-place merge sort with a simplified merge step, and that this is probably the intent of the question, if you were e.g. writing a new library function to do this as efficiently as possible with no ability to improve the inputs. There would be cases where it would be useful to do so without a hash-table, depending on the sorts of inputs. But I haven't actually checked this.

2. Using O(lots) extra space, in O(n) time

  • declare a zeroed array big enough to hold a flag for every possible integer
  • walk through the array once
  • set the corresponding flag to 1 for each integer
  • if the flag was already 1, skip that integer

This only works if several questionable assumptions hold:

  • it's possible to zero memory cheaply, or the size of the ints is small compared to the number of them
  • you're happy to ask your OS for 256^sizeof(int) entries' worth of memory
  • and it will cache it for you really, really efficiently if it's gigantic

It's a bad answer, but if you have LOTS of input elements, and they're all 8-bit integers (or maybe even 16-bit integers), it could be the best way.

3. O(little)-ish extra space, O(n)-ish time

As #2, but use a hash table.

4. The clear way

If the number of elements is small, a clever algorithm is not worthwhile when simpler code is quicker to write and quicker to read.

E.g. walk through the array once for each unique element (i.e. the first element, the second element (duplicates of the first having been removed), etc.), removing all identical elements after it. O(1) extra space, O(n^2) time.

E.g. use library functions which do this. Efficiency depends on which ones you have easily available.

+1  A: 

How about the following?

int *temp = malloc(sizeof(int) * len);   /* needs <stdlib.h> and <string.h> */
int count = 0;
int x = 0;
int y = 0;
for (x = 0; x < len; x++)
{
    for (y = 0; y < count; y++)
    {
        if (temp[y] == array[x])
        {
            break;
        }
    }
    if (y == count)                      /* not seen before: keep it */
    {
        temp[count] = array[x];
        count++;
    }
}
memcpy(array, temp, sizeof(int) * count); /* copy only the unique values */
free(temp);

I declare a temp array, put the unique elements into it, and then copy everything back to the original array.

Charith