views:

491

answers:

5

What is the most efficient way to remove duplicate items from an array under the constraint that axillary memory usage must be to a minimum, preferably small enough to not even require any heap allocations? Sorting seems like the obvious choice, but this is clearly not asymptotically efficient. Is there a better algorithm that can be done in place or close to in place? If sorting is the best choice, what kind of sort would be best for something like this?

A: 

If you sort the array, you will still need another pass to remove duplicates, so the complexity is O(N*N) in the worst case (assuming Quicksort), or O(N*sqrt(N)) using Shellsort.

You can achieve O(N*N) by simply scanning the array for each element removing duplicates as you go.

Here is an example in Lua:

function removedups (t)
    local result = {}
    local count = 0
    local found
    for i,v in ipairs(t) do
     found = false
     if count > 0 then
         for j = 1,count do
           if v == result[j] then found = true; break end
       end
     end
      if not found then 
          count = count + 1
          result[count] = v 
      end
    end
    return result, count
end
Doug Currie
+1  A: 

Keeping auxillary memory usage to a minimum, your best bet would be to do an efficient sort to get them in order, then do a single pass of the array with a FROM and TO index.

You advance the FROM index every time through the loop. You only copy the element from FROM to TO (and increment TO) when the key is different from the last.

With Quicksort, that'll average to O(n-log-n) and O(n) for the final pass.

paxdiablo
A: 

I don't see any way to do this without something like a bubblesort. When you find a dupe, you need to reduce the length of the array. Quicksort is not designed for the size of the array to change.

This algorithm is always O(n^2) but it also use almost no extra memory -- stack or heap.

// returns the new size
int bubblesqueeze(int* a, int size) {
   for (int j = 0; j < size - 1; ++j) {
      for (int i = j + 1; i < size; ++i) {
          // when a dupe is found, move the end value to index j
          // and shrink the size of the array
          while (i < size && a[i] == a[j]) {
             a[i] = a[--size];
          }
          if (i < size && a[i] < a[j]) {
             int tmp = a[j];
             a[j] = a[i];
             a[i] = tmp;
          }
      }
   }
   return size;
}
jmucchiello
A: 

Is you have two different var for traversing a datadet insted of just one then you can limit the output by dismissing all diplicates that currently are already in the dataset.

Obvious this example in C is not an efficiant sorting algorith but it is just an example on one way to look at the probkem.

You could also blindly sort the data first and then relocate the data for removing dups, but I'm not sure that would be faster.

#define ARRAY_LENGTH 15
int stop = 1;
int scan_sort[ARRAY_LENGTH] = {5,2,3,5,1,2,5,4,3,5,4,8,6,4,1};

void step_relocate(char tmp,char s,int *dataset)
{
    for(;tmp<s;s--)
     dataset[s] = dataset[s-1];
}
int exists(int var,int *dataset)
{
    int tmp=0;
    for(;tmp < stop; tmp++)
    {
     if( dataset[tmp] == var)
      return 1;/* value exsist */
     if( dataset[tmp] > var)
      tmp=stop;/* Value not in array*/
    }
    return 0;/* Value not in array*/
}
void main(void)
{
    int tmp1=0;
    int tmp2=0;
    int index = 1;
    while(index < ARRAY_LENGTH)
    {
     if(exists(scan_sort[index],scan_sort))
      ;/* Dismiss all values currently in the final dataset */
     else if(scan_sort[stop-1] < scan_sort[index])
     {
      scan_sort[stop] = scan_sort[index];/* Insert the value as the highest one */
      stop++;/* One more value adde to the final dataset */
     }
     else
     {
      for(tmp1=0;tmp1<stop;tmp1++)/* find where the data shall be inserted */
      {
       if(scan_sort[index] < scan_sort[tmp1])
       {
        index = index;
        break;
       }
      }
      tmp2 = scan_sort[index]; /* Store in case this value is the next after stop*/
      step_relocate(tmp1,stop,scan_sort);/* Relocated data already in the dataset*/
      scan_sort[tmp1] = tmp2;/* insert the new value */
      stop++;/* One more value adde to the final dataset */
     }
     index++;
    }
    printf("Result: ");
    for(tmp1 = 0; tmp1 < stop; tmp1++)
     printf( "%d ",scan_sort[tmp1]);
    printf("\n");
    system( "pause" );
}

I liked the problem so I wrote a simple C test prog for it as you can see above. Make a comment if I should elaborate or you see any faults.

eaanon01
A: 

I'll answer my own question since, after posting, I came up with a really clever algorithm to do this. It uses hashing, building something like a hash set in place. It's guaranteed to be O(1) in axillary space (the recursion is a tail call), and is typically O(N) time complexity. The algorithm is as follows:

  1. Take the first element of the array, this will be the sentinel.
  2. Reorder the rest of the array, as much as possible, such that each element is in the position corresponding to its hash. As this step is completed, duplicates will be discovered. Set them equal to sentinel.
  3. Move all elements for which the index is equal to the hash to the beginning of the array.
  4. Move all elements that are equal to sentinel, except the first element of the array, to the end of the array.
  5. What's left between the properly hashed elements and the duplicate elements will be the elements that couldn't be placed in the index corresponding to their hash because of a collision. Recurse to deal with these elements.

This can be shown to be O(N) provided no pathological scenario in the hashing: Even if there are no duplicates, approximately 2/3 of the elements will be eliminated at each recursion. Each level of recursion is O(n) where small n is the amount of elements left. The only problem is that, in practice, it's slower than a quick sort when there are few duplicates, i.e. lots of collisions. However, when there are huge amounts of duplicates, it's amazingly fast.

Edit: In current implementations of D, hash_t is 32 bits. Everything about this algorithm assumes that there will be very few, if any, hash collisions in full 32-bit space. Collisions may, however, occur frequently in the modulus space. However, this assumption will in all likelihood be true for any reasonably sized data set. If the key is less than or equal to 32 bits, it can be its own hash, meaning that a collision in full 32-bit space is impossible. If it is larger, you simply can't fit enough of them into 32-bit memory address space for it to be a problem. I assume hash_t will be increased to 64 bits in 64-bit implementations of D, where datasets can be larger. Furthermore, if this ever did prove to be a problem, one could change the hash function at each level of recursion.

Here's an implementation in the D programming language:

void uniqueInPlace(T)(ref T[] dataIn) {
    uniqueInPlaceImpl(dataIn, 0);
}

void uniqueInPlaceImpl(T)(ref T[] dataIn, size_t start) {
    if(dataIn.length - start < 2)
        return;

    invariant T sentinel = dataIn[start];
    T[] data = dataIn[start + 1..$];

    static hash_t getHash(T elem) {
        static if(is(T == uint) || is(T == int)) {
            return cast(hash_t) elem;
        } else static if(__traits(compiles, elem.toHash)) {
            return elem.toHash;
        } else {
            static auto ti = typeid(typeof(elem));
            return ti.getHash(&elem);
        }
    }

    for(size_t index = 0; index < data.length;) {
        if(data[index] == sentinel) {
            index++;
            continue;
        }

        auto hash = getHash(data[index]) % data.length;
        if(index == hash) {
            index++;
            continue;
        }

        if(data[index] == data[hash]) {
            data[index] = sentinel;
            index++;
            continue;
        }

        if(data[hash] == sentinel) {
            swap(data[hash], data[index]);
            index++;
            continue;
        }

        auto hashHash = getHash(data[hash]) % data.length;
        if(hashHash != hash) {
            swap(data[index], data[hash]);
            if(hash < index)
                index++;
        } else {
            index++;
        }
    }


    size_t swapPos = 0;
    foreach(i; 0..data.length) {
        if(data[i] != sentinel && i == getHash(data[i]) % data.length) {
            swap(data[i], data[swapPos++]);
        }
    }

    size_t sentinelPos = data.length;
    for(size_t i = swapPos; i < sentinelPos;) {
        if(data[i] == sentinel) {
            swap(data[i], data[--sentinelPos]);
        } else {
            i++;
        }
    }

    dataIn = dataIn[0..sentinelPos + start + 1];
    uniqueInPlaceImpl(dataIn, start + swapPos + 1);
}
dsimcha
How do all those hash tables not allocate space on the heap?
jmucchiello
You're not *creating* a hash table. You're just *reordering* the current array in place so that as many elements as possible are in the positions corresponding to their hash.
dsimcha