views: 434
answers: 7

Is there any method to remove the duplicate elements in an array, in place, in C/C++ in O(n)? Suppose the elements are a[5] = {1,2,2,3,4}; then the resulting array should contain {1,2,3,4}. The solution can be achieved using two for loops, but that would involve O(n^2), I believe.
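
A minimal sketch of the two-loop O(n^2) in-place approach the question refers to (the function name is illustrative):

// Keeps the first occurrence of each value; O(n^2) time, O(1) extra space.
int remove_duplicates(int a[], int n)
{
    int len = 0;                      // length of the deduplicated prefix
    for (int i = 0; i < n; ++i) {
        int j;
        for (j = 0; j < len; ++j)     // has a[i] already been kept?
            if (a[j] == a[i]) break;
        if (j == len)
            a[len++] = a[i];          // keep it, preserving order
    }
    return len;                       // new logical length, e.g. 4 for {1,2,2,3,4}
}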

+8  A: 

If, and only if, the source array is sorted, this can be done in linear time:

std::unique(a, a + 5); //Returns a pointer to the new logical end of a.

Otherwise you'll have to sort first, which is n lg n.
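
For an unsorted array, a minimal sketch of the sort-then-unique combination (array contents chosen purely for illustration):

#include <algorithm>

int main()
{
    int a[6] = {3, 1, 2, 2, 4, 1};
    std::sort(a, a + 6);                   // O(n log n) comparison sort
    int* new_end = std::unique(a, a + 6);  // O(n); shifts unique elements to the front
    // Elements in [a, new_end) are {1, 2, 3, 4}; the tail past new_end is unspecified.
}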

Billy ONeal
Just not true. Look at my answer.
Borealid
Sorting isn't necessarily n lg n. Depending on the data, there might be O(n) sorts available (e.g. counting sort, bucket sort).
jamesdlin
@Borealid: See my comment on your answer. @jamesdlin: True, but A. most such sorts don't perform well in practice, and B. I was assuming use of `std::sort`, which is a comparison sort.
Billy ONeal
In-place means the only efficient sort available is heap sort. Everything else is `O(n^2)` or non-constant space. Most have both problems.
R..
@R. `std::sort` is in place and n lg n. It's usually a form of Introsort.
Billy ONeal
To clarify, `sort` has n lg n average case complexity (and absolutely no worst case complexity requirement). `stable_sort` can be used to get n (lg n)^2 worst case complexity.
James McNellis
@James: But I don't believe `stable_sort` is in-place... is it?
Billy ONeal
@Billy: There isn't a space complexity requirement for `sort` or for `stable_sort`, I don't think.
James McNellis
+3  A: 

Yes. Because access (insertion or lookup) on a hashtable is O(1), you can remove duplicates in O(N).

Pseudocode:

hashtable h = {}
numdups = 0
for (i = 0; i < input.length; i++) {
    if (!h.contains(input[i])) {
        input[i-numdups] = input[i]
        h.add(input[i])
    } else {
        numdups = numdups + 1
    }
}

This is O(N).

Some commenters have pointed out that whether a hashtable is O(1) depends on a number of things. But in the real world, with a good hash, you can expect constant-time performance. And it is possible to engineer a hash that is O(1) to satisfy the theoreticians.
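
A C++ sketch of the same idea using std::unordered_set (the function name and use of std::vector are illustrative; average O(1) per lookup/insert, O(n) extra space for the set):

#include <unordered_set>
#include <vector>

// Keeps the first occurrence of each value, compacting the vector in place.
// Average O(n) time overall; the worst case degrades if the hash collides badly.
std::size_t remove_dups(std::vector<int>& input)
{
    std::unordered_set<int> seen;
    std::size_t write = 0;
    for (std::size_t read = 0; read < input.size(); ++read) {
        if (seen.insert(input[read]).second) {   // .second is true if not seen before
            input[write++] = input[read];
        }
    }
    input.resize(write);
    return write;   // new length
}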

Borealid
"Assuming noncolliding hash" is a very bold statement.
zneak
@zneak: For any finite set of values there exists a noncolliding hash. And having a solution that approaches O(N) as your resources increase is really quite good. Besides, it's easy to produce a hash with minimal collisions in the average case.
Borealid
Perhaps you should be explicit about adding the element to the hash as well...
dmckee
@dmckee: Done, thank you. Obvious, but should still be there.
Borealid
The O(1) complexity of a hashtable lookup/insertion is not contingent on a non-colliding hash function. (As per the birthday paradox, collisions will happen quickly: that is, for a table of size m, with good probability once around sqrt(m) elements have been inserted.) The O(1) complexity is based on a combination of things: a reasonably good (close to uniform, without being collision-free) hash function and the ability to dynamically resize the table at small (eventually amortized) cost.
Jérémie
You don't need to ensure no collisions. You only need to ensure that the number of collisions per access is bounded by a constant. This is more or less achieved if the hashtable grows by doubling - i.e. there are some claims which are IMO a little dubious, but they only really break down if your number of items gets too close to the maximum size of hashtable your hash function can support. On the current transition from 32 to 64 bits maybe that can be a real-world problem, but normally it would mean the hashtable won't fit in memory anyway.
Steve314
This isn't a terribly good solution. For one thing, your assumption of O(1) for the hashmap is wrong. There's no such thing as a non-colliding hash. But on top of that the solution is O(n) space.
wilhelmtell
Well pointed out Jeremie.
Matt Joiner
@Borealid: Your algorithm is worst case quadratic. Mine is worst case n lg n. Claiming your algorithm is always linear time is simply incorrect, though with a good hash it will be linear in the *average* case. But in algorithm design, we're usually talking about worst case, not best case, performance.
Billy ONeal
-1 because the question explicitly specifies `in place`. This algorithm is not.
Billy ONeal
@Jérémie: Entirely correct. I didn't tightly bound my constraints; I thought "good enough, and still correct". @Steve314: What you say is right, although you don't really back it up. @wilhelmtell: Your first statement is wrong. For any finite set of values there are actually an infinite number of noncolliding hashes. But, as pointed out in several comments, the solution's correctness doesn't hinge on that anyhow. And yes, this solution is O(N) space, but *only* because of the result array. Not because of the hashtable.
Borealid
@Billy ONeal and @wilhelmtell: Amended to no longer require a second output array. Happy? Really, the idea isn't any different whether you do it in-place or out. But it looked prettier with a separate result array :-(.
Borealid
@Billy ONeal: if the hash is chosen given knowledge of the input bounds (even *without* knowledge of the particular input) and with large space available, this algorithm is worst case linear time overall. If given knowledge of the particular input, it only needs O(N) space to be O(N) time.
Borealid
@Borealid: No, you are incorrect. Insertion into a hash table, regardless of hash, is worst case linear. Period. Therefore the algorithm is quadratic. The reason it's not in place is because of the hash table, not because of the result array you eliminated.
Billy ONeal
@Billy ONeal: If there are guaranteed to be no hash collisions, then insertion into a hash table consists of two steps: First, compute the hash of the input. Second, insert the input into an array at the position indicated by the hash. Which of those two steps makes it linear? And don't say "the hash algorithm is O(N)", because while true, each element is only hashed once. And whether or not using an auxiliary structure to hold *some* information is still "in-place" is debatable. I dare you to show me a truly "in-place" sort by that definition.
Borealid
@Borealid: The problem with your statement is the first part -- the "if". Having a perfect hash is the best case for a hash table structure. Having a bad hash is the worst case.
Billy ONeal
Hash tables, unless the hash function is amazingly horrid, are generally considered to have an *amortized* cost of O(1) for insertions. This is a reasonable solution. +1
kbrimington
-1 for wrong answer. Hash table is not O(1) and does not remotely resemble in-place.
R..
-1 it doesn't solve the question because it is not in-place: you need at least 2N space, but really much more if you want to avoid collisions. Regardless of that, the use of a hashtable is overkill.
Ismael
If only I had [insert something of tiny but nonzero value] for every time someone called hash tables O(1) on this site.....
R..
Look, it's still accurate to call hash tables O(1). With a uniform hash, as the table size approaches infinity, the number of expected collisions approaches zero. So in the strictest sense they are approaching average-case O(1), and **for this problem** they are approaching a worst-case O(1) (because a uniform sufficiently large hash will not store two nonidentical items at the same location, and we never store the same item twice here). If you were to actually run the code I've got above and graph the time vs input size, it would *not* scale superlinearly.
Borealid
@Borealid, hash functions are inherently bound to probability. There is no infinity of anything in computer hardware, and practically no hash function can guarantee a collisionless operation. Good hash functions therefore operate in big theta of 1, not big oh of 1. There is no "if there are no collisions", because you can't make that assumption. The worst case is an order of n collisions, and n^2 operations. You are filling a table with an order of n elements, so that's O(n) space. Because of the table you have O(n) space.
wilhelmtell
@Borealid if you say your algorithm is O(n) you're saying that inserting an element into the hash is O(1). You're saying it is impossible to have a collision, that collisions don't exist. You have to understand that there is no such hash function in existence. If you insist on using a theoretical algorithm that has no practical implementation, I suggest you take the leap and use quantum computers to get O(1) performance.
wilhelmtell
While you're at it, you could simply put all the elements of the array inside a hash set and be done with it.
zneak
@wilhelmtell - he's not assuming no collisions - only a constant bound to the number of collisions per op. @Borealid - wilhelmtell is saying that if you're incredibly unlucky every single key may have the exact same hash value, and for big O you have to assume that. With integer keys of the same width as the hash... it comes down to whether the machine's limits are a valid way to place an O(1) bound or not, really. If so, no data structure can grow larger than your pointer-width-limited address space, so even a linked list has O(1) search.
Steve314
I think Walter Bright made the point about hashtable performance deteriorating to the collision-handling performance in relation to a scripting language he wrote. Typical hashtables use linear lists for collision handling, but he used binary trees, and got better real-world performance as a result IIRC.
Steve314
This algorithm can be managed in place. You need two accessors into the array, one for the read location, and one for the write location. The generalization is obvious.
dmckee
@Borealid - in order for hashtable lookup to be O(1), you need to ensure that the ratio of the number of entries to the number of buckets (the load factor) never exceeds some fixed value. In practice, this means that you need to expand the primary hash table when it gets too full, and if N is likely to exceed 2**32 you need to use 64-bit hash codes.
Stephen C
A: 

Take your example. If the array elements are bounded integers, you can create a lookup bit array.

If you find an integer such as 3, turn the 3rd bit on. If you find an integer such as 5, turn the 5th bit on.

If the array contains elements other than integers, or the elements are not bounded, using a hashtable would be a good choice, since hashtable lookup cost is a constant.
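
For example, a sketch of the bit-array idea, assuming the values are known to fit in [0, 65536) (the bound and function name are assumptions for illustration):

#include <bitset>

// O(n) time; the bitset costs a fixed 8 KB regardless of n.
int remove_dups_bounded(unsigned short a[], int n)
{
    std::bitset<65536> seen;          // one bit per possible value, all initially off
    int write = 0;
    for (int read = 0; read < n; ++read) {
        if (!seen[a[read]]) {         // first time this value appears
            seen[a[read]] = true;     // turn its bit on
            a[write++] = a[read];
        }
    }
    return write;                     // new logical length
}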

SiLent SoNG
Hashtable lookup cost is `O(n)` not `O(1)`, unless you have constraints on the data which ensure a bound on the number of collisions. Expected performance is constant-time, but **big-O means worst case by definition**.
R..
@R: Big-O notation does not refer to worst case. ... **the worst case or average case** running time or memory usage of an algorithm is often expressed as a function of the length of its input using big-O notation... See wiki http://en.wikipedia.org/wiki/Big_O_notation.
SiLent SoNG
@R: In practice, some data structures have O(N) worst case but an amortized average cost of O(1), and they are still considered good. An example is vector: when it grows the cost is O(N) at worst, but on average it is O(1). My point is that Big-O notation is a measurement of the *worst case or average case*, not only the worst case.
SiLent SoNG
You're confusing terms. If you are considering amortized runtime, you still have a worst case and an average case and a best case. Big-Oh by definition deals with the worst case behavior, amortized or not.
Dennis Zickefoose
A: 

The example you have given is a sorted array. It is possible only in that case (given your constant space constraint).

Rahul
What constant space constraint? Does "in-place" imply a space constraint?
Steve314
I consider in-place to imply O(1) space. Otherwise you can just make a copy as temporary storage, do the work in the copy, store it back on top of the original, and call it "in-place". That's bogus.
R..
Not necessarily. First off, what about O(log n) space. Second, updating in-place can still be an optimisation even if you still need O(n) or more extra space. For example, you can create then sort an array of references to your data, then use that to sort your data in-place in O(n) time. The initial sort is still O(n log n), but if swaps of references are substantially quicker than swaps of your data items, this can be beneficial. Space restrictions aren't the only reason for doing updates in-place.
Steve314
+4  A: 

Best case is O(n log n). Perform a heap sort on the original array: O(n log n) in time, O(1)/in-place in space. Then run through the array sequentially with 2 indices (source & dest) to collapse out repetitions. This has the side effect of not preserving the original order, but since "remove duplicates" doesn't specify which duplicates to remove (first? second? last?), I'm hoping that you don't care that the order is lost.

If you do want to preserve the original order, there's no way to do things in-place. But it's trivial if you make an array of pointers to elements in the original array, do all your work on the pointers, and use them to collapse the original array at the end.

Anyone claiming it can be done in O(n) time and in-place is simply wrong, modulo some arguments about what O(n) and in-place mean. One obvious pseudo-solution, if your elements are 32-bit integers, is to use a 4-gigabit bit-array (512 megabytes in size) initialized to all zeros, flipping a bit on when you see that number and skipping over it if the bit was already on. Of course then you're taking advantage of the fact that n is bounded by a constant, so technically everything is O(1) but with a horrible constant factor. However, I do mention this approach since, if n is bounded by a small constant - for instance if you have 16-bit integers - it's a very practical solution.
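
A sketch of the heap-sort-then-collapse approach (std::make_heap/std::sort_heap stand in for a hand-written heap sort; the function name is illustrative):

#include <algorithm>

// O(n log n) time, O(1) extra space; original order is not preserved.
int* dedup_inplace(int* first, int* last)
{
    if (first == last) return last;
    std::make_heap(first, last);
    std::sort_heap(first, last);      // array is now sorted, in place
    int* dest = first;                // write index
    for (int* src = first + 1; src != last; ++src) {
        if (*src != *dest)
            *++dest = *src;           // copy the next distinct value forward
    }
    return dest + 1;                  // one past the last unique element
}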

R..
Unicode code points (20.1 bit) are another case where the bit-array solution would be very practical - for example if you want to get a list of which characters are used in a text. In such a case, the bit-array might even be the ideal form for the final output.
R..
"Anyone claiming it can be done in O(n) time and in-place is simply wrong, modulo some arguments about what O(n) and in-place mean" - if you have proof, the CS journals are waiting on your brilliant insight.
I don't have proof, just like nobody has proof that factoring huge numbers can't be done efficiently. I'd like to turn your challenge around: if anyone here has an `O(n)` time and in-place solution, the CS journals are waiting for their brilliant insight. In the mean time, I'm going to assume they're wrong.
R..
To clarify: proving it impossible, although impressive for its difficulty, would not revolutionize anything; it would just confirm what everyone already believes intuitively. On the other hand, an efficient solution (to either problem I mentioned) **would revolutionize computing**.
R..
Factoring, yes, removing duplicates on the unit-cost RAM in O(n) time with O(1) extra space, no.
A solution to the latter would probably imply (without too much work) vastly-more-efficient algorithms for a number of other things. Even if it didn't directly imply it, the solution/proof would likely include ideas with such applicability elsewhere.
R..
The element distinctness problem might be relevant here, for proving an Omega(n log n) bound (but in the comparison model).
Moron
+1  A: 

The canonical implementation of the unique() algorithm looks something like the following:

template<typename Fwd>
Fwd unique(Fwd first, Fwd last)
{
    if( first == last ) return first;
    Fwd result = first;
    while( ++first != last ) {
        if( !(*result == *first) )
            *(++result) = *first;
    }
    return ++result;
}

This algorithm takes a range of sorted elements. If the range is not sorted, sort it before invoking the algorithm. The algorithm will run in-place, and return an iterator pointing to one-past-the-last-element of the unique'd sequence.

If you can't sort the elements then you've cornered yourself: you have no choice but to use an algorithm with runtime performance worse than O(n) for the task.

This algorithm runs in O(n) time. That's big-O of n, worst case, not amortized time. It uses O(1) space.

wilhelmtell
thank you , that was helpful :)
pranay
+2  A: 

I'm going to suggest a variation on Borealid's answer, but I'll point out up front that it's cheating. Basically, it only works assuming some severe constraints on the values in the array - e.g. that all keys are 32-bit integers.

Instead of a hash table, the idea is to use a bitvector. This is an O(1) memory requirement which should in theory keep Rahul happy (but won't). With 32-bit integers, the bitvector will require 512MB (i.e. 2**32 bits) - assuming 8-bit bytes, as some pedant may point out.

As Borealid should point out, this is a hashtable - just using a trivial hash function. This does guarantee that there won't be any collisions. The only way there could be a collision is by having the same value in the input array twice - but since the whole point is to ignore the second and later occurrences, this doesn't matter.

Pseudocode for completeness...

src = dest = input.begin ();
while (src != input.end ())
{
  if (!bitvector [*src])
  {
    bitvector [*src] = true;
    *dest = *src; dest++;
  }
  src++;
}
//  at this point, dest gives the new end of the array

Just to be really silly (but theoretically correct), I'll also point out that the space requirement is still O(1) even if the array holds 64-bit integers. The constant term is a bit big, I agree, and you may have issues with 64-bit CPUs that can't actually use the full 64 bits of an address, but...

Steve314
+1 :-D I like this.
Borealid
Described by R.. 7 hours earlier. http://stackoverflow.com/questions/3432760/remove-duplicate-elements-in-an-array-in-on-in-c-c/3432959#3432959
Ben Voigt
Yes - and I definitely read that post, though not (that I remember) all of it. I plead human fallibility.
Steve314