views: 353
answers: 6
I've always been a bit confused about this, possibly due to my lack of understanding of compilers. But let's use Python as an example. If we had some large list of numbers called numlist and wanted to get rid of any duplicates, we could use a set operation on the list, e.g. set(numlist). In return we would have a set of our numbers. This operation, to the best of my knowledge, will be done in O(n) time. Though if I were to create my own algorithm to handle this operation, the absolute best I could ever hope for is O(n^2).

What I don't get is, what allows an internal operation like set() to be so much faster than an algorithm external to the language? The checking still needs to be done, doesn't it?

+1  A: 

The complexity bound of an algorithm is completely unrelated to whether it is implemented 'internally' or 'externally'.

Mitch Wheat
A: 

Offhand I can't think of how to do this in O(n), but here is the cool thing:

The difference between n^2 and n is so massive that the difference between you implementing it and Python implementing it is tiny compared to the choice of algorithm. An O(n^2) algorithm is always worse than an O(n) one for large enough inputs, even if the O(n^2) one is written in C and the O(n) one in Python. You should never assume that kind of difference comes from the fact that you're not writing in a low-level language.

That said, if you want to implement your own, you can do a sort and then remove the duplicates: the sort is O(n log n) and removing the duplicates is O(n)...
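
A minimal sketch of that sort-then-scan approach in Python, assuming numlist is a list of comparable values (the function name is illustrative, not from the answer):

def dedup_by_sorting(numlist):
    """Remove duplicates via a sort plus a linear scan: O(n log n) overall."""
    if not numlist:
        return []
    ordered = sorted(numlist)       # O(n log n)
    result = [ordered[0]]
    for value in ordered[1:]:       # O(n) pass; duplicates are now adjacent
        if value != result[-1]:
            result.append(value)
    return result

print(dedup_by_sorting([3, 1, 3, 2, 1]))  # [1, 2, 3]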

Chris H
Don't fall into the trap of thinking that O(whatever) tells you everything about an algorithm's speed. An O(1) algorithm could take 10 minutes to run...
David Johnstone
+1  A: 

Taking a list and turning it into a set through set() is O(n).

This is because set is implemented as a hash set. That means that checking whether something is in the set, or adding something to the set, takes only O(1), constant time. Thus, to make a set from an iterable (like a list, for example), you just start with an empty set and add the elements of the iterable one by one. Since there are n elements and each insertion takes O(1), the total time to convert an iterable to a set is O(n).

To understand how the hash implementation works, see the Wikipedia article on hash tables.
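
As a rough sketch of that element-by-element construction (to_set is an illustrative name, assuming the elements are hashable):

def to_set(numlist):
    """Build a set one element at a time; each add is O(1) on average."""
    result = set()
    for item in numlist:    # n iterations
        result.add(item)    # average O(1) hash insert
    return result

numlist = [4, 2, 4, 1, 2]
assert to_set(numlist) == set(numlist)  # same result as the built-in, in O(n)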

Danny Roberts
Are you sure they don't use a hash table to get linear time?
Eli
Sets are usually stored as hash tables.
Jason
You're both absolutely right; I don't know what I was thinking.
Danny Roberts
So what's the complexity, O(n log n)?
Recursion
+2  A: 

You can do this in Θ(n) average time using a hash table. Lookup and insertion in a hash table are Θ(1) on average. Thus, you just run through the n items and for each one check whether it is already in the hash table and, if not, insert it.

What I don't get is, what allows an internal operation like set() to be so much faster than an algorithm external to the language? The checking still needs to be done, doesn't it?

The asymptotic complexity of an algorithm does not change based on whether it is implemented by the language implementers or by a user of the language. As long as both are implemented in a Turing-complete language with a random access memory model, they have the same capabilities, and algorithms implemented in each will have the same asymptotic complexity. If an algorithm is theoretically O(f(n)), it does not matter whether it is implemented in assembly language, C#, or Python; it will still be O(f(n)).
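
For instance, the check-then-insert loop described above written as plain Python (a sketch; a set stands in here for the hash table) is still O(n) on average, just like the built-in set():

def dedup(items):
    """Remove duplicates with a hash-based 'seen' set: average O(n)."""
    seen = set()                  # plays the role of the hash table
    result = []
    for item in items:            # n iterations
        if item not in seen:      # average O(1) lookup
            seen.add(item)        # average O(1) insert
            result.append(item)
    return result

print(dedup([5, 3, 5, 1, 3]))  # [5, 3, 1] (first occurrences, order preserved)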

Jason
I still don't agree with this "on average" stuff. O(n) is supposed to say what happens as the size of your data sets approaches infinity and I don't care how perfect your hash is, you can't look up an infinite number of items in a hash table in O(1).
paxdiablo
@paxdiablo: You are making the mistake of confusing "arbitrarily large but finite" with "infinite." These are very different concepts. Asymptotic analysis never considers infinite data structures, just data structures that are arbitrarily large but still finite.
Jason
Okay. So, with a hashing function that only allows for a million slots, what is the big-O behaviour of searching for a key among an arbitrarily large set of, say, a billion values? I can't see how you could claim that to be O(1), since it depends very much on N: you need to do a sequential search (or a binary one) within the slot to find your item, yes? Neither of those operations is O(1).
paxdiablo
Re hashing: The O(1) for hashing is always a bit tricky, and for good reason. The complexity is actually in the word-RAM model of computation, with the assumption that keys are single words. Thus arithmetic operations on the keys take constant time, as does indexing into arrays. The complexity can further be amortized (averaged over sequential operations), so n operations take O(n) time, but an individual operation might take a while. It might also be allowed to use randomization, but that's a side issue. The main point is that operations on the keys are assumed to be fast.
A. Rex
@paxdiablo: First, that's a poorly designed hash table as the load factor is too high. But, let's say you're going to force me to use a hash table with such a high load factor. Then I'll tell you I won't implement it using chaining in the hash table buckets. Instead, I'll hash again so that the buckets are themselves hash tables. This is known as dynamic perfect hashing and it gets around the problem of collisions. With such a design we are guaranteed average `O(1)` lookup. See http://en.wikipedia.org/wiki/Dynamic_perfect_hashing.
Jason
Re Turing completeness: if a language is Turing complete, it does NOT mean that it takes the same amount of time to finish, even up to big-O factors. Turing-completeness only means that anything you can do in one language, you can also do in the other, with no regard for complexity. Indeed, some things are faster in one model. For example, checking if the last character of the input is a "0" is faster on a random access machine than on a Turing machine, which must scan to the end in linear time. Similarly, Brainfuck is actually asymptotically slower than a language with random access.
A. Rex
@A. Rex: You're right; I was being careless. I have amended my thought more completely to a correct (at least I hope) and relevant statement. Thanks for the correction.
Jason
A: 

There are two issues here.

Time complexity (which is expressed in big O notation) is a formal measure of how long an algorithm takes to run for a given input size. It's more about how well an algorithm scales than about its absolute speed.

The actual speed (say, in milliseconds) of an algorithm is the time complexity multiplied by a constant (in an ideal world).

Two people could implement the same duplicate-removal algorithm with O(n log n) complexity, but if one writes it in Python and the other writes it in optimised C, then the C program will be faster.
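
One rough way to see that constant-factor gap (a sketch with arbitrary sizes; no particular timings implied) is to time the C-implemented set() against an equivalent interpreted-Python loop, both of which are O(n):

import timeit

numlist = list(range(100000)) * 2   # 200,000 items with duplicates

def python_loop_dedup(items):
    seen = set()
    for item in items:
        if item not in seen:
            seen.add(item)
    return seen

print(timeit.timeit(lambda: set(numlist), number=20))                # built-in, runs in C
print(timeit.timeit(lambda: python_loop_dedup(numlist), number=20))  # same O(n), bigger constant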

David Johnstone
+1  A: 

You can do this in O(n) in any language, basically as:

# Get min and max values: O(n).

minVal = oldList[0]
maxVal = oldList[0]
for value in oldList:
    if value < minVal:
        minVal = value
    if value > maxVal:
        maxVal = value

# Initialise boolean list: O(n).

isInList = [False] * (maxVal - minVal + 1)

# Set booleans for values in the old list: O(n).

for value in oldList:
    isInList[value - minVal] = True

# Create new list from booleans: O(n) (or O(1) based on integer range).

newList = []
for i in range(minVal, maxVal + 1):
    if isInList[i - minVal]:
        newList.append(i)

I'm assuming here that append is an O(1) operation, which it should be unless the implementer was brain-dead. So with k steps each O(n), you still have an O(n) operation.

Whether the steps are explicitly done in your code or whether they're done under the covers of a language is irrelevant. Otherwise you could claim that the C qsort was one operation and you now have the holy grail of an O(1) sort routine :-)

As many people have discovered, you can often trade space complexity for time complexity. For example, the above only works because we're allowed to introduce the isInList and newList variables. If this were not allowed, the next best solution might be sorting the list (probably no better than O(n log n)) followed by an O(n) (I think) operation to remove the duplicates.

As an extreme example, you can use that same extra-space method to sort an arbitrary number of 32-bit integers (say, with each value having 255 or fewer duplicates) in O(n) time, provided you can allocate about four billion bytes for storing the counts.

Simply initialise all the counts to zero and run through each position in your list, incrementing the count based on the number at that position. That's O(n).

Then start at the beginning of the list and run through the count array, placing that many of the correct value in the list. That's O(1), with the 1 being about four billion of course but still constant time :-)

That's also O(1) space complexity but a very big "1". Typically trade-offs aren't quite that severe.
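
A compact sketch of that counting approach, scaled down to 8-bit values so the count array is 256 entries instead of about four billion bytes:

def counting_sort_small(values):
    """Sort values in the range 0..255 by counting occurrences: O(n) time."""
    counts = [0] * 256                  # fixed-size count array (the "big constant" space)
    for v in values:                    # O(n): tally each value
        counts[v] += 1
    result = []
    for v in range(256):                # constant-size pass over the counts
        result.extend([v] * counts[v])
    return result

print(counting_sort_small([200, 3, 3, 255, 0]))  # [0, 3, 3, 200, 255]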

paxdiablo