views:

175

answers:

4

I've seen people say that set objects in Python have O(1) membership-checking. How are they implemented internally to allow this? What sort of data structure do they use? What other implications does that implementation have?

Every answer here was really enlightening, but I can only accept one, so I'll go with the closest answer to my original question. Thanks all for the info!

+7  A: 

According to this thread:

Indeed, CPython's sets are implemented as something like dictionaries with dummy values (the keys being the members of the set), with some optimization(s) that exploit this lack of values

So basically a set uses a hashtable as its underlying data structure. This explains the O(1) membership checking, since looking up an item in a hashtable is an O(1) operation on average.

If you are so inclined, you can even browse the CPython source code for set which, according to Achim Domma, is mostly a cut-and-paste from the dict implementation.
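The "dictionaries with dummy values" idea from the quote is easy to sketch in plain Python. This toy class (the name and structure are mine, purely illustrative — not CPython's actual code) stores members as dict keys, so it inherits the dict's hash-based average O(1) lookup:

```python
class DictBackedSet:
    """Toy set built on a dict with dummy values, as in the quote above."""

    def __init__(self, iterable=()):
        # each member becomes a key; the value is just a placeholder
        self._d = {x: None for x in iterable}

    def add(self, x):
        self._d[x] = None

    def __contains__(self, x):
        # delegates to the dict's hash-based key lookup: O(1) on average
        return x in self._d

    def __len__(self):
        return len(self._d)


s = DictBackedSet([1, 2, 3])
print(2 in s)   # membership check goes through the dict's hashtable
s.add(4)
```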

Justin Ethier
IIRC, the original `set` implementation actually *was* `dict` with dummy values, and it got optimized later.
dan04
+8  A: 

I think it's a common mistake: set lookups (or hashtable lookups, for that matter) are not O(1).
From Wikipedia:

In the simplest model, the hash function is completely unspecified and the table does not resize. For the best possible choice of hash function, a table of size n with open addressing has no collisions and holds up to n elements, with a single comparison for successful lookup, and a table of size n with chaining and k keys has the minimum max(0, k-n) collisions and O(1 + k/n) comparisons for lookup. For the worst choice of hash function, every insertion causes a collision, and hash tables degenerate to linear search, with Ω(k) amortized comparisons per insertion and up to k comparisons for a successful lookup.

Related: Is a Java hashmap really O(1)?
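The degenerate case from that quote — "every insertion causes a collision, and hash tables degenerate to linear search" — can be reproduced in Python by giving every key the same hash. This is a contrived class purely for illustration; the counter tracks how many equality comparisons the set performs, and for a miss it grows linearly with the set size:

```python
class Collider:
    """A key with a deliberately terrible hash: every instance collides."""

    eq_calls = 0  # counts equality comparisons made by the set

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return 42  # worst-case hash: all keys land in the same probe chain

    def __eq__(self, other):
        Collider.eq_calls += 1
        return isinstance(other, Collider) and self.value == other.value


small = {Collider(i) for i in range(10)}
big = {Collider(i) for i in range(1000)}

Collider.eq_calls = 0
Collider(-1) in small           # a miss must probe past every colliding key
small_probes = Collider.eq_calls

Collider.eq_calls = 0
Collider(-1) in big
big_probes = Collider.eq_calls
# big_probes dwarfs small_probes: with all keys colliding,
# lookup has degenerated to a linear search
```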

Shay Erlichmen
But they do take constant time to look up items:

python -m timeit -s "s = set(range(10))" "5 in s"
10000000 loops, best of 3: 0.0642 usec per loop

python -m timeit -s "s = set(range(10000000))" "5 in s"
10000000 loops, best of 3: 0.0634 usec per loop

... and that's the largest set that doesn't throw MemoryErrors
THC4k
@THC4k All you proved is that looking up X is done in constant time, but it doesn't mean that looking up X+Y will take the same amount of time, which is what O(1) is all about.
Shay Erlichmen
I thought it meant that the time to do a lookup was independent of the number of stored values.
intuited
@intuited: It does, but the test run above doesn't prove that you can look up "5" in the same time you can look up "485398", or some other number that might be in a horrible collision space. It's not about looking up the same element in a differently-sized hash table in the same time (in fact, that's not required at all), but rather about whether you can access each entry in the current table in the same amount of time - something which is basically impossible for hash tables to accomplish, since there will generally always be collisions.
Nick Bastin
In other words, the time to do a lookup depends on the number of stored values, because that increases the likelihood of collisions.
intuited
@intuited: no, that's incorrect. When the number of stored values increases, Python will automatically increase the size of the hashtable, and the rate of collisions stays roughly constant. Assuming an evenly distributed O(1) hash algorithm, hashtable lookup is *amortized* O(1). You might want to watch the video presentation "The Mighty Dictionary" http://python.mirocommunity.org/video/1591/pycon-2010-the-mighty-dictiona
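That resizing is observable from Python itself (a quick illustrative check, not from the thread): `sys.getsizeof` shows the set's backing table growing in occasional jumps rather than per element, which is what keeps the load factor, and hence the collision rate, bounded.

```python
import sys

s = set()
sizes = []
for i in range(1000):
    s.add(i)
    sizes.append(sys.getsizeof(s))

# only a handful of distinct sizes appear: the table is
# reallocated occasionally, not on every insertion
growth_points = sorted(set(sizes))
print(growth_points)
```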
Lie Ryan
@Lie Ryan: Yes, it depends on whether we're talking about the context of the original question (which asks about the CPython implementation) or the context of this answer, which is about "the simplest model" where "the hash does not resize". Though it's arguable that after increasing the size of the hashtable it's no longer the same hashtable, so I might get off on a technicality anyway in that case :)
intuited
+3  A: 

We all have easy access to the source, where the comment preceding set_lookkey() says:

/*
The basic lookup function used by all operations.
This is based on Algorithm D from Knuth Vol. 3, Sec. 6.4.
Open addressing is preferred over chaining since the link overhead for
chaining would be substantial (100% with typical malloc overhead).

The initial probe index is computed as hash mod the table size. Subsequent
probe indices are computed as explained in Objects/dictobject.c.

All arithmetic on hash should ignore overflow.

Unlike the dictionary implementation, the lookkey functions can return
NULL if the rich comparison returns an error.
*/
static setentry *
set_lookkey(PySetObject *so, PyObject *key, register long hash)
{
...
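The probing scheme that comment describes can be mimicked in a few lines of Python. This is a deliberately simplified sketch: it steps linearly through slots, whereas real CPython perturbs the probe sequence with the high bits of the hash, as Objects/dictobject.c explains.

```python
def toy_insert(table, key):
    """Place key at the first empty slot along its probe sequence."""
    i = hash(key) % len(table)      # initial probe: hash mod table size
    while table[i] is not None:
        if table[i] == key:
            return                   # already present
        i = (i + 1) % len(table)     # simplified linear probing
    table[i] = key


def toy_lookup(table, key):
    """Probe until we find the key (hit) or an empty slot (miss)."""
    i = hash(key) % len(table)
    while table[i] is not None:
        if table[i] == key:
            return True
        i = (i + 1) % len(table)
    return False                     # empty slot reached: key is absent


table = [None] * 8
for k in (3, 11, 19):   # all three hash to index 3 in a size-8 table
    toy_insert(table, k)
```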
gimel
+5  A: 

When people say sets have O(1) membership-checking, they are talking about the average case. In the worst case (when all hashed values collide) membership-checking is O(n). See the Python wiki on time complexity.

The Wikipedia article says the best case time complexity for a hash table that does not resize is O(1 + k/n). This result does not directly apply to Python sets since Python sets use a hash table that resizes.

A little further on the Wikipedia article says that for the average case, and assuming a simple uniform hashing function, the time complexity is O(1/(1-k/n)), where k/n can be bounded by a constant c<1.

Big-O refers only to asymptotic behavior as n → ∞. Since k/n can be bounded by a constant, c<1, independent of n,

O(1/(1-k/n)) is no bigger than O(1/(1-c)) which is equivalent to O(constant) = O(1).

So assuming uniform simple hashing, on average, membership-checking for Python sets is O(1).
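That derivation can be checked empirically with a toy open-addressing table held at a fixed load factor (a simulation, not CPython's actual implementation): the average number of probes per successful lookup stays roughly constant no matter how large n gets, because k/n is held below a constant c<1.

```python
import random


def avg_probes(n, load=2 / 3):
    """Average probes for a successful lookup in a linear-probing
    table kept at a fixed load factor."""
    random.seed(0)                       # deterministic for reproducibility
    size = int(n / load)                 # table sized so k/n stays at `load`
    table = [None] * size
    keys = random.sample(range(10**9), n)

    for k in keys:                       # insert with linear probing
        i = k % size
        while table[i] is not None:
            i = (i + 1) % size
        table[i] = k

    total = 0
    for k in keys:                       # count probes for each successful lookup
        i, probes = k % size, 1
        while table[i] != k:
            i = (i + 1) % size
            probes += 1
        total += probes
    return total / n


print(avg_probes(1_000))    # both come out small and similar,
print(avg_probes(50_000))   # independent of n
```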

unutbu
Thanks for the links and clear explanation!
Daenyth
+1 just for the link to the time complexity wiki; I haven't read the rest of the answer yet.
intuited