views:

175

answers:

4

I've seen people say that set objects in Python have O(1) membership-checking. How are they implemented internally to allow this? What sort of data structure do they use? What other implications does that implementation have?

Every answer here was really enlightening, but I can only accept one, so I'll go with the closest answer to my original question. Thanks all for the info!

+7  A: 

According to this thread:

Indeed, CPython's sets are implemented as something like dictionaries with dummy values (the keys being the members of the set), with some optimization(s) that exploit this lack of values

So basically a set uses a hashtable as its underlying data structure. This explains the O(1) membership checking, since looking up an item in a hashtable is an O(1) operation on average.

If you are so inclined, you can even browse the CPython source code for set which, according to Achim Domma, is mostly a cut-and-paste from the dict implementation.
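The "dictionaries with dummy values" idea from the quote is easy to sketch in plain Python. This toy class (the name and structure are mine, purely illustrative — not CPython's actual code) stores members as dict keys, so it inherits the dict's hash-based average O(1) lookup:

```python
class DictBackedSet:
    """Toy set built on a dict with dummy values, as in the quote above."""

    def __init__(self, iterable=()):
        # each member becomes a key; the value is just a placeholder
        self._d = {x: None for x in iterable}

    def add(self, x):
        self._d[x] = None

    def __contains__(self, x):
        # delegates to the dict's hash-based key lookup: O(1) on average
        return x in self._d

    def __len__(self):
        return len(self._d)


s = DictBackedSet([1, 2, 3])
print(2 in s)   # membership check goes through the dict's hashtable
s.add(4)
```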

Justin Ethier
IIRC, the original `set` implementation actually *was* `dict` with dummy values, and it got optimized later.
dan04
+8  A: 

I think it's a common mistake: set lookups (or hashtable lookups, for that matter) are not O(1).
From Wikipedia:

In the simplest model, the hash function is completely unspecified and the table does not resize. For the best possible choice of hash function, a table of size n with open addressing has no collisions and holds up to n elements, with a single comparison for successful lookup, and a table of size n with chaining and k keys has the minimum max(0, k-n) collisions and O(1 + k/n) comparisons for lookup. For the worst choice of hash function, every insertion causes a collision, and hash tables degenerate to linear search, with Ω(k) amortized comparisons per insertion and up to k comparisons for a successful lookup.

Related: Is a Java hashmap really O(1)?
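The degenerate case from that quote — "every insertion causes a collision, and hash tables degenerate to linear search" — can be reproduced in Python by giving every key the same hash. This is a contrived class purely for illustration; the counter tracks how many equality comparisons the set performs, and for a miss it grows linearly with the set size:

```python
class Collider:
    """A key with a deliberately terrible hash: every instance collides."""

    eq_calls = 0  # counts equality comparisons made by the set

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return 42  # worst-case hash: all keys land in the same probe chain

    def __eq__(self, other):
        Collider.eq_calls += 1
        return isinstance(other, Collider) and self.value == other.value


small = {Collider(i) for i in range(10)}
big = {Collider(i) for i in range(1000)}

Collider.eq_calls = 0
Collider(-1) in small           # a miss must probe past every colliding key
small_probes = Collider.eq_calls

Collider.eq_calls = 0
Collider(-1) in big
big_probes = Collider.eq_calls
# big_probes dwarfs small_probes: with all keys colliding,
# lookup has degenerated to a linear search
```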

Shay Erlichmen
But they do take constant time to look up items:

python -m timeit -s "s = set(range(10))" "5 in s"
10000000 loops, best of 3: 0.0642 usec per loop

python -m timeit -s "s = set(range(10000000))" "5 in s"
10000000 loops, best of 3: 0.0634 usec per loop

... and that's the largest set that doesn't throw MemoryErrors
THC4k
@THC4k All you proved is that looking up X is done in constant time, but it doesn't mean that looking up X+Y will take the same amount of time, which is what O(1) is all about.
Shay Erlichmen
I thought it meant that the time to do a lookup was independent of the number of stored values.
intuited
@intuited: It does, but the test run above doesn't prove that you can look up "5" in the same time you can look up "485398", or some other number that might be in a horrible collision space. It's not about looking up the same element in a differently-sized hash table in the same time (in fact, that's not required at all), but rather about whether you can access each entry in the current table in the same amount of time - something which is basically impossible for hash tables to accomplish, since there will generally always be collisions.
Nick Bastin
In other words, the time to do a lookup depends on the number of stored values, because that increases the likelihood of collisions.
intuited
@intuited: no, that's incorrect. When the number of stored values increases, Python will automatically increase the size of the hashtable, and the rate of collisions stays roughly constant. Assuming an evenly distributed O(1) hash algorithm, hashtable lookup is *amortized* O(1). You might want to watch the video presentation "The Mighty Dictionary" http://python.mirocommunity.org/video/1591/pycon-2010-the-mighty-dictiona
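That resizing is observable from Python itself (a quick illustrative check, not from the thread): `sys.getsizeof` shows the set's backing table growing in occasional jumps rather than per element, which is what keeps the load factor, and hence the collision rate, bounded.

```python
import sys

s = set()
sizes = []
for i in range(1000):
    s.add(i)
    sizes.append(sys.getsizeof(s))

# only a handful of distinct sizes appear: the table is
# reallocated occasionally, not on every insertion
growth_points = sorted(set(sizes))
print(growth_points)
```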
Lie Ryan
@Lie Ryan: Yes, it depends on whether we're talking about the context of the original question (which asks about the CPython implementation) or the context of this answer, which is about "the simplest model" where "the hash does not resize". Though it's arguable that after increasing the size of the hashtable it's no longer the same hashtable, so I might get off on a technicality anyway in that case :)
intuited
+3  A: 

We all have easy access to the source, where the comment preceding set_lookkey() says:

/*
The basic lookup function used by all operations.
This is based on Algorithm D from Knuth Vol. 3, Sec. 6.4.
Open addressing is preferred over chaining since the link overhead for
chaining would be substantial (100% with typical malloc overhead).

The initial probe index is computed as hash mod the table size. Subsequent
probe indices are computed as explained in Objects/dictobject.c.

All arithmetic on hash should ignore overflow.

Unlike the dictionary implementation, the lookkey functions can return
NULL if the rich comparison returns an error.
*/
static setentry *
set_lookkey(PySetObject *so, PyObject *key, register long hash)
{
...
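The probing scheme that comment describes can be mimicked in a few lines of Python. This is a deliberately simplified sketch: it steps linearly through slots, whereas real CPython perturbs the probe sequence with the high bits of the hash, as Objects/dictobject.c explains.

```python
def toy_insert(table, key):
    """Place key at the first empty slot along its probe sequence."""
    i = hash(key) % len(table)      # initial probe: hash mod table size
    while table[i] is not None:
        if table[i] == key:
            return                   # already present
        i = (i + 1) % len(table)     # simplified linear probing
    table[i] = key


def toy_lookup(table, key):
    """Probe until we find the key (hit) or an empty slot (miss)."""
    i = hash(key) % len(table)
    while table[i] is not None:
        if table[i] == key:
            return True
        i = (i + 1) % len(table)
    return False                     # empty slot reached: key is absent


table = [None] * 8
for k in (3, 11, 19):   # all three hash to index 3 in a size-8 table
    toy_insert(table, k)
```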
gimel
+5  A: 

When people say sets have O(1) membership-checking, they are talking about the average case. In the worst case (when all hashed values collide) membership-checking is O(n). See the Python wiki on time complexity.

The Wikipedia article says the best case time complexity for a hash table that does not resize is O(1 + k/n). This result does not directly apply to Python sets since Python sets use a hash table that resizes.

A little further on the Wikipedia article says that for the average case, and assuming a simple uniform hashing function, the time complexity is O(1/(1-k/n)), where k/n can be bounded by a constant c<1.

Big-O refers only to asymptotic behavior as n → ∞. Since k/n can be bounded by a constant, c<1, independent of n,

O(1/(1-k/n)) is no bigger than O(1/(1-c)) which is equivalent to O(constant) = O(1).

So assuming uniform simple hashing, on average, membership-checking for Python sets is O(1).
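That derivation can be checked empirically with a toy open-addressing table held at a fixed load factor (a simulation, not CPython's actual implementation): the average number of probes per successful lookup stays roughly constant no matter how large n gets, because k/n is held below a constant c<1.

```python
import random


def avg_probes(n, load=2 / 3):
    """Average probes for a successful lookup in a linear-probing
    table kept at a fixed load factor."""
    random.seed(0)                       # deterministic for reproducibility
    size = int(n / load)                 # table sized so k/n stays at `load`
    table = [None] * size
    keys = random.sample(range(10**9), n)

    for k in keys:                       # insert with linear probing
        i = k % size
        while table[i] is not None:
            i = (i + 1) % size
        table[i] = k

    total = 0
    for k in keys:                       # count probes for each successful lookup
        i, probes = k % size, 1
        while table[i] != k:
            i = (i + 1) % size
            probes += 1
        total += probes
    return total / n


print(avg_probes(1_000))    # both come out small and similar,
print(avg_probes(50_000))   # independent of n
```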

unutbu
Thanks for the links and clear explanation!
Daenyth
+1 just for the link to the time complexity wiki; I haven't read the rest of the answer yet.
intuited