views: 154
answers: 5

The question arose while answering another SO question (there).

When I iterate several times over a Python set (without changing it between calls), can I assume it will always return the elements in the same order? If not, what is the rationale for the order changing? Is it deterministic, random, or implementation-defined?

And when I run the same Python program repeatedly (not random, not input-dependent), will I get the same ordering for sets?

The underlying question is whether Python set iteration order depends only on the algorithm used to implement sets, or also on the execution context.

A: 

By definition a set is an unordered collection of unique elements ("Unordered collections of unique elements"). You should care only about the interface, not the implementation. If you want an ordered enumeration, you should probably put the elements into a list and sort them.
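A minimal sketch of that suggestion; sorted() already returns a new list, so no separate list() call is needed:

s = {3, 1, 2}
ordered = sorted(s)   # [1, 2, 3] -- a stable, reproducible ordering
print(ordered)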

There are many different implementations of Python. Don't rely on undocumented behaviour, as your code could break on different Python implementations.

Joe
+5  A: 

There's no formal guarantee about the stability of sets (or dicts, for that matter.) However, in the CPython implementation, as long as nothing changes the set, the items will be produced in the same order. Sets are implemented as open-addressing hashtables (with a prime probe), so inserting or removing items can completely change the order (in particular, when that triggers a resize, which reorganizes how the items are laid out in memory.) You can also have two identical sets that nonetheless produce the items in different order, for example:

>>> s1 = {-1, -2}
>>> s2 = {-2, -1}
>>> s1 == s2
True
>>> list(s1), list(s2)
([-1, -2], [-2, -1])

Unless you're very certain you have the same set and nothing touched it in between the two iterations, it's best not to rely on it staying the same. Making seemingly irrelevant changes to, say, functions you call in between could produce very hard-to-find bugs.
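To make the resize point above concrete, here is a hedged sketch (CPython-specific; the exact orders depend on the interpreter version and build). Growing the set forces a table resize, and the table does not shrink when the extra items are removed again, so the surviving elements can come out in a different order:

words = set("alpha bravo charlie delta echo foxtrot golf hotel".split())
order_before = list(words)
extra = set(str(i) for i in range(1000))
words.update(extra)              # forces at least one resize of the hash table
words.difference_update(extra)   # removing items does not shrink the table back
order_after = list(words)
print(sorted(order_before) == sorted(order_after))   # True: same elements
print(order_before == order_after)                   # typically False: different order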

Thomas Wouters
I would say that the stability of dict at least is guaranteed. The docs say: "If items(), keys(), values(), iteritems(), iterkeys(), and itervalues() are called with no intervening modifications to the dictionary, the lists will directly correspond." This implies that calling any of those methods repeatedly will return the same sequence if the dict is not modified. It also says that iter(dict) is a shortcut for dict.iterkeys().
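A quick sketch of that correspondence (the quoted docs use the Python 2 method names; the same holds for Python 3's view objects once wrapped in list()):

d = {"a": 1, "b": 2, "c": 3}
via_items = list(d.items())
via_keys_and_values = list(zip(d.keys(), d.values()))
print(via_items == via_keys_and_values)   # True, as long as d is not modified in between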
Dave Kirby
I said "no *formal* guarantee". The dict docs can change (and such details have indeed changed in the past, not to mention differed between implementations); the "formal" (but rather terse) language specification at http://docs.python.org/ref doesn't mention it either way.
Thomas Wouters
A: 

It’s definitely implementation-defined. The specification of a set says only that

Being an unordered collection, sets do not record element position or order of insertion.

Why not use OrderedDict to create your own OrderedSet class?
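A rough sketch of that idea, assuming collections.OrderedDict is available (Python 2.7+/3.1+); the class name and the small method set here are illustrative, not a standard library API:

from collections import OrderedDict

class OrderedSet(object):
    """A set-like container that iterates in insertion order."""
    def __init__(self, iterable=()):
        # The keys hold the elements; the values are unused placeholders.
        self._data = OrderedDict((item, None) for item in iterable)
    def add(self, item):
        self._data[item] = None
    def discard(self, item):
        self._data.pop(item, None)
    def __contains__(self, item):
        return item in self._data
    def __iter__(self):
        return iter(self._data)
    def __len__(self):
        return len(self._data)

s = OrderedSet([3, 1, 2, 1])
print(list(s))   # [3, 1, 2] -- duplicates dropped, insertion order kept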

jleedev
@jleedev: I'm not saying I will use that behavior, just wondering where the bug seen by another poster could be coming from. Also, there is a very similar property for dicts that *is* guaranteed by the Python documentation (see http://stackoverflow.com/questions/3666237/are-order-of-keys-and-values-in-python-dictionary-guaranteed-to-be-the-same). Why there should be such a difference between sets and dicts is quite surprising.
kriss
A: 

As pointed out, this is strictly an implementation detail.

But as long as you don’t change the structure between calls, there should be no reason for a read-only operation (= iteration) to change with time: no sane implementation does that. Even randomized (= non-deterministic) data structures that can be used to implement sets (e.g. skip lists) don’t change the reading order when no changes occur.

So, being rational, you can safely rely on this behaviour.

(I’m aware that certain GCs may reorder memory in a background thread but even this reordering will not be noticeable on the level of data structures, unless a bug occurs.)

Konrad Rudolph
Being rational, we would also try to capture this assumption in a unit test so the program does not break in mysterious ways at a later date. :)
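A minimal sketch of such a test (hypothetical names; it only pins down the "repeated iteration with no modification in between" assumption):

import unittest

class SetIterationStabilityTest(unittest.TestCase):
    def test_repeated_iteration_is_stable(self):
        s = set(["spam", "eggs", "ham", "bacon"])
        first = list(s)
        # Iterate again without modifying the set in between.
        self.assertEqual(first, list(s))

if __name__ == "__main__":
    unittest.main()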
jleedev
@jleedev: True, but unfortunately I can easily see such a unit test fail to flag the error: if the behaviour is indeed nondeterministic, writing a reliable unit test for it will be incredibly hard. For example, I had a unit test suite for a parallel program that would fail only about once in a hundred runs due to a race condition. In 99% of the cases it would run through, even though it was a *very* thorough test suite.
Konrad Rudolph
+1  A: 

And when I run the same Python program repeatedly (not random, not input-dependent), will I get the same ordering for sets?

I can answer this part of the question now after a quick experiment. Using the following code:

class Foo(object):
    def __init__(self, val):
        self.val = val
    def __repr__(self):
        return str(self.val)

x = set()
for y in range(500):
    x.add(Foo(y))
print list(x)[-10:]

I can trigger the behaviour that I was asking about in the other question. If I run this repeatedly the output changes, but not on every run. It seems to be "weakly random" in that it changes slowly. This is certainly implementation-dependent, so I should say that I'm running the MacPorts Python 2.6 on Snow Leopard. While the program will produce the same answer for long stretches of time, doing something that affects the system entropy pool (writing to the disk mostly works) will sometimes kick it into a different output.

The class Foo is just a simple int wrapper, as experiments show that this doesn't happen with sets of ints. I think the problem is caused by the lack of __eq__ and __hash__ methods on the object, although I would dearly love to know the underlying explanation and ways to avoid it. Also useful would be some way to reproduce/repeat a "bad" run. Does anyone know what seed it uses, or how I could set that seed?
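One way to avoid the between-run variation is to hash on the wrapped value instead of relying on the default behaviour, which is based on the object's memory address. A hedged sketch, redefining the Foo wrapper from above with value-based __eq__ and __hash__:

class Foo(object):
    def __init__(self, val):
        self.val = val
    def __repr__(self):
        return str(self.val)
    def __eq__(self, other):
        return isinstance(other, Foo) and self.val == other.val
    def __ne__(self, other):
        # Python 2 does not derive __ne__ from __eq__ automatically.
        return not self.__eq__(other)
    def __hash__(self):
        return hash(self.val)

x = set(Foo(y) for y in range(500))
# The hashes no longer depend on memory addresses, so the output no longer
# varies between runs -- though the exact order is still an implementation detail.
print(list(x)[-10:])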

Amoss
This is terribly easy to explain: because of the lack of `__eq__` and `__hash__`, your objects hash based on `id()`, and the id for the objects changes between runs. You aren't repeatedly printing `list()[-10:]` of the *same* set, just one that was created the same way.
Thomas Wouters
Why do the default implementations of __eq__ and __hash__ rely on the random module... It seems as if they use id() + something else. If I define methods that use id() explicitly, then the behaviour changes.
Amoss
They don't rely on the `random` module at all. They only use the `id` of the object. The `id` of an object is the memory address, so *between runs* it will differ because of all manner of reasons. If I implement `__hash__` and `__eq__` in terms of `id()`, the behaviour is the same. I'm not sure what you did, but I guess you did something different; you'll have to put up the code to tell. Or, you can stop worrying about the behaviour *between runs*, because there's no kind of guarantee about the order of sets or dicts in that case anyway.
Thomas Wouters
Thanks for the comments. I'll wrap up some test code and ask a separate question.
Amoss