views:

87

answers:

4

Why does the iteration order of a Python set (with the same contents) vary from run to run, and what are my options for making it consistent from run to run?

I understand that the iteration order for a Python set is arbitrary. If I put 'a', 'b', and 'c' into a set and then iterate them, they may come back out in any order.

What I've observed is that the order remains the same within a run of the program. That is, if my program iterates the same set twice in a row, I get the same order both times. However, if I run the program twice in a row, the order changes from run to run.

Unfortunately, this breaks one of my automated tests, which simply compares the output from two runs of my program. I don't care about the actual order, but I would like it to be consistent from run to run.

The best solution I've come up with is:

  1. Copy the set to a list.
  2. Apply an arbitrary sort to the list.
  3. Iterate the list instead of the set.

Is there a simpler solution?

Note: I've found similar questions on StackOverlow, but none that address this specific issue of getting the same results from run to run.

+4  A: 

What you want isn't possible. Arbitrary means arbitrary.

My solution would be the same as yours, you have to sort the set if you want to be able to compare it to another one.

Turtle
I guess I assumed arbitrary meant that it would depend on the contents, not on the phase of the moon.
Adrian McCarthy
Well, there's arbitrary, then there's non-deterministic. There's probably a way you can determine what the order in the set will be, but I'd wager that's more trouble than it's worth. Check for an ordered-set, or similar in python...
JoshD
Even if it were consistent from run to run, there would be no guarantee it would be consistent from machine to machine, python version to python version, cpython versus jython etc.
Mike Axiak
And 'the same contents' is no guarantee either, even in the same Python build on the same machine. Items are inserted based on hash value. When multiple items have the same hash value, they get inserted in different places based on the order they are inserted in. Deletions of items can cause more different orderings. And then there's items whose hash value depends on their memory location, which makes it different between runs. There's not much you can do, other than use `sorted()` for a convenient way to write the 3 steps.
Thomas Wouters
I've marked this as the answer, but I'm still curious about what changes from run to run to cause this behavior. Does Python use a PRNG to generate the constants for a hashing function? I'm talking about the same program, with the same inputs, run twice in a row, on the same machine with the same Python interpretter.
Adrian McCarthy
Not certain, but I'd guess that at some point things are getting hashed by address (i.e. by id()), and some asynchronous thing in the system is perturbing the memory manager in different ways from run to run. I would not expect cpython to involve a PRNG in hashing at all.
Russell Borogove
A: 

Contrary to sets, lists have always a guaranteed order, so you could toss the set and use the list.

knitti
Yup, but I'm creating the sets using several set operations, (unions, intersections, etc.). Those are less efficient as lists.
Adrian McCarthy
+1  A: 

The set's iteration order depends not only its contents, but on the order in which the items were inserted into the set, and whether there were deletions along the way. So you can create two different sets, using different insertions and deletions, and end up with the same set at the end, but with different iteration orders.

As others have said: if you care about the order of the set, you have to create a sorted list from it.

Ned Batchelder
Running my program twice in a row with the same input, involves the same sequence of insertions, deletions, and set operations, yet the iteration order still changes. It's as though there's something more involved, like the time of day, process ID, or something else that varies from run to run.
Adrian McCarthy
Thomas Wouters points out in his comment above that some classes use id() in the hash function, meaning the object's hash depends on its memory address, and who knows what might make that differ. If you are using your own classes, you can write your own __hash__ function to get rid of some of that indeterminancy, but you're probably better off simply sorting the results anyway.
Ned Batchelder
+2  A: 

Use the symmetric_difference (^) operator on your two sets to see if there are any differences:

In [1]: s1 = set([5,7,8,2,1,9,0])
In [2]: s2 = set([9,0,5,1,8,2,7])
In [3]: s1
Out[3]: set([0, 1, 2, 5, 7, 8, 9])
In [4]: s2
Out[4]: set([0, 1, 2, 5, 7, 8, 9])
In [5]: s1 ^ s2
Out[5]: set()
Brian C. Lane