ansaurus

Question

deepcopy and python - tips to avoid using it?

Answer 1

+5 A:

Okay, simplest things first:

deepcopy is slow in general since it has to do a lot of internal bookkeeping to copy pathological cases like objects containing themselves in a sane way. See, for instance, this page, or take a look at the source code of deepcopy in copy.py that is somewhere in your Python path.
sorted is fast, since it is implemented in C. Much faster than an equivalent sort in Python.
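
To make the "objects containing themselves" case concrete, here is a minimal sketch (the names nested and clone are made up for illustration). It only shows what deepcopy has to cope with, not how fast or slow it is:

```python
import copy

# A "pathological" object: a list that contains itself.
nested = [1, 2]
nested.append(nested)

clone = copy.deepcopy(nested)

# The clone is a distinct object, and its self-reference points at the
# clone rather than at the original. deepcopy keeps a memo of objects it
# has already copied to get this right, which is part of the bookkeeping.
print(clone is nested)    # False
print(clone[2] is clone)  # True
print(clone[:2])          # [1, 2]
```

That memo lookup happens for every object visited, even when nothing is self-referential.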

Now, some more things about Python's reference semantics, as you asked in your comment. In Python, variables are references. When you say a=1, think of 1 as an object existing on its own, with a being just a tag attached to it. In some other languages like C, variables are containers (not tags), and when you do a=1, you actually put 1 into a. This does not hold for Python, where variables are references. This has some interesting consequences that you may have also stumbled upon:

>>> a = []      # construct a new list, attach a tag named "a" to it
>>> b = a       # attach a tag named "b" to the object which is tagged by "a"
>>> a.append(1) # append 1 to the list tagged by "a"
>>> print(b)    # print the list tagged by "b"
[1]

This behaviour is seen because lists are mutable objects: you can modify a list after you have created it, and the modification is seen when accessing the list through any of the variables that refer to it. The immutable equivalents of lists are tuples:
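
If the goal is just an independent list, a shallow copy is often enough and much cheaper than deepcopy. The following is a small sketch with made-up names (alias, snapshot); it assumes the elements themselves are immutable, because a shallow copy still shares any nested mutable objects:

```python
a = [1, 2, 3]
alias = a           # a second tag on the same list
snapshot = list(a)  # a new list holding the same elements (shallow copy)

a.append(4)

print(alias)     # [1, 2, 3, 4]
print(snapshot)  # [1, 2, 3]
```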

>>> a = ()      # construct a new tuple, attach a tag named "a" to it
>>> b = a       # now "b" refers to the same empty tuple as "a"
>>> a += (1, 2) # "append" some elements: this builds a new tuple and rebinds "a"
>>> print(b)
()

Here, a += (1, 2) creates a new tuple from the existing tuple referred to by a, plus another tuple (1, 2) that is constructed on-the-fly, and a is adjusted to point to the new tuple, while of course b still refers to the old tuple. The same happens with simple numeric additions like a = a+2: in this case, the number originally pointed to by a is not mutated in any way; Python "constructs" a new number and moves a to point to the new number.

So, in a nutshell: numbers, strings and tuples are immutable; lists, dicts and sets are mutable. User-defined classes are in general mutable unless you ensure explicitly that the internal state cannot be mutated. And there's frozenset, which is an immutable set. Plus many others of course :)
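
The difference is easiest to see by running the same += on a list and on a tuple and then checking object identity with is. This is only an illustrative sketch; the variable names are invented:

```python
items = [1]
items_alias = items
items += [2]   # lists are extended in place: still the same object

pair = (1,)
pair_alias = pair
pair += (2,)   # tuples cannot change: a new tuple is built and "pair" is rebound

print(items_alias)           # [1, 2]
print(pair_alias)            # (1,)
print(items is items_alias)  # True
print(pair is pair_alias)    # False
```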

I don't know why your original code didn't work, but probably you hit a behaviour related to the code snippet I've shown with the lists as your PointDistance class is also mutable by default. An alternative could be the namedtuple class from collections, which constructs a tuple-like object whose fields can also be accessed by names. For instance, you could have done this:

from collections import namedtuple
PointDistance = namedtuple("PointDistance", "point distance")

This creates a PointDistance class for you that has two named fields: point and distance. In your main for loop, you can construct a new PointDistance with the appropriate values for each point. Since the point objects pointed to by the point fields won't be modified during the course of your for loop, and distance is a number (which is, by definition, immutable), you should be safe doing it this way. But yes, in general, it seems like simply using sorted is faster since sorted is implemented in C. You might also be lucky with the heapq module, which implements a heap data structure backed by an ordinary Python list, so it lets you find the top k elements easily without having to sort the others. However, since heapq is implemented at least partly in Python, chances are that sorted works better unless you have a whole lot of points.
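
Putting these pieces together might look roughly like the sketch below. The point data, the query point, k, and the Euclidean distance are all assumptions made for the example (your original code is not shown here), and "top k" is taken to mean the k smallest distances:

```python
import heapq
import math
from collections import namedtuple
from operator import attrgetter

PointDistance = namedtuple("PointDistance", "point distance")

# Hypothetical data: 2-D points as tuples, plus a query point.
points = [(0, 0), (3, 4), (1, 1), (6, 8), (2, 0)]
query = (0, 0)

# Build a fresh immutable record per point instead of mutating one
# object and deep-copying it.
records = [
    PointDistance(p, math.hypot(p[0] - query[0], p[1] - query[1]))
    for p in points
]

k = 3
by_sort = sorted(records, key=attrgetter("distance"))[:k]
by_heap = heapq.nsmallest(k, records, key=attrgetter("distance"))

print([r.point for r in by_sort])  # [(0, 0), (1, 1), (2, 0)]
print(by_sort == by_heap)          # True
```

Which of the two is faster depends on the number of points and on k, so it is worth timing both on your real data.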

Finally, I'd like to add that I have never had to use deepcopy so far, so I guess there are ways to avoid it in most cases.

Tamás 2010-06-15 09:26:44

many thanks for taking the time to write such a detailed, clear and concise explanation. i feel like i have much better understanding of the 'why' of things as a result.

blackkettle 2010-06-15 09:39:27
