views:

98

answers:

1

Hi,

I have a very simple python routine that involves cycling through a list of roughly 20,000 latitude,longitude coordinates and calculating the distance of each point to a reference point.

def compute_nearest_points( lat, lon, nPoints=5 ):
    """Find the nearest N points, given the input coordinates."""

    points = session.query(PointIndex).all()
    oldNearest = []
    newNearest = []
    for n in xrange(nPoints):
        oldNearest.append(PointDistance(None,None,None,99999.0,99999.0))
        newNearest.append(obj2)

    #This is almost certainly an inappropriate use of deepcopy
    #  but how SHOULD I be doing this?!?!
    for point in points:
        distance = compute_spherical_law_of_cosines( lat, lon, point.avg_lat, point.avg_lon )
        k = 0
        for p in oldNearest:
            if distance < p.distance:
                newNearest[k] = PointDistance(
                    point.point, point.kana, point.english, point.avg_lat, point.avg_lon, distance=distance
                    )
                break
            else:
                newNearest[k] = deepcopy(oldNearest[k])
            k += 1
        for j in range(k,nPoints-1):
            newNearest[j+1] = deepcopy(oldNearest[j])
        oldNearest = deepcopy(newNearest)

    #We're done, now print the result
    for point in oldNearest:
        print point.station, point.english, point.distance

    return

I initially wrote this in C, using the exact same approach, and it works fine there, and is basically instantaneous for nPoints<=100. So I decided to port it to python because I wanted to use SqlAlchemy to do some other stuff.

I first ported it without the deepcopy statements that now pepper the method, and this caused the results to be 'odd', or partially incorrect, because some of the points were just getting copied as references(I guess? I think?) -- but it was still pretty nearly as fast as the C version.

Now with the deepcopy calls added, the routine does it's job correctly, but it has incurred an extreme performance penalty, and now takes several seconds to do the same job.

This seems like a pretty common job, but I'm clearly not doing it the pythonic way. How should I be doing this so that I still get the correct results but don't have to include deepcopy everywhere?

EDIT:
I've hit on a much simpler and faster solution,

def compute_nearest_points2( lat, lon, nPoints=5 ):
    """Find the nearest N points, given the input coordinates."""

    points = session.query(PointIndex).all()
    nearest = []

    for point in points:
        distance = compute_spherical_law_of_cosines( lat, lon, point.avg_lat, point.avg_lon )
        nearest.append( 
            PointDistance(
                point.point, point.kana, point.english, point.avg_lat, point.avg_lon, distance=distance
                )
            )

    nearest_points = sorted(nearest, key=lambda point: point.distance)[:nPoints]     
    for item in nearest_points:
        print item.point, item.english, item.distance
    return

So basically I'm just making a complete copy of the input and appending a new value - the distance from the reference point. Then I just apply 'sorted' to the resulting list, specifying that the sort key should be the distance property of the PointDistance object.

This is much faster than using deepcopy although I confess I don't really understand why. I guess it is down to the efficient C implementations python's "sorted"?

+5  A: 

Okay, simplest things first:

  1. deepcopy is slow in general since it has to do a lot of internal bookkeeping to copy pathological cases like objects containing themselves in a sane way. See, for instance, this page, or take a look at the source code of deepcopy in copy.py that is somewhere in your Python path.

  2. sorted is fast, since it is implemented in C. Much faster than an equivalent sort in Python.

Now, some more thing about Python's reference counting behaviour as you asked in your comment. In Python, variables are references. When you say a=1, think about it has having 1 as an object existing on its own, and a is just a tag attached to it. In some other languages like C, variables are containers (not tags), and when you do a=1, you actually put 1 into a. This does not hold for Python, where variables are references. This has some interesting consequences that you may have also stumbled upon:

>>> a = []      # construct a new list, attach a tag named "a" to it
>>> b = a       # attach a tag named "b" to the object which is tagged by "a"
>>> a.append(1) # append 1 to the list tagged by "a"
>>> print b     # print the list tagged by "b"
[1]

This behaviour is seen because lists are mutable objects: you can modify a list after you have created it, and the modification is seen when accessing the list through any of the variables that refer to it. The immutable equivalents of lists are tuples:

>>> a = ()      # construct a new tuple, attach a tag named "a" to it
>>> b = a       # now "b" refers to the same empty tuple as "a"
>>> a += (1, 2) # appending some elements to the tuple
>>> print b
()

Here, a += (1, 2) creates a new tuple from the existing tuple referred to by a, plus another tuple (1, 2) that is constructed on-the-fly, and a is adjusted to point to the new tuple, while of course b still refers to the old tuple. The same happens with simple numeric additions like a = a+2: in this case, the number originally pointed to by a is not mutated in any way, Python "constructs" a new number and moves a to point to the new number. So, in a nutshell: numbers, strings and tuples are immutable; lists, dicts and sets are mutable. User-defined classes are in general mutable unless you ensure explicitly that the internal state cannot be mutated. And there's frozenset, which is an immutable set. Plus many others of course :)

I don't know why your original code didn't work, but probably you hit a behaviour related to the code snippet I've shown with the lists as your PointDistance class is also mutable by default. An alternative could be the namedtuple class from collections, which constructs a tuple-like object whose fields can also be accessed by names. For instance, you could have done this:

from collections import namedtuple
PointDistance = namedtuple("PointDistance", "point distance")

This creates a PointDistance class for you that has two named fields: point and distance. In your main for loop, you can assign these appropriately. Since the point objects pointed to by the point fields won't be modified during the course of your for loop, and distance is a number (which is, by definition, immutable), you should be safe doing this way. But yes, in general, it seems like simply using sorted is faster since sorted is implemented in C. You might also be lucky with the heapq module, which implements a heap data structure backed by an ordinary Python list, therefore it lets you find the top k elements easily without having to sort the others. However, since heapq is also implemented in Python, chances are that sorted works better unless you have a whole lot of points.

Finally, I'd like to add that I never had to use deepcopy so far, so I guess there are ways to avoid it in most cases.

Tamás
many thanks for taking the time to write such a detailed, clear and concise explanation. i feel like i have much better understanding of the 'why' of things as a result.
blackkettle