views:

300

answers:

6

I have the following code in Python:

def point_to_index(point):
    if point not in points:
        points.append(point)
    return points.index(point)

This code is awfully inefficient, especially since I expect points to grow to hold a few million elements.

If the point isn't in the list, I traverse the list 3 times:

  1. look for it and decide it isn't there
  2. go to the end of the list and add a new element
  3. go to the end of the list until I find the index

If it is in the list, I traverse it twice: 1. look for it and decide it is there 2. go almost to the end of the list until I find the index

Is there any more efficient way to do this? For instance, I know that:

  • I'm more likely to call this function with a point that isn't in the list.
  • If the point is in the list, it's likelier to be near the end than in the beginning.

So if I could have the line:

if point not in points:

search the list from the end to the beginning it would improve performance when the point is already in the list.

However, I don't want to do:

if point not in reversed(points):

because I imagine that reversed(points) itself will come at a huge cost.

Nor do I want to add new points to the beginning of the list (assuming I knew how to do that in Python) because that would change the indices, which must remain constant for the algorithm to work.

The only improvement I can think of is to implement the function with only one pass, if possible from the end to the beginning. The bottom line is:

  • Is there a good way to do this?
  • Is there a better way to optimize the function?

Edit: I've gotten suggestions for implementing this with only one pass. Is there any way for index() to go from the end to the beginning?

Edit: People have asked why the index is critical. I'm trying to describe a 3D surface using the OFF file format. This format describes a surface using its vertices and faces. First the vertices are listed, and the faces are described using a list of indices of vertices. That's why once I add a vortex to the list, its index must not change.

Edit: There have been some suggestions (such as igor's) to use a dict. This is a good solution for scanning the list. However, when I'm done I need to print out the list in the same order it was created. If I use a dict, I need to print out its keys sorted by value. Is there a good way to do that?

Edit: I implemented www.brool.com's suggestion. This was the simplest and fastest. It is essentially an ordered Dict, but without the overhead. The performance is great!

+2  A: 
def point_to_index(point):
    try:
        return points.index(point)
    except:
        points.append(point)
        return len(points)-1

Update: Added in Nathan's exception code.

Evan Fosmark
Doesn't this do just the same for the case when the point doesn't exist? How about this in the exception case? `return len(points) - 1`? Is len() O(1) in Python?
Nathan Fellman
@Nathan - that's what I did. See my answer. Yes, len is O(1) for lists.
Triptych
This is a **huge** improvement! Before I ran for an hour and called this about 250000 times. Now it's been running for half an hour and the function has been called 350000 times, with only this change!
Nathan Fellman
+4  A: 

You want to use a set:

>>> x = set()
>>> x
set([])
>>> x.add(1)
>>> x
set([1])
>>> x.add(1)
>>> x
set([1])

A set contains only one instance of any item you add, and it will be a lot more efficient than iterating a list manually.

This wikibooks page looks like a good primer if you haven't used sets in Python before.

Mark Rushakoff
What about needing constant indices, as the OP said? No order in sets.
Triptych
If you ask the OP why he needs the index, you'll likely find it is implementation-dependent. I'm betting that a set is the correct answer, notwithstanding what the OP has asked for.
hughdbrown
@hugh wat? Just because he needs to test for membership doesn't mean a set is automatically the right data structure.
Triptych
@Triptych You're right in that it is not automatically the right answer, but the OP said in a comment he's making "[at least] 350,000" calls for membership, which is what sets and dictionaries excel at. We really can not know what the optimal answer is for the OP's actual problem, though, without further information on why, exactly, the lists and their indices are critical, so in his limited problem, solutions here will work. Like hughdbrown, I suspect with more information, a solution involving sets would perform better, but for right now, it's speculation.
gotgenes
+9  A: 

This will traverse at most once:

def point_to_index(point):
    try: 
        return points.index(point)
    except ValueError:
        points.append(point)
        return len(points)-1

You may also want to try this version, which takes into account that matches are likely to be near the end of the list. Note that reversed() has almost no cost even on very large lists - it does not create a copy and does not traverse the list more than once.

def point_to_index(point):
    for index, this_point in enumerate(reversed(points)):
        if point == this_point:
            return len(points) - (index+1)
    else:
        points.append(point)
        return len(points)-1

You might also consider keeping a parallel dict or set of points to check for membership, since both of those types can do membership tests in O(1). There would be, of course, a substantial memory cost.

Obviously, if the points were ordered somehow, you would have many other options for speeding this code up, notably using a binary search for membership tests.

Triptych
That's a good idea, except that I'm running into memory issues, so I don't want to add another copy. Does a `dict` have a concept of order?
Nathan Fellman
Also, is append() O(1)?
Nathan Fellman
Yes, both append() and len() are O(1) for lists.
Triptych
And no, no concept of order in `dict`s
Triptych
Unless you use an OrderedDict.
Evan Fosmark
@Evan I considered that, but an ordered dict would of course keep an ordered copy of the points, which in this case seems to incur an unacceptable memory cost.
Triptych
Your suggestion for using reversed() is good, but it has the disadvantage of reversing the indeces as well. However, now that I think about it, if there is any way to prepend an element to the beginning of a list instead of appending it to the end, I could return `len(points) - points.index(point)`
Nathan Fellman
Of course you are right about reversing the indices. I changed my code accordingly. Prepending is a bad idea - it's an O(n) operation versus O(1) for appending.
Triptych
I have to mention many respected programmers consider the use exception handling for logical flow of a program poor practice. Exception handling should be reserved for handling abnormal conditions, not normal ones. I understand it gives the desired behavior here, but try not to make it a habit.
gotgenes
@gotgenes - list.index calls it an error, so I treat it like one. My guess is that the language designers wanted to prevent people from truth testing the result of list.index, and chose to use an exception in this case
Triptych
@Triptych @gotgenes I have to agree with Triptych. If list.index returned, say, -1 or None or False when it failed, there would be no need for a try-except.
Corey D
+1  A: 

As others said, consider using set or dict. You don't explain why you need the indices. If they are needed only to assign unique ids to the points (and I can't easily come up with another reason for using them), then dict will indeed work much better, e.g.,

points = {}
def point_to_index(point):
    if point in points:
        return points[point]
    else:
       points[point] = len(points)
       return len(points) - 1
igor
Is there any way for me to get the list of points sorted by index?
Nathan Fellman
+3  A: 

If you're worried about memory usage, but want to optimize the common case, keep a dictionary with the last n points and their indexes. points_dict = dictionary, max_cache = size of the cache.

def point_to_index(point):
    try:
        return points_dict.get(point, points.index(point))
    except:
        if len(points) >= max_cache:
            del points_dict[points[len(points)-max_cache]]
        points.append(point)
        points_dict[points] = len(points)-1
        return len(points)-1
brool
This is essentially an ordered dict, like tonfa suggests below: http://stackoverflow.com/questions/1319254/what-is-the-most-efficient-way-to-add-an-element-to-a-list-only-if-isnt-there-ye/1321086#1321086
Nathan Fellman
Good god! This is so much faster than using only a list!
Nathan Fellman
Using a fixed-sized cache was a pretty good idea. Kudos.
Triptych
+1  A: 

What you really want is an ordered dict (key insertion determines the order):

tonfa