I have written a program in Python which spends a large amount of time looking up attributes of objects and values from dictionary keys. I would like to know if there's any way I can optimize these lookup times, potentially with a C extension, to reduce the time of execution, or if I need to simply re-implement the program in a compiled language.

The program implements some algorithms using a graph. It runs prohibitively slowly on our data sets, so I profiled the code with cProfile using a reduced data set that could actually complete. The vast majority of the time is spent in one function, and specifically in two generator expressions within that function:

The generator expression at line 202 is

    neighbors_in_selected_nodes = (neighbor for neighbor in
            node_neighbors if neighbor in selected_nodes)

and the generator expression at line 204 is

    neighbor_z_scores = (interaction_graph.node[neighbor]['weight'] for
            neighbor in neighbors_in_selected_nodes)

The source code for this function is provided below for context.

selected_nodes is a set of nodes in the interaction_graph, which is a NetworkX Graph instance. node_neighbors is an iterator from Graph.neighbors_iter().

Graph itself uses dictionaries for storing nodes and edges. Its Graph.node attribute is a dictionary which stores nodes and their attributes (e.g., 'weight') in dictionaries belonging to each node.

Each of these lookups should be amortized constant time (i.e., O(1)); however, I am still paying a large penalty for them. Is there some way I can speed up these lookups (e.g., by writing parts of this as a C extension), or do I need to move the program to a compiled language?


Below is the full source code of the function, for context; the vast majority of execution time is spent within it.

def calculate_node_z_prime(
        node,
        interaction_graph,
        selected_nodes
    ):
    """Calculates a z'-score for a given node.

    The z'-score is based on the z-scores (weights) of the neighbors of
    the given node, and proportional to the z-score (weight) of the
    given node. Specifically, we find the maximum z-score of all
    neighbors of the given node that are also members of the given set
    of selected nodes, multiply this z-score by the z-score of the given
    node, and return this value as the z'-score for the given node.

    If the given node has no neighbors in the interaction graph, the
    z'-score is defined as zero.

    Returns the z'-score as zero or a positive floating point value.

    :Parameters:
    - `node`: the node for which to compute the z-prime score
    - `interaction_graph`: graph containing the gene-gene or gene
      product-gene product interactions
    - `selected_nodes`: a `set` of nodes fitting some criterion of
      interest (e.g., annotated with a term of interest)

    """
    node_neighbors = interaction_graph.neighbors_iter(node)
    neighbors_in_selected_nodes = (neighbor for neighbor in
            node_neighbors if neighbor in selected_nodes)
    neighbor_z_scores = (interaction_graph.node[neighbor]['weight'] for
            neighbor in neighbors_in_selected_nodes)
    try:
        max_z_score = max(neighbor_z_scores)
    # max() throws a ValueError if its argument has no elements; in this
    # case, we need to set the max_z_score to zero
    except ValueError, e:
        # Check to make certain max() raised this error
        if 'max()' in e.args[0]:
            max_z_score = 0
        else:
            raise e

    z_prime = interaction_graph.node[node]['weight'] * max_z_score
    return z_prime

Here are the top few calls according to cProfile, sorted by time.

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
156067701  352.313    0.000  642.072    0.000 bpln_contextual.py:204(<genexpr>)
156067701  289.759    0.000  289.759    0.000 bpln_contextual.py:202(<genexpr>)
 13963893  174.047    0.000  816.119    0.000 {max}
 13963885   69.804    0.000  936.754    0.000 bpln_contextual.py:171(calculate_node_z_prime)
  7116883   61.982    0.000   61.982    0.000 {method 'update' of 'set' objects}
A: 

Try directly accessing the dict and catching KeyError; it might be faster, depending on your hit/miss ratio:

# cache this object
ignode = interaction_graph.node
neighbor_z_scores = []
for neighbor in node_neighbors:
    try:
        neighbor_z_scores.append(ignode[neighbor]['weight'])
    except KeyError:
        pass

or with `dict.get()`, a sentinel default, and generator expressions:

sentinel = object()
# cache this object
ignode = interaction_graph.node

# get() with a sentinel default avoids the try/except entirely
neighbor_dicts = (ignode.get(neighbor, sentinel) for neighbor in node_neighbors)
# identity testing against the sentinel is slightly faster than an equality test
neighbor_z_scores = (d['weight'] for d in neighbor_dicts if d is not sentinel)
Lie Ryan
There will be many more neighbors that are not members of `selected_nodes` than are members, so I filter against this criterion first. This means fewer `'weight'` lookups. I can guarantee that all `'weight'` lookups will succeed, so there's no benefit in a try-except clause there. `ignode` is a very good idea, though, and will save many lookups for that attribute.
gotgenes
+1  A: 

How about keeping the iteration order of interaction_graph.neighbors_iter(node) sorted by weight (or partially sorted, using the heapq module)? Since you're just trying to find the max value, you can iterate node_neighbors in descending order: the first node that is also in selected_nodes must have the maximum weight among the selected neighbors.
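
For illustration, here is a minimal sketch of that descending-order idea (the helper name max_selected_weight is made up, and it builds the heap on every call; the suggestion above is to keep the neighbor list maintained in sorted order instead):

import heapq

def max_selected_weight(interaction_graph, node, selected_nodes):
    # Negate the weights so the min-heap pops the largest weight first.
    heap = [(-interaction_graph.node[n]['weight'], n)
            for n in interaction_graph.neighbors_iter(node)]
    heapq.heapify(heap)  # O(n)
    # Pop neighbors in descending weight order; the first one that is also
    # in selected_nodes has the maximum weight among selected neighbors.
    while heap:
        neg_weight, neighbor = heapq.heappop(heap)
        if neighbor in selected_nodes:
            return -neg_weight
    return 0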

Second, how often does selected_nodes change? If it changes rarely, you can save a lot of iterations by keeping a precomputed list of [interaction_graph.node[x] for x in selected_nodes] instead of rebuilding it every time.

EDIT: to reply on the comments

A sort() would take O(n log n)

Not necessarily; you're looking too closely at your textbook. Despite what your textbook says, you can sometimes break the O(n log n) barrier by exploiting the structure of your data. If you keep your list of neighbors in a naturally sorted data structure in the first place (e.g. a heap via heapq, or a binary tree), you don't need to re-sort on every iteration. Of course this is a space-time tradeoff, since you will need to store redundant lists of neighbors, and there is added code complexity to ensure that the lists are updated when the neighbors change.

Also, Python's list.sort(), which uses the timsort algorithm, is very fast for nearly sorted data (it can approach O(n) in such cases). It still doesn't break O(n log n) in general; that much has been proven to be mathematically impossible.

You need to profile before dismissing a solution as not likely to improve performance. When doing extreme optimizations, you will likely find that in certain very special edge cases old and slow bubble sort may win over a glorified quicksort or mergesort.

Lie Ryan
Sorting is an interesting idea. A full `sort()` would take `O(n log n)` (where `n` is the number of neighbors); that would be more expensive than a linear search. According to the `heapq` docs, `heapify()` is `O(n)`, but each pop is `O(log n)`, so in a worst-case scenario (no neighbor in `selected_nodes`) it would still be more work than a simple linear loop through the neighbors.
gotgenes
To address the second point, unfortunately, `selected_nodes` changes with every call to this function, so it's not a candidate for caching. Good thought, though.
gotgenes
+1  A: 

I don't see why your "weight" lookups have to be in the form of `["weight"]` (nodes are dictionaries?) instead of `.weight` (nodes are objects).

If your nodes are objects, and don't have a lot of fields, you can take advantage of the __slots__ directive to optimize their storage:

class Node(object):
    # ... class stuff goes here ...

    __slots__ = ('weight',) # tuple of member names.

EDIT: So I looked at the NetworkX link you provided, and there are several things that bother me. The first is that, right at the top, the definition of "dictionary" is "FIXME".

Overall, it seems insistent on using dictionaries, rather than using classes that can be subclassed, to store attributes. While attribute lookup on an object may be essentially a dictionary lookup, I don't see how working with an object can be worse. If anything, it could be better since an object attribute lookup is more likely to be optimized, because:

  • object attribute lookups are so common,
  • the keyspace for object attributes is far more restricted than for dictionary keys, thus an optimized comparison algorithm can be used in the search, and
  • objects have the __slots__ optimization for exactly these cases, where you have an object with only a couple fields and need optimized access to them.

I frequently use __slots__ on classes that represent coordinates, for example. A tree node would seem, to me, another obvious use.

So that's why when I read:

node
A node can be any hashable Python object except None.

I think, okay, no problem, but then immediately following is

node attribute
Nodes can have arbitrary Python objects assigned as attributes by using keyword/value pairs when adding a node or assigning to the G.node[n] attribute dictionary for the specified node n.

I think, if a node needs attributes, why would they be stored separately? Why not just put them in the node? Is writing a class with contentString and weight members detrimental? Edges seem even crazier, since they're dictated to be tuples rather than objects you could subclass.

So I'm rather lost as to the design decisions behind NetworkX.

If you're stuck with it, I'd recommend moving attributes from those dictionaries into the actual nodes, or if that's not an option, using integers for keys into your attribute dictionary instead of strings, so searches use a much faster comparison algorithm.
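
As a hedged sketch of the first option (the Node class and its fields here are hypothetical; NetworkX only requires nodes to be hashable, so instances like this can serve as graph nodes directly):

class Node(object):
    """A graph node that carries its own z-score as an attribute."""
    __slots__ = ('name', 'weight')

    def __init__(self, name, weight):
        self.name = name
        self.weight = weight

# With nodes like these, the inner lookup becomes a single attribute access
# instead of two dictionary lookups:
#     neighbor_z_scores = (neighbor.weight for neighbor in
#             node_neighbors if neighbor in selected_nodes)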

Finally, what if you combined your generators:

neighbor_z_scores = (interaction_graph.node[neighbor]['weight'] for
        neighbor in node_neighbors if neighbor in selected_nodes)
Mike DeSimone
Good question. The nodes themselves are simply Python strings (IDs) which represent biological entities. The weight attribute of each node is stored within the graph using its node attributes structure: http://networkx.lanl.gov/reference/glossary.html#term-node-attribute Since attribute lookups are essentially dictionary lookups, I'm not sure if there'd be a performance gain switching to a true attribute.
gotgenes
The performance gain would be the avoidance of a string comparison, replacing it with an id comparison... or something like that. I didn't write the language, but I'm pretty certain `foo.bar` makes better bytecode than `{'bar': 123}['bar']`, since the former is a far, far more frequent case.
Mike DeSimone
Mike, intuitively, I believe you, too, but the timing results I just produced in a comparison showed otherwise. http://bitbucket.org/gotgenes/interesting-python-timings/src/ The summary is, for Python 2.6.4, class attribute access < dictionary lookup < `__slots__` access < instance attribute access, in terms of time taken. Dictionary lookups are about 10% quicker than `__slots__`.
gotgenes
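
For reference, a rough sketch of how such a micro-benchmark can be set up with timeit (this is not the linked code; the class and variable names are made up for illustration):

import timeit

setup = """
class SlotsNode(object):
    __slots__ = ('weight',)

class PlainNode(object):
    pass

slots_node = SlotsNode(); slots_node.weight = 1.0
plain_node = PlainNode(); plain_node.weight = 1.0
node_dict = {'weight': 1.0}
"""

for label, stmt in (("dictionary lookup", "node_dict['weight']"),
                    ("__slots__ access", "slots_node.weight"),
                    ("instance attribute", "plain_node.weight")):
    # timeit.Timer compiles the statement once and runs it in a tight loop.
    print "%-20s %.3f" % (label, timeit.Timer(stmt, setup).timeit(number=10 ** 7))
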
A: 

Without looking deeply into your code, try adding a little speed with itertools.

Add these at the module level:

import itertools as it, operator as op
GET_WEIGHT = op.itemgetter('weight')  # node attributes are dicts, so itemgetter, not attrgetter

Change:

neighbors_in_selected_nodes = (neighbor for neighbor in
        node_neighbors if neighbor in selected_nodes)

into:

neighbors_in_selected_nodes = it.ifilter(selected_nodes.__contains__, node_neighbors)

and:

neighbor_z_scores = (interaction_graph.node[neighbor]['weight'] for
        neighbor in neighbors_in_selected_nodes)

into:

neighbor_z_scores = (
    it.imap(
        GET_WEIGHT,
        it.imap(
            interaction_graph.node.__getitem__,
            neighbors_in_selected_nodes)
    )
)

Do these help?

ΤΖΩΤΖΙΟΥ