I have written a program in Python which spends a large amount of time looking up attributes of objects and values from dictionary keys. I would like to know if there's any way I can optimize these lookup times, potentially with a C extension, to reduce the time of execution, or if I need to simply re-implement the program in a compiled language.

The program implements some algorithms using a graph. It runs prohibitively slowly on our data sets, so I profiled the code with cProfile using a reduced data set that could actually complete. The vast majority of the time is spent in one function, and specifically in two generator expressions within that function:

The generator expression at line 202 is

    neighbors_in_selected_nodes = (neighbor for neighbor in
            node_neighbors if neighbor in selected_nodes)

and the generator expression at line 204 is

    neighbor_z_scores = (interaction_graph.node[neighbor]['weight'] for
            neighbor in neighbors_in_selected_nodes)

The source code for this function is provided below for context.

selected_nodes is a set of nodes in the interaction_graph, which is a NetworkX Graph instance. node_neighbors is an iterator from Graph.neighbors_iter().

Graph itself uses dictionaries for storing nodes and edges. Its Graph.node attribute is a dictionary which stores nodes and their attributes (e.g., 'weight') in dictionaries belonging to each node.

Each of these lookups should be amortized constant time (i.e., O(1)); however, I am still paying a large penalty for them. Is there some way I can speed up these lookups (e.g., by writing parts of this as a C extension), or do I need to move the program to a compiled language?


Below is the full source code of the function, for context; the vast majority of execution time is spent within it.

def calculate_node_z_prime(
        node,
        interaction_graph,
        selected_nodes
    ):
    """Calculates a z'-score for a given node.

    The z'-score is based on the z-scores (weights) of the neighbors of
    the given node, and proportional to the z-score (weight) of the
    given node. Specifically, we find the maximum z-score of all
    neighbors of the given node that are also members of the given set
    of selected nodes, multiply this z-score by the z-score of the given
    node, and return this value as the z'-score for the given node.

    If the given node has no neighbors in the interaction graph, the
    z'-score is defined as zero.

    Returns the z'-score as zero or a positive floating point value.

    :Parameters:
    - `node`: the node for which to compute the z-prime score
    - `interaction_graph`: graph containing the gene-gene or gene
      product-gene product interactions
    - `selected_nodes`: a `set` of nodes fitting some criterion of
      interest (e.g., annotated with a term of interest)

    """
    node_neighbors = interaction_graph.neighbors_iter(node)
    neighbors_in_selected_nodes = (neighbor for neighbor in
            node_neighbors if neighbor in selected_nodes)
    neighbor_z_scores = (interaction_graph.node[neighbor]['weight'] for
            neighbor in neighbors_in_selected_nodes)
    try:
        max_z_score = max(neighbor_z_scores)
    # max() throws a ValueError if its argument has no elements; in this
    # case, we need to set the max_z_score to zero
    except ValueError, e:
        # Check to make certain max() raised this error
        if 'max()' in e.args[0]:
            max_z_score = 0
        else:
            raise e

    z_prime = interaction_graph.node[node]['weight'] * max_z_score
    return z_prime

Here are the top few calls according to cProfile, sorted by time.

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
156067701  352.313    0.000  642.072    0.000 bpln_contextual.py:204(<genexpr>)
156067701  289.759    0.000  289.759    0.000 bpln_contextual.py:202(<genexpr>)
 13963893  174.047    0.000  816.119    0.000 {max}
 13963885   69.804    0.000  936.754    0.000 bpln_contextual.py:171(calculate_node_z_prime)
  7116883   61.982    0.000   61.982    0.000 {method 'update' of 'set' objects}
A: 

Try directly accessing the dict and catching KeyError; it might be faster, depending on your hit/miss ratio:

# cache this object
ignode = interaction_graph.node
neighbor_z_scores = []
for neighbor in node_neighbors:
    try:
        neighbor_z_scores.append(ignode[neighbor]['weight'])
    except KeyError:
        pass

or with `dict.get()`, a sentinel default, and generator expressions:

sentinel = object()
# cache this object
ignode = interaction_graph.node

# get() with a sentinel default avoids the try/except entirely
neighbor_dicts = (ignode.get(neighbor, sentinel) for neighbor in node_neighbors)
# identity testing against the sentinel is slightly faster than an equality test
neighbor_z_scores = (d['weight'] for d in neighbor_dicts if d is not sentinel)
Lie Ryan
There will be many more neighbors that are not members of `selected_nodes` than are members, so I filter against this criterion first. This means fewer `'weight'` lookups. I can guarantee that all `'weight'` lookups will succeed, so there's no benefit in a try-except clause there. `ignode` is a very good idea, though, and will save many lookups for that attribute.
gotgenes
+1  A: 

How about keeping the iteration order of interaction_graph.neighbors_iter(node) sorted by weight (or partially sorted, using the heapq module)? Since you're just trying to find the max value, you can iterate node_neighbors in descending order: the first node that is also in selected_nodes must have the maximum weight among the selected neighbors.
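
For illustration, here is a minimal sketch of that descending-order idea (the helper name max_selected_weight is made up, and it builds the heap on every call; the suggestion above is to keep the neighbor list maintained in sorted order instead):

import heapq

def max_selected_weight(interaction_graph, node, selected_nodes):
    # Negate the weights so the min-heap pops the largest weight first.
    heap = [(-interaction_graph.node[n]['weight'], n)
            for n in interaction_graph.neighbors_iter(node)]
    heapq.heapify(heap)  # O(n)
    # Pop neighbors in descending weight order; the first one that is also
    # in selected_nodes has the maximum weight among selected neighbors.
    while heap:
        neg_weight, neighbor = heapq.heappop(heap)
        if neighbor in selected_nodes:
            return -neg_weight
    return 0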

Second, how often does selected_nodes change? If it changes rarely, you can save a lot of iterations by keeping a precomputed list of [interaction_graph.node[x] for x in selected_nodes] instead of rebuilding it every time.

EDIT: to reply on the comments

A sort() would take O(n log n)

Not necessarily; you're looking too closely at your textbook. Despite what your textbook says, you can sometimes break the O(n log n) barrier by exploiting the structure of your data. If you keep your list of neighbors in a naturally sorted data structure in the first place (e.g. a heap via heapq, or a binary tree), you don't need to re-sort on every iteration. Of course this is a space-time tradeoff, since you will need to store redundant lists of neighbors, and there is added code complexity to ensure that the lists are updated when the neighbors change.

Also, Python's list.sort(), which uses the timsort algorithm, is very fast for nearly sorted data (it can approach O(n) in such cases). It still doesn't break O(n log n) in general; that much has been proven to be mathematically impossible.

You need to profile before dismissing a solution as not likely to improve performance. When doing extreme optimizations, you will likely find that in certain very special edge cases old and slow bubble sort may win over a glorified quicksort or mergesort.

Lie Ryan
Sorting is an interesting idea. A full `sort()` would take `O(n log n)` (where `n` is the number of neighbors); that would be more expensive than a linear search. According to the `heapq` docs, `heapify()` is `O(n)`, but each pop is `O(log n)`, so in a worst-case scenario (no neighbor in `selected_nodes`) it would still be more work than a simple linear loop through the neighbors.
gotgenes
To address the second point, unfortunately, `selected_nodes` changes with every call to this function, so it's not a candidate for caching. Good thought, though.
gotgenes
+1  A: 

I don't see why your "weight" lookups have to be in the form of `["weight"]` (nodes are dictionaries?) instead of `.weight` (nodes are objects).

If your nodes are objects, and don't have a lot of fields, you can take advantage of the __slots__ directive to optimize their storage:

class Node(object):
    # ... class stuff goes here ...

    __slots__ = ('weight',) # tuple of member names.

EDIT: So I looked at the NetworkX link you provided, and there are several things that bother me. The first is that, right at the top, the definition of "dictionary" is "FIXME".

Overall, it seems insistent on using dictionaries, rather than using classes that can be subclassed, to store attributes. While attribute lookup on an object may be essentially a dictionary lookup, I don't see how working with an object can be worse. If anything, it could be better since an object attribute lookup is more likely to be optimized, because:

  • object attribute lookups are so common,
  • the keyspace for object attributes is far more restricted than for dictionary keys, thus an optimized comparison algorithm can be used in the search, and
  • objects have the __slots__ optimization for exactly these cases, where you have an object with only a couple fields and need optimized access to them.

I frequently use __slots__ on classes that represent coordinates, for example. A tree node would seem, to me, another obvious use.

So that's why when I read:

node
A node can be any hashable Python object except None.

I think, okay, no problem, but then immediately following is

node attribute
Nodes can have arbitrary Python objects assigned as attributes by using keyword/value pairs when adding a node or assigning to the G.node[n] attribute dictionary for the specified node n.

I think, if a node needs attributes, why would they be stored separately? Why not just put them in the node? Is writing a class with contentString and weight members detrimental? Edges seem even crazier, since they're dictated to be tuples rather than objects you could subclass.

So I'm rather lost as to the design decisions behind NetworkX.

If you're stuck with it, I'd recommend moving attributes from those dictionaries into the actual nodes, or if that's not an option, using integers for keys into your attribute dictionary instead of strings, so searches use a much faster comparison algorithm.
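
As a hedged sketch of the first option (the Node class and its fields here are hypothetical; NetworkX only requires nodes to be hashable, so instances like this can serve as graph nodes directly):

class Node(object):
    """A graph node that carries its own z-score as an attribute."""
    __slots__ = ('name', 'weight')

    def __init__(self, name, weight):
        self.name = name
        self.weight = weight

# With nodes like these, the inner lookup becomes a single attribute access
# instead of two dictionary lookups:
#     neighbor_z_scores = (neighbor.weight for neighbor in
#             node_neighbors if neighbor in selected_nodes)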

Finally, what if you combined your generators:

neighbor_z_scores = (interaction_graph.node[neighbor]['weight'] for
        neighbor in node_neighbors if neighbor in selected_nodes)
Mike DeSimone
Good question. The nodes themselves are simply Python strings (IDs) which represent biological entities. The weight attribute of each node is stored within the graph using its node attributes structure: http://networkx.lanl.gov/reference/glossary.html#term-node-attribute Since attribute lookups are essentially dictionary lookups, I'm not sure if there'd be a performance gain switching to a true attribute.
gotgenes
The performance gain would be the avoidance of a string comparison, replacing it with an id comparison... or something like that. I didn't write the language, but I'm pretty certain `foo.bar` makes better bytecode than `{'bar': 123}['bar']`, since the former is a far, far more frequent case.
Mike DeSimone
Mike, intuitively, I believe you, too, but the timing results I just produced in a comparison showed otherwise. http://bitbucket.org/gotgenes/interesting-python-timings/src/ The summary is, for Python 2.6.4, class attribute access < dictionary lookup < `__slots__` access < instance attribute access, in terms of time taken. Dictionary lookups are about 10% quicker than `__slots__`.
gotgenes
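
For reference, a rough sketch of how such a micro-benchmark can be set up with timeit (this is not the linked code; the class and variable names are made up for illustration):

import timeit

setup = """
class SlotsNode(object):
    __slots__ = ('weight',)

class PlainNode(object):
    pass

slots_node = SlotsNode(); slots_node.weight = 1.0
plain_node = PlainNode(); plain_node.weight = 1.0
node_dict = {'weight': 1.0}
"""

for label, stmt in (("dictionary lookup", "node_dict['weight']"),
                    ("__slots__ access", "slots_node.weight"),
                    ("instance attribute", "plain_node.weight")):
    # timeit.Timer compiles the statement once and runs it in a tight loop.
    print "%-20s %.3f" % (label, timeit.Timer(stmt, setup).timeit(number=10 ** 7))
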
A: 

Without looking deeply into your code, try adding a little speed with itertools.

Add these at the module level:

import itertools as it, operator as op
GET_WEIGHT = op.itemgetter('weight')  # node attributes are dicts, so itemgetter, not attrgetter

Change:

neighbors_in_selected_nodes = (neighbor for neighbor in
        node_neighbors if neighbor in selected_nodes)

into:

neighbors_in_selected_nodes = it.ifilter(selected_nodes.__contains__, node_neighbors)

and:

neighbor_z_scores = (interaction_graph.node[neighbor]['weight'] for
        neighbor in neighbors_in_selected_nodes)

into:

neighbor_z_scores = (
    it.imap(
        GET_WEIGHT,
        it.imap(
            interaction_graph.node.__getitem__,
            neighbors_in_selected_nodes)
    )
)

Do these help?

ΤΖΩΤΖΙΟΥ