I have written a program in Python which spends a large amount of time looking up attributes of objects and values from dictionary keys. I would like to know if there's any way I can optimize these lookup times, potentially with a C extension, to reduce the time of execution, or if I need to simply re-implement the program in a compiled language.
The program implements some algorithms using a graph. It runs prohibitively slowly on our data sets, so I profiled the code with cProfile
using a reduced data set that could actually complete. The vast majority of the time is being burned in one function, and specifically in two statements, generator expressions, within the function:
The generator expression at line 202 is
neighbors_in_selected_nodes = (neighbor for neighbor in
node_neighbors if neighbor in selected_nodes)
and the generator expression at line 204 is
neighbor_z_scores = (interaction_graph.node[neighbor]['weight'] for
neighbor in neighbors_in_selected_nodes)
The source code for this function of context provided below.
selected_nodes
is a set
of nodes in the interaction_graph
, which is a NetworkX Graph
instance. node_neighbors
is an iterator from Graph.neighbors_iter()
.
Graph
itself uses dictionaries for storing nodes and edges. Its Graph.node
attribute is a dictionary which stores nodes and their attributes (e.g., 'weight'
) in dictionaries belonging to each node.
Each of these lookups should be amortized constant time (i.e., O(1)), however, I am still paying a large penalty for the lookups. Is there some way which I can speed up these lookups (e.g., by writing parts of this as a C extension), or do I need to move the program to a compiled language?
Below is the full source code for the function that provides the context; the vast majority of execution time is spent within this function.
def calculate_node_z_prime(
node,
interaction_graph,
selected_nodes
):
"""Calculates a z'-score for a given node.
The z'-score is based on the z-scores (weights) of the neighbors of
the given node, and proportional to the z-score (weight) of the
given node. Specifically, we find the maximum z-score of all
neighbors of the given node that are also members of the given set
of selected nodes, multiply this z-score by the z-score of the given
node, and return this value as the z'-score for the given node.
If the given node has no neighbors in the interaction graph, the
z'-score is defined as zero.
Returns the z'-score as zero or a positive floating point value.
:Parameters:
- `node`: the node for which to compute the z-prime score
- `interaction_graph`: graph containing the gene-gene or gene
product-gene product interactions
- `selected_nodes`: a `set` of nodes fitting some criterion of
interest (e.g., annotated with a term of interest)
"""
node_neighbors = interaction_graph.neighbors_iter(node)
neighbors_in_selected_nodes = (neighbor for neighbor in
node_neighbors if neighbor in selected_nodes)
neighbor_z_scores = (interaction_graph.node[neighbor]['weight'] for
neighbor in neighbors_in_selected_nodes)
try:
max_z_score = max(neighbor_z_scores)
# max() throws a ValueError if its argument has no elements; in this
# case, we need to set the max_z_score to zero
except ValueError, e:
# Check to make certain max() raised this error
if 'max()' in e.args[0]:
max_z_score = 0
else:
raise e
z_prime = interaction_graph.node[node]['weight'] * max_z_score
return z_prime
Here are the top couple of calls according to cProfiler, sorted by time.
ncalls tottime percall cumtime percall filename:lineno(function)
156067701 352.313 0.000 642.072 0.000 bpln_contextual.py:204(<genexpr>)
156067701 289.759 0.000 289.759 0.000 bpln_contextual.py:202(<genexpr>)
13963893 174.047 0.000 816.119 0.000 {max}
13963885 69.804 0.000 936.754 0.000 bpln_contextual.py:171(calculate_node_z_prime)
7116883 61.982 0.000 61.982 0.000 {method 'update' of 'set' objects}