Hi guys,
I have HDF5 files with multiple groups, where each group contains a data set with >= 25 million rows. At each time step of simulation, each agent outputs the other agents he/she sensed at that time step. There are ~2000 agents in the scenario and thousands of time steps; the O(n^2) nature of the output explains the huge number of rows.
What I'm interested in calculating is the number of unique sightings by category. For instance, agents belong to a side, red, blue, or green. I want to make a two-dimensional table where row i, column j is the number of agents in category j
that were sensed by at least one agent in category i. (I'm using the Sides in this code example, but we could classify the agents in other ways as well, such as by the weapon they have, or the sensors they carry.)
Here's a sample output table; note that the simulation does not output blue/blue sensations because it takes a ton of room and we aren't interested in them. Same for green green)
blue green red
blue 0 492 186
green 1075 0 186
red 451 498 26
The columns are
- tick - time step
- sensingAgentId - id of agent doing sensing
- sensedAgentId - id of agent being sensed
- detRange - range in meters between two agents
- senseType - an enumerated type for what type of sensing was done
Here's the code I am currently using to accomplish this:
def createHeatmap():
h5file = openFile("someFile.h5")
run0 = h5file.root.run0.detections
# A dictionary of dictionaries, {'blue': {'blue':0, 'red':0, ...}
classHeat = emptyDict(sides)
# Interested in Per Category Unique Detections
seenClass = {}
# Initially each side has seen no one
for theSide in sides:
seenClass[theSide] = []
# In-kernel search filtering out many rows in file; in this instance 25,789,825 rows
# are filtered to 4,409,176
classifications = run0.where('senseType == 3')
# Iterate and filter
for row in classifications:
sensedId = row['sensedAgentId']
# side is a function that returns the string representation of the side of agent
# with that id.
sensedSide = side(sensedId)
sensingSide = side(row['sensingAgentId'])
# The side has already seen this agent before; ignore it
if sensedId in seenClass[sensingSide]:
continue
else:
classHeat[sensingSide][sensedSide] += 1
seenClass[sensingSide].append(sensedId)
return classHeat
Note: I have a Java background, so I apologize if this is not Pythonic. Please point this out and suggest ways to improve this code, I would love to become more proficient with Python.
Now, this is very slow: it takes approximately 50 seconds to do this iteration and membership checking, and this is with the most restrictive set of membership criteria (other detection types have many more rows to iterate over).
My question is, is it possible to move the work out of python and into the in-kernel search query? If so, how? Is there some glaringly obvious speedup I am missing? I need to be able to run this function for each run in a set of runs (~30), and for multiple set of criteria (~5), so it would be great if this could be sped up.
Final note: I tried using psyco but that barely made a difference.