views:

246

answers:

1

Hi guys,

I have HDF5 files with multiple groups, where each group contains a data set with >= 25 million rows. At each time step of simulation, each agent outputs the other agents he/she sensed at that time step. There are ~2000 agents in the scenario and thousands of time steps; the O(n^2) nature of the output explains the huge number of rows.

What I'm interested in calculating is the number of unique sightings by category. For instance, agents belong to a side, red, blue, or green. I want to make a two-dimensional table where row i, column j is the number of agents in category j that were sensed by at least one agent in category i. (I'm using the Sides in this code example, but we could classify the agents in other ways as well, such as by the weapon they have, or the sensors they carry.)

Here's a sample output table; note that the simulation does not output blue/blue sensations because it takes a ton of room and we aren't interested in them. Same for green green)

      blue     green      red
blue  0      492       186
green 1075    0     186
red   451    498      26

The columns are

  1. tick - time step
  2. sensingAgentId - id of agent doing sensing
  3. sensedAgentId - id of agent being sensed
  4. detRange - range in meters between two agents
  5. senseType - an enumerated type for what type of sensing was done

Here's the code I am currently using to accomplish this:

def createHeatmap():
  h5file = openFile("someFile.h5")
  run0 = h5file.root.run0.detections

  # A dictionary of dictionaries, {'blue': {'blue':0, 'red':0, ...}
  classHeat = emptyDict(sides)

  # Interested in Per Category Unique Detections
  seenClass = {}

  # Initially each side has seen no one    
  for theSide in sides:
    seenClass[theSide] = []

  # In-kernel search filtering out many rows in file; in this instance 25,789,825 rows
  # are filtered to 4,409,176  
  classifications = run0.where('senseType == 3')

  # Iterate and filter 
  for row in classifications:
    sensedId = row['sensedAgentId']
    # side is a function that returns the string representation of the side of agent
    # with that id.
    sensedSide = side(sensedId)
    sensingSide = side(row['sensingAgentId'])

    # The side has already seen this agent before; ignore it
    if sensedId in seenClass[sensingSide]:
      continue
    else:
      classHeat[sensingSide][sensedSide] += 1
      seenClass[sensingSide].append(sensedId)


  return classHeat

Note: I have a Java background, so I apologize if this is not Pythonic. Please point this out and suggest ways to improve this code, I would love to become more proficient with Python.

Now, this is very slow: it takes approximately 50 seconds to do this iteration and membership checking, and this is with the most restrictive set of membership criteria (other detection types have many more rows to iterate over).

My question is, is it possible to move the work out of python and into the in-kernel search query? If so, how? Is there some glaringly obvious speedup I am missing? I need to be able to run this function for each run in a set of runs (~30), and for multiple set of criteria (~5), so it would be great if this could be sped up.

Final note: I tried using psyco but that barely made a difference.

+1  A: 

If you have N=~2k agents, I suggest putting all sightings into a numpy array of size NxN. This easily fits in memory (around 16 meg for integers). Just store a 1 wherever a sighting occurred.

Assume that you have an array sightings. The first coordinate is Sensing, the second is Sensed. Assume you also have 1-d index arrays listing which agents are on which side. You can get the number of sightings of side B by side A this way:

sideAseesB = sightings[sideAindices, sideBindices]
sideAseesBcount = numpy.logical_or.reduce(sideAseesB, axis=0).sum()

It's possible you'd need to use sightings.take(sideAindices, axis=0).take(sideBindices, axis=1) in the first step, but I doubt it.

thouis
http://pastebin.com/f5490dce0 - what am I doing wrong?
I82Much
I was wrong, you do need the sightings.take(...).take(...) bit. (I updated the pastebin)
thouis