I am new to PyTables, and am looking at using it to process data generated from an agent-based modeling simulation and stored in HDF5. I'm working with a 39 MB test file, and am experiencing some strangeness. Here's the layout of the table:

    /example/agt_coords (Table(2000000,)) ''
  description := {
  "agent": Int32Col(shape=(), dflt=0, pos=0),
  "x": Float64Col(shape=(), dflt=0.0, pos=1),
  "y": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (20000,)

Here's how I'm accessing it in Python:

>>> from tables import *
>>> h5file = openFile("alternate_hose_test.h5", "a")
>>> h5file.root.example.agt_coords
/example/agt_coords (Table(2000000,)) ''
  description := {
  "agent": Int32Col(shape=(), dflt=0, pos=0),
  "x": Float64Col(shape=(), dflt=0.0, pos=1),
  "y": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (20000,)
>>> coords = h5file.root.example.agt_coords

Now here's where things get weird.

>>> [x for x in coords[1:100] if x['agent'] == 1]
[(1, 25.0, 78.0), (1, 25.0, 78.0)]
>>> [x for x in coords if x['agent'] == 1]
[(1000000, 25.0, 78.0), (1000000, 25.0, 78.0)]
>>> [x for x in coords.iterrows() if x['agent'] == 1]
[(1000000, 25.0, 78.0), (1000000, 25.0, 78.0)]
>>> [x['agent'] for x in coords[1:100] if x['agent'] == 1]
[1, 1]
>>> [x['agent'] for x in coords if x['agent'] == 1]
[1, 1]

I don't understand why the values are screwed up when I iterate over the whole table, but not when I take a small subset of the whole set of rows. I'm sure this is an error in how I'm using the library, so any help in this matter would be extremely appreciated.

+4  A: 

This is a very common point of confusion when iterating over a Table object.

When you iterate over a Table, the item you get on each step is not a copy of that row's data, but an accessor to the table at the current row. So with

[x for x in coords if x['agent'] == 1]

you create a list of row accessors that all point to the "current" row of the table. Once the iteration has finished, that is the last row, which is why every entry shows the same values. But when you do

[x["agent"] for x in coords if x['agent'] == 1]

you use the accessor as you build the list, so each value is read while the accessor still points at the matching row.
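To make the difference concrete, here is a minimal pure-Python sketch of the same effect. `RowCursor` is a hypothetical stand-in I made up for illustration, not the actual PyTables `Row` implementation; the only thing it shares with it is that one object is reused for every row.

```python
class RowCursor:
    """Hypothetical stand-in for a row accessor: a single object reused for every row."""

    def __init__(self, rows):
        self._rows = rows
        self._pos = -1

    def __iter__(self):
        for pos in range(len(self._rows)):
            self._pos = pos
            yield self  # the same object is handed out on every step

    def __getitem__(self, name):
        # Always reads from whatever row the cursor currently points at.
        return self._rows[self._pos][name]


rows = [{"agent": 1}, {"agent": 2}, {"agent": 1}, {"agent": 1000000}]
cursor = RowCursor(rows)

# Keeping the accessor: both entries are the same cursor, now parked on the last row.
kept = [r for r in cursor if r["agent"] == 1]
late = [r["agent"] for r in kept]
print(late)   # [1000000, 1000000]

# Reading through the accessor during iteration captures the real values.
early = [r["agent"] for r in cursor if r["agent"] == 1]
print(early)  # [1, 1]
```

The filter `x['agent'] == 1` works in both cases because it is evaluated during the iteration; only the stored items differ.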

The solution is to get all the data you need as you build the list, by using the accessor on each iteration. There are two options:

[x[:] for x in coords if x['agent'] == 1]

or

[x.fetch_all_fields() for x in coords if x['agent'] == 1]

The former builds a list of tuples. The latter builds a list of NumPy void objects. IIRC, the second is faster, but the former might make more sense for your purposes.
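For reference, a small self-contained sketch of both options against a throwaway file. It assumes PyTables 3.x naming (`open_file`, `create_table`, `read_where`) rather than the `openFile` spelling used in the question, and the file name, the `AgentCoord` class and the four sample rows are made up for the demo. The last line shows a third route, `read_where`, which returns the matching rows as a NumPy structured array and avoids the accessor altogether.

```python
import os
import tempfile

import tables


class AgentCoord(tables.IsDescription):
    # Mirrors the table layout in the question (hypothetical class name).
    agent = tables.Int32Col(pos=0)
    x = tables.Float64Col(pos=1)
    y = tables.Float64Col(pos=2)


sample = [(1, 25.0, 78.0), (2, 1.0, 2.0), (1, 25.0, 78.0), (1000000, 3.0, 4.0)]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "accessor_demo.h5")  # throwaway demo file
    with tables.open_file(path, "w") as h5file:
        coords = h5file.create_table("/", "agt_coords", AgentCoord)
        row = coords.row
        for agent, x, y in sample:
            row["agent"] = agent
            row["x"] = x
            row["y"] = y
            row.append()
        coords.flush()

        # Option 1: copy the fields out as a tuple on each iteration.
        as_tuples = [r[:] for r in coords if r["agent"] == 1]

        # Option 2: copy the whole row out as a NumPy void object.
        as_records = [r.fetch_all_fields() for r in coords if r["agent"] == 1]

        # Alternative: let PyTables run the query and return an array.
        matched = coords.read_where("agent == 1")

print(as_tuples)
print(as_records)
print(matched)
```

All three results are plain copies, so they stay valid after the iteration ends and after the file is closed.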

Here's a good explanation from the PyTables developer. In future releases, printing a row accessor object may not simply show the data, but state that it's a row accessor object.

AFoglia
Thank you for the very detailed explanation of the unexpected behavior. I really wish that the tutorials (e.g. http://www.pytables.org/moin/HintsForSQLUsers , http://www.pytables.org/docs/manual/ch03.html (see 3.1.6)) made note of what you explained.
I82Much