Hello All,

I recently came across PyTables and find it very cool. It is clearly superior to CSV for very large data sets. I am running some simulations in Python, and the output is not that large: say, 200 columns and 2000 rows.

If someone has experience with both, can you suggest which format would be more convenient in the long run for data sets of this modest size? PyTables has data manipulation capabilities, and the data can be browsed with ViTables, but that browser does not have as much functionality as, say, Excel, which can be used for CSV. Similarly, do you find one better than the other for importing and exporting data when working mainly in Python? Is one more convenient in terms of file organization? Any comments on issues such as these would be helpful.

Thanks.

A: 

These are not "exclusive" choices.

You need both.

CSV is just a data exchange format. If you use pytables, you still need to import and export in CSV format.

S.Lott
Can you please elaborate? I don't need to create CSV files to use pytables. Thanks!
Curious2learn
You need to create CSV to exchange data with applications that only accept CSV. Spreadsheets, for example.
S.Lott
+2  A: 

As far as importing/exporting goes, PyTables uses a standardized file format called HDF5. Many scientific software packages (like MATLAB) have built-in support for HDF5, and the C API isn't terrible. So any data you need to exchange with one of these packages can simply be kept in HDF5 files.

PyTables does add some attributes of its own, but these shouldn't hurt you. Of course, if you store Python objects in the file, you won't be able to read them elsewhere.

The one nice thing about CSV files is that they're human readable. However, if you need to store anything other than simple numbers in them and communicate with others, you'll have issues. I receive CSV files from people in other organizations, and I've noticed that humans aren't good at making sure things like string quoting are done correctly. It's good that Python's CSV parser is as flexible as it is. One other issue is that floating point numbers are not stored exactly in decimal text unless enough significant digits are written (17 for a double). It's usually good enough, though.
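To make the quoting and precision points concrete, here is a minimal sketch using only the standard library. The sample labels and values are made up for illustration. It writes fields containing quotes, commas, and a newline through csv.writer, reads them back, and compares a float written at full precision with one truncated to six significant digits.

```python
import csv
import io

# Made-up sample rows: labels with embedded quotes, commas, and a newline.
rows = [
    ["label", "value"],
    ['run "A", seed 1', 0.1],
    ["line one\nline two", 1.0 / 3.0],
]

# newline="" leaves line endings to the csv module, as its docs recommend.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL)
writer.writerows(rows)

# Read the text back; the reader undoes the quoting applied by the writer.
buffer.seek(0)
parsed = list(csv.reader(buffer))

labels = [row[0] for row in parsed[1:]]
values = [float(row[1]) for row in parsed[1:]]

# A float written with only six significant digits does not survive.
truncated = float(f"{1.0 / 3.0:.6g}")

print(labels == [row[0] for row in rows[1:]])  # quoting round trip
print(values == [row[1] for row in rows[1:]])  # full-precision floats
print(truncated == 1.0 / 3.0)                  # truncated float
```

Letting the csv module do the quoting, rather than joining strings by hand, avoids most of the quoting mistakes described above.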

kwatford
Thanks for the feedback! Would you say that with ViTables, even PyTables files become human readable?
Curious2learn
+2  A: 

Have you considered Numpy arrays?

PyTables is wonderful when your data is too large to fit in memory, but a 200x2000 matrix of 8-byte floats only requires about 3MB of memory. So I think PyTables may be overkill.

You can save numpy arrays to files using np.savetxt (plain text), np.savez (several arrays in one .npz archive), or np.savez_compressed (the same, with compression), and can read them from files with np.loadtxt or np.load.
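As a rough sketch of that round trip for a matrix of the size in the question: the file names and the random test data below are assumptions for illustration. np.savez_compressed is used for the binary file because plain np.savez stores the arrays without compression.

```python
import os
import tempfile

import numpy as np

# Stand-in for simulation output: 2000 rows by 200 columns of floats.
rng = np.random.default_rng(0)
matrix = rng.random((2000, 200))

with tempfile.TemporaryDirectory() as tmp:
    txt_path = os.path.join(tmp, "matrix.csv")  # hypothetical file names
    npz_path = os.path.join(tmp, "matrix.npz")

    # Text route: readable in a spreadsheet or any text editor.
    np.savetxt(txt_path, matrix, delimiter=",")
    from_text = np.loadtxt(txt_path, delimiter=",")

    # Binary route: a compressed .npz archive holding one named array.
    np.savez_compressed(npz_path, matrix=matrix)
    with np.load(npz_path) as archive:
        from_npz = archive["matrix"]

print(from_text.shape, np.allclose(matrix, from_text))
print(from_npz.shape, np.array_equal(matrix, from_npz))
```

The text file doubles as a CSV that Excel can open, while the .npz file is smaller and faster to load, so keeping both is a reasonable option at this size.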

If you have many such arrays to store on disk, then I'd suggest using a database instead of numpy .npz files. By the way, to store a 200x2000 matrix in a database, you only need 4 table columns: id, row, col, value:

import sqlite3
import numpy as np

db = sqlite3.connect(':memory:')
cursor = db.cursor()
cursor.execute('''CREATE TABLE foo
             (id INTEGER PRIMARY KEY AUTOINCREMENT,
              row INTEGER, col INTEGER, value FLOAT)''')
ROWS=4
COLUMNS=6
matrix = np.random.random((ROWS,COLUMNS))
print(matrix)
# [[ 0.87050721  0.22395398  0.19473001  0.14597821  0.02363803  0.20299432]
#  [ 0.11744885  0.61332597  0.19860043  0.91995295  0.84857095  0.53863863]
#  [ 0.80123759  0.52689885  0.05861043  0.71784406  0.20222138  0.63094807]
#  [ 0.01309897  0.45391578  0.04950273  0.93040381  0.41150517  0.66263562]]

# Store matrix in table foo
cursor.executemany('INSERT INTO foo(row, col, value) VALUES (?,?,?) ',
              ((r,c,value) for r,row in enumerate(matrix) 
                               for c,value in enumerate(row)))

# Retrieve matrix from table foo
cursor.execute('SELECT value FROM foo ORDER BY row,col')
# Each fetched row is a 1-tuple; unpack it to get the bare value
data = (value for (value,) in cursor.fetchall())
matrix2 = np.fromiter(data, dtype=float).reshape((ROWS,COLUMNS))
print(matrix2)
# [[ 0.87050721  0.22395398  0.19473001  0.14597821  0.02363803  0.20299432]
#  [ 0.11744885  0.61332597  0.19860043  0.91995295  0.84857095  0.53863863]
#  [ 0.80123759  0.52689885  0.05861043  0.71784406  0.20222138  0.63094807]
#  [ 0.01309897  0.45391578  0.04950273  0.93040381  0.41150517  0.66263562]]

If you have many such 200x2000 matrices, you just need one more table column to specify which matrix.
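A minimal sketch of that extra column, using only the standard library and small nested lists in place of numpy arrays. The table name cells, the column names matrix_id, row_idx and col_idx, and the two helper functions are all hypothetical names chosen for this illustration. Each value is stored together with its coordinates, which come from enumerate over the rows and then over the values in each row.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# One table row per matrix cell; matrix_id says which matrix it belongs to.
db.execute(
    """CREATE TABLE cells
       (matrix_id INTEGER, row_idx INTEGER, col_idx INTEGER, value REAL,
        PRIMARY KEY (matrix_id, row_idx, col_idx))"""
)


def store_matrix(db, matrix_id, matrix):
    """Insert every cell of a 2-D sequence along with its coordinates."""
    db.executemany(
        "INSERT INTO cells (matrix_id, row_idx, col_idx, value) "
        "VALUES (?, ?, ?, ?)",
        ((matrix_id, r, c, float(value))
         for r, row in enumerate(matrix)
         for c, value in enumerate(row)),
    )


def load_matrix(db, matrix_id, n_cols):
    """Read one matrix back as a list of rows, in row-major order."""
    cursor = db.execute(
        "SELECT value FROM cells WHERE matrix_id = ? "
        "ORDER BY row_idx, col_idx",
        (matrix_id,),
    )
    values = [value for (value,) in cursor]
    return [values[i:i + n_cols] for i in range(0, len(values), n_cols)]


first = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
second = [[0.5, 0.25, 0.125], [7.0, 8.0, 9.0]]
store_matrix(db, 1, first)
store_matrix(db, 2, second)

restored = load_matrix(db, 2, n_cols=3)
print(restored == second)
```

Sorting by row_idx and then col_idx on the way out is what lets the flat list of values be cut back into rows of the original width.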

unutbu
This sounds interesting. I don't know much about databases, but I will look into this and post back. What is not clear to me from your example is how the coordinates of each value in the 2000-row x 200-column matrix are assigned in the database table. I will try to figure that out.
Curious2learn
A: 

I think it's very hard to compare PyTables and CSV. PyTables is a data structure, while CSV is an exchange format for data.

mossplix