ansaurus

Question

Pytables vs. CSV for files that are not very large

Answer 1

A:

These are not "exclusive" choices.

You need both.

CSV is just a data exchange format. If you use pytables, you still need to import and export in CSV format.

S.Lott 2010-10-26 11:07:43

Can you please elaborate? I don't need to create CSV files to use pytables. Thanks!

Curious2learn 2010-10-26 12:11:24

You need to create CSV to exchange data with applications that only accept CSV. Spreadsheets, for example.

S.Lott 2010-10-26 13:42:35

Answer 2

+2 A:

As far as importing/exporting goes, PyTables uses a standardized file format called HDF5. Many scientific software packages (like MATLAB) have built-in support for HDF5, and the C API isn't terrible. So any data you need to export from or import to one of these languages can simply be kept in HDF5 files.

PyTables does add some attributes of its own, but these shouldn't hurt you. Of course, if you store Python objects in the file, you won't be able to read them elsewhere.

The one nice thing about CSV files is that they're human readable. However, if you need to store anything other than simple numbers in them and communicate with others, you'll have issues. I receive CSV files from people in other organizations, and I've noticed that humans aren't good at making sure things like string quoting are done correctly. It's good that Python's CSV parser is as flexible as it is. One other issue is that floating point numbers can't be stored exactly in text using decimal format. It's usually good enough, though.

kwatford 2010-10-26 11:19:23

Thanks for the feedback! Would you say that with ViTables, even PyTables become human readable.

Curious2learn 2010-10-26 14:49:04

Answer 3

+2 A:

Have you considered Numpy arrays?

PyTables are wonderful when your data is too large to fit in memory, but a 200x2000 matrix of 8 byte floats only requires about 3MB of memory. So I think PyTables may be overkill.

You can save numpy arrays to files using np.savetxt or np.savez (for compression), and can read them from files with np.loadtxt or np.load.

If you have many such arrays to store on disk, then I'd suggest using a database instead of numpy .npz files. By the way, to store a 200x2000 matrix in a database, you only need 4 table columns: id, row, col, value:

import sqlite3
import numpy as np

db = sqlite3.connect(':memory:')
cursor = db.cursor()
cursor.execute('''CREATE TABLE foo
             (id INTEGER PRIMARY KEY AUTOINCREMENT,
              row INTEGER, col INTEGER, value FLOAT)''')
ROWS=4
COLUMNS=6
matrix = np.random.random((ROWS,COLUMNS))
print(matrix)
# [[ 0.87050721  0.22395398  0.19473001  0.14597821  0.02363803  0.20299432]
#  [ 0.11744885  0.61332597  0.19860043  0.91995295  0.84857095  0.53863863]
#  [ 0.80123759  0.52689885  0.05861043  0.71784406  0.20222138  0.63094807]
#  [ 0.01309897  0.45391578  0.04950273  0.93040381  0.41150517  0.66263562]]

# Store matrix in table foo
cursor.executemany('INSERT INTO foo(row, col, value) VALUES (?,?,?) ',
              ((r,c,value) for r,row in enumerate(matrix) 
                               for c,value in enumerate(row)))

# Retrieve matrix from table foo
cursor.execute('SELECT value FROM foo ORDER BY row,col')
data=zip(*cursor.fetchall())[0]
matrix2 = np.fromiter(data,dtype=np.float).reshape((ROWS,COLUMNS))
print(matrix2)
# [[ 0.87050721  0.22395398  0.19473001  0.14597821  0.02363803  0.20299432]
#  [ 0.11744885  0.61332597  0.19860043  0.91995295  0.84857095  0.53863863]
#  [ 0.80123759  0.52689885  0.05861043  0.71784406  0.20222138  0.63094807]
#  [ 0.01309897  0.45391578  0.04950273  0.93040381  0.41150517  0.66263562]]

If you have many such 200x2000 matrices, you just need one more table column to specify which matrix.

unutbu 2010-10-26 13:05:30

This sounds interesting. I don't know much about databases, but will look into this and post back. What is not clear to me from your example is that how are the coordinates of each value in the 2000 rows x 200 columns matrix are being assigned to the database table. I will try to figure that out.

Curious2learn 2010-10-26 14:49:57

Answer 4

A:

i think its very hard to comapre pytables and csv.. pyTable is a datastructure ehile CSV is an exchange format for data.

mossplix 2010-10-26 13:15:50

ansaurus

tags:

views:

answers:

Pytables vs. CSV for files that are not very large

related questions