ansaurus

Question

Python: Data Structure for Maintaing Tabular Data in Memory?

Answer 1

+2 A:

I personally would use the list of row lists. Because the data for each row is always in the same order, you can easily sort by any of the columns by simply accessing that element in each of the lists. You can also easily count based on a particular column in each list, and make searches as well. It's basically as close as it gets to a 2-d array.

Really the only disadvantage here is that you have to know in what order the data is in, and if you change that ordering, you'll have to change your search/sorting routines to match.

Another thing you can do is have a list of dictionaries.

rows = []
rows.append({"ID":"1", "name":"Cat", "year":"1998", "priority":"1"})

This would avoid needing to know the order of the parameters, so you can look through each "year" field in the list.

AlbertoPL 2009-06-24 12:58:05

+1: Lists of dictionaries work REALLY well, and are compatible with reading and writing JSON or CSV files.

S.Lott 2009-06-24 14:01:32

Answer 2

+8 A:

Having a "table" in memory that needs lookups, sorting, and arbitrary aggregation really does call out for SQL. You said you tried SQLite, but did you realize that SQLite can use an in-memory-only database?

connection = sqlite3.connect(':memory:')

Then you can create/drop/query/update tables in memory with all the functionality of SQLite and no files left over when you're done. And as of Python 2.5, sqlite3 is in the standard library, so it's not really "overkill" IMO.

Here is a sample of how one might create and populate the database:

import csv
import sqlite3

db = sqlite3.connect(':memory:')

def init_db(cur):
    cur.execute('''CREATE TABLE foo (
        Row INTEGER,
        Name TEXT,
        Year INTEGER,
        Priority INTEGER)''')

def populate_db(cur, csv_fp):
    rdr = csv.reader(csv_fp)
    cur.executemany('''
        INSERT INTO foo (Row, Name, Year, Priority)
        VALUES (?,?,?,?)''', rdr)

cur = db.cursor()
init_db(cur)
populate_db(cur, open('my_csv_input_file.csv'))
db.commit()

If you'd really prefer not to use SQL, you should probably use a list of dictionaries:

lod = [ ] # "list of dicts"

def populate_lod(lod, csv_fp):
    rdr = csv.DictReader(csv_fp, ['Row', 'Name', 'Year', 'Priority'])
    lod.extend(rdr)

def query_lod(lod, filter=None, sort_keys=None):
    if filter is not None:
        lod = (r for r in lod if filter(r))
    if sort_keys is not None:
        lod = sorted(lod, key=lambda r:[r[k] for k in sort_keys])
    else:
        lod = list(lod)
    return lod

def lookup_lod(lod, **kw):
    for row in lod:
        for k,v in kw.iteritems():
            if row[k] != str(v): break
        else:
            return row
    return None

Testing then yields:

>>> lod = []
>>> populate_lod(lod, csv_fp)
>>> 
>>> pprint(lookup_lod(lod, Row=1))
{'Name': 'Cat', 'Priority': '1', 'Row': '1', 'Year': '1998'}
>>> pprint(lookup_lod(lod, Name='Aardvark'))
{'Name': 'Aardvark', 'Priority': '1', 'Row': '4', 'Year': '2000'}
>>> pprint(query_lod(lod, sort_keys=('Priority', 'Year')))
[{'Name': 'Cat', 'Priority': '1', 'Row': '1', 'Year': '1998'},
 {'Name': 'Dog', 'Priority': '1', 'Row': '3', 'Year': '1999'},
 {'Name': 'Aardvark', 'Priority': '1', 'Row': '4', 'Year': '2000'},
 {'Name': 'Wallaby', 'Priority': '1', 'Row': '5', 'Year': '2000'},
 {'Name': 'Fish', 'Priority': '2', 'Row': '2', 'Year': '1998'},
 {'Name': 'Zebra', 'Priority': '3', 'Row': '6', 'Year': '2001'}]
>>> pprint(query_lod(lod, sort_keys=('Year', 'Priority')))
[{'Name': 'Cat', 'Priority': '1', 'Row': '1', 'Year': '1998'},
 {'Name': 'Fish', 'Priority': '2', 'Row': '2', 'Year': '1998'},
 {'Name': 'Dog', 'Priority': '1', 'Row': '3', 'Year': '1999'},
 {'Name': 'Aardvark', 'Priority': '1', 'Row': '4', 'Year': '2000'},
 {'Name': 'Wallaby', 'Priority': '1', 'Row': '5', 'Year': '2000'},
 {'Name': 'Zebra', 'Priority': '3', 'Row': '6', 'Year': '2001'}]
>>> print len(query_lod(lod, lambda r:1997 <= int(r['Year']) <= 2002))
6
>>> print len(query_lod(lod, lambda r:int(r['Year'])==1998 and int(r['Priority']) > 2))
0

Personally I like the SQLite version better since it preserves your types better (without extra conversion code in Python) and easily grows to accommodate future requirements. But then again, I'm quite comfortable with SQL, so YMMV.

Rick Copeland 2009-06-24 13:00:00

What's the recommended approach for populating the database in this example?

Roee Adler 2009-06-24 13:13:30

I have edited the answer to include sample code to populate the database from a CSV file.

Rick Copeland 2009-06-24 13:28:12

-1: overkill, a list of dictionaries is probably more useful for this application.

S.Lott 2009-06-24 14:02:13

Lists of dictionaries make the queries, sorting, and aggregation a lot more verbose, even with list comprehensions and sort(..., key=) stuff. So I'd say sqlite in-memory is just about a perfect match, here. To each his own...

Rick Copeland 2009-06-24 14:27:13

To my mind, "verbose" Python is slightly better than SQL.

S.Lott 2009-06-24 14:34:02

I've added the list of dictionaries option to the answer, as well.

Rick Copeland 2009-06-24 14:57:59

Wow, what an investment. Chapeaux!

Roee Adler 2009-06-24 15:17:16

Answer 3

+2 A:

First, given that you have a complex data retrieval scenario, are you sure even SQLite is overkill?

You'll end up having an ad hoc, informally-specified, bug-ridden, slow implementation of half of SQLite, paraphrasing Greenspun's Tenth Rule.

That said, you are very right in saying that choosing a single data structure will impact one or more of searching, sorting or counting, so if performance is paramount and your data is constant, you could consider having more than one structure for different purposes.

Above all, measure what operations will be more common and decide which structure will end up costing less.

Vinko Vrsalovic 2009-06-24 13:00:06

Answer 4

+1 A:

Have a Table class whose rows is a list of dict or better row objects

In table do not directly add rows but have a method which update few lookup maps e.g. for name if you are not adding rows in order or id are not consecutive you can have idMap too e.g.

class Table(object):
    def __init__(self):
        self.rows =  []# list of row objects, we assume if order of id
        self.nameMap = {} # for faster direct lookup for row by name

    def addRow(self, row):
        self.rows.append(row)
        self.nameMap[row['name']] = row

    def getRow(self, name):
        return self.nameMap[name]


table = Table()
table.addRow({'ID':1,'name':'a'})

Anurag Uniyal 2009-06-24 13:12:43

ansaurus

tags:

views:

answers:

Python: Data Structure for Maintaing Tabular Data in Memory?

related questions