views:

3580

answers:

12

I have a data structure which essentially amounts to a nested dictionary. Let's say it looks like this:

{'new jersey': {'mercer county': {'plumbers': 3,
                                  'programmers': 81},
                'middlesex county': {'programmers': 81,
                                     'salesmen': 62}},
 'new york': {'queens county': {'plumbers': 9,
                                'salesmen': 36}}}

Now, maintaining and creating this is pretty painful; every time I have a new state/county/profession I have to create the lower layer dictionaries via obnoxious try/catch blocks. Moreover, I have to create annoying nested iterators if I want to go over all the values.

I could also use tuples as keys, like such:

{('new jersey', 'mercer county', 'plumbers'): 3,
 ('new jersey', 'mercer county', 'programmers'): 81,
 ('new jersey', 'middlesex county', 'programmers'): 81,
 ('new jersey', 'middlesex county', 'salesmen'): 62,
 ('new york', 'queens county', 'plumbers'): 9,
 ('new york', 'queens county', 'salesmen'): 36}

This makes iterating over the values very simple and natural, but it is more syntactically painful to do things like aggregations and looking at subsets of the dictionary (e.g. if I just want to go state-by-state).

Basically, sometimes I want to think of a nested dictionary as a flat dictionary, and sometimes I want to think of it indeed as a complex hierarchy. I could wrap this all in a class, but it seems like someone might have done this already. Alternatively, it seems like there might be some really elegant syntactical constructions to do this.

How could I do this better?

Addendum: I'm aware of setdefault() but it doesn't really make for clean syntax. Also, each sub-dictionary you create still needs to have setdefault() manually set.

+4  A: 

I find 'setdefault' quite useful - it checks if a key is present and adds it if not:

d = {}
d.setdefault('new jersey', {}).setdefault('mercer county', {})['plumbers'] = 3

setdefault always returns the relevant key, so you are actually updating the values of 'd' in place.

When it comes to iterating, I'm sure you could write a generator easily enough if one doesn't already exist in Python:

def iterateStates(d):
    # Let's count up the total number of "plumbers" / "dentists" / etc.
    # across all counties and states
    job_totals = {}

    # I guess this is the annoying nested stuff you were talking about?
    for (state, counties) in d.iteritems():
        for (county, jobs) in counties.iteritems():
            for (job, num) in jobs.iteritems():
                # If job isn't already in job_totals, default it to zero
                job_totals[job] = job_totals.get(job, 0) + num

    # Now return an iterator of (job, number) tuples
    return job_totals.iteritems()

# Display all jobs
for (job, num) in iterateStates(d):
    print "There are %d %s in total" % (job, num)
andygeers
+2  A: 

As for "obnoxious try/catch blocks":

d = {}
d.setdefault('key',{}).setdefault('inner key',{})['inner inner key'] = 'value'
print d

yields

{'key': {'inner key': {'inner inner key': 'value'}}}

You can use this to convert from your flat dictionary format to structured format:

fd = {('new jersey', 'mercer county', 'plumbers'): 3,
 ('new jersey', 'mercer county', 'programmers'): 81,
 ('new jersey', 'middlesex county', 'programmers'): 81,
 ('new jersey', 'middlesex county', 'salesmen'): 62,
 ('new york', 'queens county', 'plumbers'): 9,
 ('new york', 'queens county', 'salesmen'): 36}

for (k1,k2,k3), v in fd.iteritems():
    d.setdefault(k1, {}).setdefault(k2, {})[k3] = v
vartec
+1  A: 

I like the idea of wrapping this in a class and implementing __getitem__ and __setitem__ such that they implemented a simple query language:

>>> d['new jersey/mercer county/plumbers'] = 3
>>> d['new jersey/mercer county/programmers'] = 81
>>> d['new jersey/mercer county/programmers']
81
>>> d['new jersey/mercer country']
<view which implicitly adds 'new jersey/mercer county' to queries/mutations>

If you wanted to get fancy you could also implement something like:

>>> d['*/*/programmers']
<view which would contain 'programmers' entries>

... but mostly I think such a thing would be really fun to implement :D

Aaron Maenpaa
I think this is a bad idea -- you can never predict the syntax of keys. You would still override __getitem__ and __setitem__ but have them take tuples.
YGA
@YGA You're probably right, but it's fun to think about implementing mini languages like this.
Aaron Maenpaa
+6  A: 

Since you have a star-schema design, you might want to structure it more like a relational table and less like a dictionary.

import collections

class Jobs( object ):
    def __init__( self, state, county, title, count ):
        self.state= state
        self.count= county
        self.title= title
        self.count= count

facts = [
    Jobs( 'new jersey', 'mercer county', 'plumbers', 3 ),
    ...

def groupBy( facts, name ):
    total= collections.defaultdict( int )
    for f in facts:
        key= getattr( f, name )
        total[key] += f.count

That kind of thing can go a long way to creating a data warehouse-like design without the SQL overheads.

S.Lott
+1  A: 
class JobDb(object):
    def __init__(self):
        self.data = []
        self.all = set()
        self.free = []
        self.index1 = {}
        self.index2 = {}
        self.index3 = {}

    def _indices(self,(key1,key2,key3)):
        indices = self.all.copy()
        wild = False
        for index,key in ((self.index1,key1),(self.index2,key2),
                                             (self.index3,key3)):
            if key is not None:
                indices &= index.setdefault(key,set())
            else:
                wild = True
        return indices, wild

    def __getitem__(self,key):
        indices, wild = self._indices(key)
        if wild:
            return dict(self.data[i] for i in indices)
        else:
            values = [self.data[i][-1] for i in indices]
            if values:
                return values[0]

    def __setitem__(self,key,value):
        indices, wild = self._indices(key)
        if indices:
            for i in indices:
                self.data[i] = key,value
        elif wild:
            raise KeyError(k)
        else:
            if self.free:
                index = self.free.pop(0)
                self.data[index] = key,value
            else:
                index = len(self.data)
                self.data.append((key,value))
                self.all.add(index)
            self.index1.setdefault(key[0],set()).add(index)
            self.index2.setdefault(key[1],set()).add(index)
            self.index3.setdefault(key[2],set()).add(index)

    def __delitem__(self,key):
        indices,wild = self._indices(key)
        if not indices:
            raise KeyError
        self.index1[key[0]] -= indices
        self.index2[key[1]] -= indices
        self.index3[key[2]] -= indices
        self.all -= indices
        for i in indices:
            self.data[i] = None
        self.free.extend(indices)

    def __len__(self):
        return len(self.all)

    def __iter__(self):
        for key,value in self.data:
            yield key

Example:

>>> db = JobDb()
>>> db['new jersey', 'mercer county', 'plumbers'] = 3
>>> db['new jersey', 'mercer county', 'programmers'] = 81
>>> db['new jersey', 'middlesex county', 'programmers'] = 81
>>> db['new jersey', 'middlesex county', 'salesmen'] = 62
>>> db['new york', 'queens county', 'plumbers'] = 9
>>> db['new york', 'queens county', 'salesmen'] = 36

>>> db['new york', None, None]
{('new york', 'queens county', 'plumbers'): 9,
 ('new york', 'queens county', 'salesmen'): 36}

>>> db[None, None, 'plumbers']
{('new jersey', 'mercer county', 'plumbers'): 3,
 ('new york', 'queens county', 'plumbers'): 9}

>>> db['new jersey', 'mercer county', None]
{('new jersey', 'mercer county', 'plumbers'): 3,
 ('new jersey', 'mercer county', 'programmers'): 81}

>>> db['new jersey', 'middlesex county', 'programmers']
81

>>>

Edit: Now returning dictionaries when querying with wild cards (None), and single values otherwise.

MizardX
Why return lists? Seems it should either return a dictionary (so you know what each number represents) or a sum (since that's all you can really do with the list).
Ben Blank
+4  A: 

If the number of nesting levels is small, I use collections.defaultdict for this:

from collections import defaultdict

def nested_dict_factory(): 
  return defaultdict(int)
def nested_dict_factory2(): 
  return defaultdict(nested_dict_factory)
db = defaultdict(nested_dict_factory2)

db['new jersey']['mercer county']['plumbers'] = 3
db['new jersey']['mercer county']['programmers'] = 81

Using defaultdict like this avoids a lot of messy setdefault(), get(), etc.

+1: defaultdict is one of my all-time favourite additions to python. No more .setdefault()!
John Fouhy
+1  A: 

For easy iterating over your nested dictionary, why not just write a simple generator?

def each_job(my_dict):
    for state, a in my_dict.items():
        for county, b in a.items():
            for job, value in b.items():
                yield {
                    'state'  : state,
                    'county' : county,
                    'job'    : job,
                    'value'  : value
                }

So then, if you have your compilicated nested dictionary, iterating over it becomes simple:

for r in each_job(my_dict):
    print "There are %d %s in %s, %s" % (r['value'], r['job'], r['county'], r['state'])

Obviously your generator can yield whatever format of data is useful to you.

Why are you using try catch blocks to read the tree? It's easy enough (and probably safer) to query whether a key exists in a dict before trying to retrieve it. A function using guard clauses might look like this:

if not my_dict.has_key('new jersey'):
    return False

nj_dict = my_dict['new jersey']
...

Or, a perhaps somewhat verbose method, is to use the get method:

value = my_dict.get('new jersey', {}).get('middlesex county', {}).get('salesmen', 0)

But for a somewhat more succinct way, you might want to look at using a collections.defaultdict, which is part of the standard library since python 2.5.

import collections

def state_struct(): return collections.defaultdict(county_struct)
def county_struct(): return collections.defaultdict(job_struct)
def job_struct(): return 0

my_dict = collections.defaultdict(state_struct)

print my_dict['new jersey']['middlesex county']['salesmen']

I'm making assumptions about the meaning of your data structure here, but it should be easy to adjust for what you actually want to do.

SpoonMeiser
+6  A: 

I would create a YAML file and read it in using PyYaml.

Step 1: Cut a hole in a box. Err--wrong set of instructions.

Step 2: Create a YAML file, "employment.yml":

new jersey:
  mercer county:
    pumbers: 3
    programmers: 81
  middlesex county:
    salesmen: 62
    programmers: 81
new york:
  queens county:
    plumbers: 9
    salesmen: 36

Step 3: Read it in Python

import yaml
file_handle = open("employment.yml")
my_shnazzy_dictionary = yaml.safe_load(file_handle)
file_handle.close()

and now my_shnazzy_dictionary has all your values. If you needed to do this on the fly, create a temporary YAML file & read it back in.

EDIT: As suggested by Moser in the comments, you dont even need to create tempfile's. Just create the YAML as a string & feed that into yaml.safe_load(...)

Pete
YAML is my definitely my choice for inputting lots of deeply nested data (and configuration files, databaes mockups, etc...). If the OP doesn't want extra files lying around, just use a regular Python string in some file and parse that with YAML.
Mosor
Good point on creating YAML strings: This would be a much cleaner approach than using the "tempfile" module repeatedly.
Pete
+1 for amusement value
YGA
+1  A: 

Unless your dataset is going to stay pretty small, you might want to consider using a relational database. It will do exactly what you want: make it easy to add counts, selecting subsets of counts, and even aggregate counts by state, county, occupation, or any combination of these.

allyourcode
+1  A: 

collections.defaultdict can be subclassed to make a nested dict. Then add any useful iteration methods to that class.

>>> from collections import defaultdict
>>> class nesteddict(defaultdict):
    def __init__(self):
        defaultdict.__init__(self, nesteddict)
    def walk(self):
        for key, value in self.iteritems():
            if isinstance(value, nesteddict):
                for tup in value.walk():
                    yield (key,) + tup
            else:
                yield key, value


>>> nd = nesteddict()
>>> nd['new jersey']['mercer county']['plumbers'] = 3
>>> nd['new jersey']['mercer county']['programmers'] = 81
>>> nd['new jersey']['middlesex county']['programmers'] = 81
>>> nd['new jersey']['middlesex county']['salesmen'] = 62
>>> nd['new york']['queens county']['plumbers'] = 9
>>> nd['new york']['queens county']['salesmen'] = 36
>>> for tup in nd.walk():
    print tup


('new jersey', 'mercer county', 'programmers', 81)
('new jersey', 'mercer county', 'plumbers', 3)
('new jersey', 'middlesex county', 'programmers', 81)
('new jersey', 'middlesex county', 'salesmen', 62)
('new york', 'queens county', 'salesmen', 36)
('new york', 'queens county', 'plumbers', 9)
Coady
This is the answer that comes closest to what I was looking for. But ideally there would be all sorts of helper functions, e.g. walk_keys() or such. I'm surprised there's nothing in the standard libraries to do this.
YGA
+1  A: 

As others have suggested, a relational database could be more useful to you. You can use a in-memory sqlite3 database as a data structure to create tables and then query them.

import sqlite3

c = sqlite3.Connection(':memory:')
c.execute('CREATE TABLE jobs (state, county, title, count)')

c.executemany('insert into jobs values (?, ?, ?, ?)', [
    ('New Jersey', 'Mercer County',    'Programmers', 81),
    ('New Jersey', 'Mercer County',    'Plumbers',     3),
    ('New Jersey', 'Middlesex County', 'Programmers', 81),
    ('New Jersey', 'Middlesex County', 'Salesmen',    62),
    ('New York',   'Queens County',    'Salesmen',    36),
    ('New York',   'Queens County',    'Plumbers',     9),
])

# some example queries
print list(c.execute('SELECT * FROM jobs WHERE county = "Queens County"'))
print list(c.execute('SELECT SUM(count) FROM jobs WHERE title = "Programmers"'))

This is just a simple example. You could define separate tables for states, counties and job titles.

Roberto Bonvallet
+12  A: 
class AutoVivification(dict):
    """Implementation of perl's autovivification feature."""
    def __getitem__(self, item):
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            value = self[item] = type(self)()
            return value

Testing:

a = AutoVivification()

a[1][2][3] = 4
a[1][3][3] = 5
a[1][2]['test'] = 6

print a

Output:

{1: {2: {'test': 6, 3: 4}, 3: {3: 5}}}
nosklo
Very nice, I just started looking for a similar thing and wanted to write my own question about this.
Gnudiff