I have some data. 224,000 rows of it, in a SQLite database. I want to extract time series information from it to feed a data visualisation tool. Essentially, each row in the db is an event that has (among other things not strictly relevant) a time-date group in seconds since the epoch and a name responsible for it. I want to extract how many events each name has for every week in the db.

That's simple enough:

SELECT COUNT(*), 
       name, 
       strftime('%W:%Y', time, 'unixepoch') 
  FROM events 
 GROUP BY strftime('%W:%Y', time, 'unixepoch'), name 
 ORDER BY time

and we get about six thousand rows of data.

count  name   week:year
   23  fudge  23:2009
etc...

But I don't want a row for each name in each week - I want a row for each name, and a column for each week, like this:

Name   23:2009  24:2009  25:2009
fudge       23        6       19
fish         1        0       12
etc...

Now, the monitoring process has been running for 69 weeks, and the count of unique names is 502. So clearly I'm far from keen on any solution that involves hardcoding all the columns, still less the rows. I'm less averse to something that iterates over the lot, say with Python's executemany(), but I'm willing to accept it if necessary. SQL is meant to be set-wise, dammit.

+1  A: 

A good approach in cases like this is not to push SQL to the point where it becomes convoluted and hard to understand and maintain. Let SQL do what it conveniently can and post-process the query results in Python.

Here's a cut-down version of a simple crosstab generator that I wrote. The full version delivers row/column/grand totals.

You'll note that it has built-in "group by" -- the original use-case was for summarising data obtained from Excel files using Python and xlrd.

The row_key and col_key that you supply don't need to be strings as in the example; they can be tuples -- e.g. (year, week) in your case -- or they could be integers -- e.g. you have a mapping of string column name to integer sort key.

import sys

class CrossTab(object):

    def __init__(
        self,
        missing=0, # what to return for an empty cell. Alternatives: '', 0.0, None, 'NULL'
        ):
        self.missing = missing
        self.col_key_set = set()
        self.cell_dict = {}
        self.headings_OK = False

    def add_item(self, row_key, col_key, value):
        self.col_key_set.add(col_key)
        try:
            self.cell_dict[row_key][col_key] += value
        except KeyError:
            try:
                self.cell_dict[row_key][col_key] = value
            except KeyError:
                self.cell_dict[row_key] = {col_key: value}

    def _process_headings(self):
        if self.headings_OK:
            return
        self.row_headings = sorted(self.cell_dict.keys())
        self.col_headings = sorted(self.col_key_set)
        self.headings_OK = True

    def get_col_headings(self):
        self._process_headings()
        return self.col_headings

    def generate_row_info(self):
        self._process_headings()
        for row_key in self.row_headings:
            row_dict = self.cell_dict[row_key]
            row_vals = [row_dict.get(col_key, self.missing) for col_key in self.col_headings]
            yield row_key, row_vals

    def dump(self, f=None, header=None, footer=''):
        # Debugging aid: print the instance's attributes, sorted by name.
        if f is None:
            f = sys.stdout
        if header is not None:
            print(header, file=f)
        for attr, value in sorted(self.__dict__.items()):
            print("%s: %r" % (attr, value), file=f)
        if footer is not None:
            print(footer, file=f)

if __name__ == "__main__":

    data = [
        ['Rob', 'Morn', 240],
        ['Rob', 'Aft',  300],
        ['Joe', 'Morn',  70],
        ['Joe', 'Aft',   80],
        ['Jill', 'Morn', 100],
        ['Jill', 'Aft',  150],
        ['Rob', 'Aft',   40],
        ['Rob', 'aft',    5],
        ['Dozy', 'Aft',   1],
        # Dozy doesn't show up till lunch-time
        ['Nemo', 'never', -1],
        ]
    NAME, TIME, AMOUNT = range(3)
    xlate_time = {'morn': "AM", "aft": "PM"}

    print()
    ctab = CrossTab(missing=None)
    # ctab.dump(header='=== after init ===')
    for s in data:
        ctab.add_item(
            row_key=s[NAME],
            col_key=xlate_time.get(s[TIME].lower(), "XXXX"),
            value=s[AMOUNT])
        # ctab.dump(header='=== after add_item ===')
    print(ctab.get_col_headings())
    # ctab.dump(header='=== after get_col_headings ===')
    for x in ctab.generate_row_info():
        print(x)

Output:

['AM', 'PM', 'XXXX']
('Dozy', [None, 1, None])
('Jill', [100, 150, None])
('Joe', [70, 80, None])
('Nemo', [None, None, -1])
('Rob', [240, 345, None])
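
For the asker's data, a minimal sketch of how this class might be wired up, assuming the events table (with time and name columns) described in the question; the database filename and the separate '%Y'/'%W' extraction are my own choices for illustration:

import sqlite3

# Sketch only: feed the weekly-count query into CrossTab.  The "events"
# table (columns: time, name) comes from the question; the filename is
# hypothetical.  Year and week are selected separately so that the
# (year, week) tuple column keys sort chronologically.
conn = sqlite3.connect("events.db")
query = """
    SELECT name,
           CAST(strftime('%Y', time, 'unixepoch') AS INTEGER) AS yr,
           CAST(strftime('%W', time, 'unixepoch') AS INTEGER) AS wk,
           COUNT(*) AS cnt
      FROM events
     GROUP BY name, yr, wk
"""
ctab = CrossTab(missing=0)
for name, year, week, count in conn.execute(query):
    ctab.add_item(row_key=name, col_key=(year, week), value=count)

print(ctab.get_col_headings())
for name, counts in ctab.generate_row_info():
    print(name, counts)
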
John Machin
+1  A: 

I would first do your query

SELECT COUNT(*), 
       name, 
       strftime('%W:%Y', time, 'unixepoch') 
  FROM events 
 GROUP BY strftime('%W:%Y', time, 'unixepoch'), name 
 ORDER BY time

and then do the post-processing with Python.

So you don't have to iterate over 224,000 rows but only over roughly 6,000. You can easily hold those 6,000 rows in memory for processing with Python; you could probably hold all 224,000 as well, but it would take considerably more memory.
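
A minimal sketch of that post-processing, assuming Python's sqlite3 module; the database filename is made up, and the week key is written year-first ('%Y:%W') so that plain string sorting is chronological:

import sqlite3
from collections import defaultdict

# Sketch only: run the grouped query, then pivot in Python.  The database
# filename is hypothetical; '%Y:%W' (year first) makes the week strings
# sort chronologically.
conn = sqlite3.connect("events.db")
rows = conn.execute("""
    SELECT COUNT(*) AS cnt,
           name,
           strftime('%Y:%W', time, 'unixepoch') AS week
      FROM events
     GROUP BY week, name
""").fetchall()

weeks = sorted({week for cnt, name, week in rows})   # one column per week
table = defaultdict(dict)                            # name -> {week: count}
for cnt, name, week in rows:
    table[name][week] = cnt

print("Name", *weeks)
for name in sorted(table):
    print(name, *(table[name].get(week, 0) for week in weeks))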

However: newer versions of SQLite support the aggregate function group_concat. Maybe you can use it for pivoting in SQL? I can't try it because I use an older version.
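
For what it's worth, a hedged sketch of that idea; note it isn't a true pivot, since each name comes back as one row with all its week counts packed into a single string that still has to be split client-side:

import sqlite3

# Sketch only: pack each name's weekly counts into one string using
# group_concat (available in SQLite 3.5.4 and later).  The database
# filename is hypothetical.
conn = sqlite3.connect("events.db")
for name, week_counts in conn.execute("""
    SELECT name,
           group_concat(week || '=' || cnt) AS week_counts
      FROM (SELECT name,
                   strftime('%Y:%W', time, 'unixepoch') AS week,
                   COUNT(*) AS cnt
              FROM events
             GROUP BY name, week)
     GROUP BY name
"""):
    print(name, week_counts)   # e.g. fudge 2009:23=23,2009:24=6,...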

tuinstoel