I've written some very basic tools for grouping, pivoting, unioning and subtotaling datasets sourced from non-DB sources (e.g. CSV files, OLTP systems). The "group by" methods sit at the core of most of these.

However, I'm sure a lot of work has been done on efficient algorithms for grouping data... and I'm sure I'm not using them. My Google-fu has completely failed to turn anything up.

Are there any good online sources or books describing better methods for creating grouped data?

Or should I just start looking at the MySQL source, or something similar?

+4  A: 

One very handy way to "group by" some field (or a set of fields and expressions, but I'll say "field" for simplicity!-) is when you can arrange to walk over the rows before grouping (RBG) in sorted order. You don't actually care about the sorting itself (except in the common case where an ORDER BY is also present and happens to be on the same field as the GROUP BY), but about its side effect: all rows in RBG with the same value of the grouping field come right after each other. So you can accumulate until the grouping field changes, then emit/yield the results accumulated so far, and reinitialize the accumulators from the new row (the one with a different value of the grouping field). Just make sure to initialize the accumulators at the very start, and to emit/yield the accumulated results at the very end, of course.

If this doesn't work, maybe you can hash the grouping field and use a hash table to hold the results being accumulated for each group. At each row in RBG, hash the grouping field and check whether it is already present as a key in the hash table; if not, put it there with accumulators suitably initialized from that row, otherwise update the accumulators per that row. You just emit everything at the end. The problem, of course, is that you keep taking up more memory until the end!-)

These are the two fundamental approaches. Rough Python sketches of each are below, BTW, in case they help.
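
A minimal sketch of the first (sorted-walk) approach, assuming the input is a sequence of tuples and using a plain sum as the only aggregate; the column indexes and sample rows are purely illustrative:

    def group_by_sorted(rows, key_index, value_index):
        """Stream grouped sums over rows already sorted on the key column."""
        current_key, total = None, 0.0
        for row in rows:
            key = row[key_index]
            if key != current_key:
                if current_key is not None:
                    yield current_key, total   # emit the finished group
                current_key, total = key, 0.0  # reinitialize the accumulator
            total += float(row[value_index])
        if current_key is not None:
            yield current_key, total           # emit the final group

    # Rows must already be (or be made) sorted on the grouping column.
    rows = sorted([("b", "x", "3"), ("a", "y", "1"), ("a", "z", "2")])
    print(list(group_by_sorted(rows, 0, 2)))   # [('a', 3.0), ('b', 3.0)]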
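
And a sketch of the second (hash table) approach, again with an illustrative sum aggregate; a plain dict does the hashing, and the rows can arrive in any order, but nothing is emitted until the end:

    def group_by_hash(rows, key_index, value_index):
        """Accumulate grouped sums in a dict, keyed by the grouping field."""
        totals = {}
        for row in rows:
            key = row[key_index]
            totals[key] = totals.get(key, 0.0) + float(row[value_index])
        return totals   # everything is emitted only at the end

    rows = [("b", "x", "3"), ("a", "y", "1"), ("a", "z", "2")]
    print(group_by_hash(rows, 0, 2))   # {'b': 3.0, 'a': 3.0}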

Alex Martelli
Thanks Alex, these make complete sense and I'm using the first. Do you know of any good sources for algorithms in this space, or is this just personal experience?
Mark Nold
Sorry, it's basically personal experience, from back when one had to implement such things oneself (on top of ISAM or such rudimentary early tools as bsd-db), since lightweight embedded SQL engines were non-existent or very pricey (nowadays I tend to just use SQLite when I need an embedded engine ;-).
Alex Martelli
Good point Alex, I've looked at SQLite and it does look good. Looking back, I seem to have implemented this same solution in various languages, from C and Perl to VBA :)
Mark Nold
+1  A: 

You should check out OLAP databases. OLAP allows you to create a database of aggregates meant to be analyzed in a "slice and dice" fashion.

Aggregate measures such as counts, averages, minimums, maximums, sums and standard deviations can be quickly analyzed across any number of dimensions using an OLAP database.

See this introduction to OLAP on MSDN.

jn29098
Thanks jn29098. Probably not what I was looking for, but a nice link on OLAP :)
Mark Nold
A: 

Give an example CSV file and the type of result wanted, and I might be able to rustle up a solution in Python for you.

Python has the csv module and list/generator comprehensions, which can help with this sort of thing.
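
As a rough sketch of what that might look like (the file name, the lack of a header row, and the column layout are all assumptions), grouping by one column and summing another could be done with just the csv module and comprehensions:

    import csv

    # Hypothetical example: group the rows of "data.csv" by column 0 and
    # sum column 2. Assumes no header row and numeric values in column 2.
    with open("data.csv", newline="") as f:
        rows = list(csv.reader(f))

    keys = {row[0] for row in rows}            # the distinct grouping values
    totals = {key: sum(float(row[2]) for row in rows if row[0] == key)
              for key in keys}                 # note: rescans rows per key
    print(totals)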

  • Paddy.
Paddy3118
Thanks Paddy, I'm looking more for an algorithm than a specific solution (I have one). I'm optimising, so I want to make sure I've done nothing stupid :)
Mark Nold