views:

239

answers:

3

Hi,

I'm working on a Django project where I need to provide a lot of different visualizations on the same data (for example average of a value for each month, for each year / for a location, etc...).

I have been using an OLAP database once in college, and I thought that it would fit my needs, but it appears that it is much too heavy for what I need. Actually the volume of data is not very big, so I don't need any optimization, just a way to present different visualizations of the same data without having to write 1000 times the same code.

So, to recap, I need a python library:

  • to emulate a multidimensional database (OLAP style would be nice because I think it is quite convenient : star structure, and everything)
  • non-intrusive, because I can't modify anything on the existing MySQL database
  • easy-to-use, because otherwise there's no point in replacing some overhead by another.
+1  A: 

Why not just use the standard ORM aggregation functions: http://docs.djangoproject.com/en/dev/topics/db/aggregation/

Wherever you think performance is going to be a hit, you can denormalize that field.

Lakshman Prasad
No ... Writing all these statistics calculations with Django Orm is extremely painful. As I said, I don't care about performance. I just need a convenient, powerful and flexible way to calculate aggregations on my data. I have been trying quite a lot, and writing quite a lot of code, and django ORM is clearly not suited: That is what multidimensional databases are actually made for !
sebpiq
A: 

You have Python defaultdict dictionaries.

If your data is small, simply query all of it and load up a bunch of dictionaries with counts and sums and what-not.

A "star schema" is just a poor-person's inverted database where dimensions (i.e., dictionaries) reference lists of rows. You can summarize those lists of rows to create summary dictionaries.

d1_sum= defaultdict( int )
d1_count= defaultdict( count )
d2_sum = defaultdict( int )
for row in MyFactTable.objects.all():
    d1_sum[row.attr1] += row.fact1
    d1_count[row.attr1] += 1
    d2_sum[row.attr2] += some_function( row.fact2 )
    etc.
S.Lott
Hi, thanks for your answer, you were right, but I decided to develop my own stuff because I still ended up writing a lot of code. You might want to check it out : https://code.google.com/p/django-cube/
sebpiq
+2  A: 

Ok ... I finally came up with my own solution ( https://code.google.com/p/django-cube/ ), because I couldn't find what I wanted.

With a model like this:

class Instrument(models.Model):
    name = models.CharField(max_length=100)

class Musician(models.Model):
    firstname = models.CharField(max_length=100)
    instrument = models.ForeignKey(Instrument)

Create a cube:

>>> c = Cube(['instrument__name', 'firstname'], Musician.objects.all(), len)
... #Cube(dimensions, queryset, aggregation_function)
... #You can use the Django field-lookup syntax for dates and foreign keys !!!

Query the cube along one (or several) dimension(s):

>>> c.measure_dict('firstname', 'instrument__name', full=False) == {
...     'Miles': {
...         'trumpet': {'measure': 1},
...         'sax': {'measure': 0},
...         'piano': {'measure': 0},
...     },
...     'John': {
...         'trumpet': {'measure': 0},
...         'sax': {'measure': 1},
...         'piano': {'measure': 4},
...     },
... }

Use the custom template tags, etc ...

sebpiq
Now the syntax is different for more ease of use, ...
sebpiq