I am trying to figure out the best way to model a spreadsheet (from the database point of view), taking into account:

  • The spreadsheet can contain a variable number of rows.
  • The spreadsheet can contain a variable number of columns.
  • Each column holds values of a single type, but that type is not known in advance (integer, date, string).
  • It has to be easy (and performant) to generate a CSV file containing the data.

I am thinking about something like:

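from django.db import models

# Declared parent-first so every ForeignKey target already exists.
# on_delete is mandatory in Django 2.0+; CASCADE is an assumed choice.
class Spreadsheet(models.Model):
    name = models.CharField(max_length=100)
    creation_date = models.DateField()

class Column(models.Model):
    spreadsheet = models.ForeignKey(Spreadsheet, on_delete=models.CASCADE)
    name = models.CharField(max_length=100)
    type = models.CharField(max_length=100)  # e.g. integer, date, string

class Cell(models.Model):
    column = models.ForeignKey(Column, on_delete=models.CASCADE)
    row_number = models.IntegerField()
    value = models.CharField(max_length=100)  # always stored as text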

Can you think of a better way to model a spreadsheet? My approach stores every value as a string. I am worried that generating the CSV file will be too slow.

+2  A: 

You may want to study EAV (Entity-attribute-value) data models, as they are trying to solve a similar problem.

Entity-Attribute-Value - Wikipedia

Turnkey
+4  A: 

From a relational viewpoint:

Spreadsheet <-->> Cell : RowId, ColumnId, ValueType, Contents

There is no requirement for row and column to be entities, but you can model them that way if you like.
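
As a rough illustration of that one-to-many layout, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names mirror the line above; the SQL types, the composite primary key, and the helper names (`make_db`, `set_cell`, `get_cell`) are assumptions made for the example, not part of the answer.

```python
import sqlite3

# Hypothetical schema: one spreadsheet has many cells, and a cell is
# identified by (spreadsheet, row, column). Types and keys are assumptions.
SCHEMA = """
CREATE TABLE spreadsheet (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE cell (
    spreadsheet_id INTEGER NOT NULL REFERENCES spreadsheet(id),
    row_id         INTEGER NOT NULL,
    column_id      INTEGER NOT NULL,
    value_type     TEXT NOT NULL,
    contents       TEXT,
    PRIMARY KEY (spreadsheet_id, row_id, column_id)
);
"""

def make_db():
    """Create an in-memory database holding the two tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn

def set_cell(conn, sheet_id, row_id, column_id, value_type, contents):
    """Insert a cell, or overwrite it if it already exists (SQLite syntax)."""
    conn.execute(
        "INSERT OR REPLACE INTO cell VALUES (?, ?, ?, ?, ?)",
        (sheet_id, row_id, column_id, value_type, contents),
    )

def get_cell(conn, sheet_id, row_id, column_id):
    """Return (value_type, contents), or None when the cell is empty."""
    return conn.execute(
        "SELECT value_type, contents FROM cell "
        "WHERE spreadsheet_id = ? AND row_id = ? AND column_id = ?",
        (sheet_id, row_id, column_id),
    ).fetchone()
```

Only non-empty cells get a row, so a sparse sheet stays small; the cost is that reading a whole sheet back means reassembling rows from individual cells.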

Steven A. Lowe
This requires a PIVOT to be useful; pivoting is complex and hard to understand for the new user. And if your DB doesn't have a PIVOT function, your app won't scale worth a damn. *avoids Knuth's steely gaze*
Will
@[Will]: no one said modeling a spreadsheet in a relational database would be easy ;-)
Steven A. Lowe
To generate the CSV file you do not need to pivot. Instead "SELECT RowId, ColumnId, contents FROM cell ORDER BY RowId, ColumnId" and fill in extra columns when ColumnId jumps by more than 1, and extra rows when RowId jumps by more than 1.
WW
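
The gap-filling idea in the comment above might look roughly like this in Python. It is a sketch under assumptions: `cells_to_csv` is a hypothetical helper, the input is a list of `(row_id, column_id, contents)` tuples already sorted the way the ORDER BY would return them, and ids are 1-based.

```python
import csv
import io

def cells_to_csv(cells):
    """Build CSV text from sorted (row_id, column_id, contents) tuples.

    Missing cells become empty fields and missing rows become empty lines,
    so no PIVOT is needed on the database side.
    """
    cells = list(cells)
    if not cells:
        return ""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # Sheet width is the largest column id seen (could also be a MAX() query).
    width = max(column_id for _, column_id, _ in cells)
    current_row_id = 1
    current = [""] * width
    for row_id, column_id, contents in cells:
        # Flush the row in progress, plus any fully empty rows, when RowId jumps.
        while current_row_id < row_id:
            writer.writerow(current)
            current = [""] * width
            current_row_id += 1
        # Columns that were skipped simply keep their empty-string default.
        current[column_id - 1] = contents
    writer.writerow(current)
    return out.getvalue()
```

For example, `cells_to_csv([(1, 1, "a"), (1, 3, "c"), (3, 2, "x")])` yields three lines: `a,,c`, then `,,`, then `,x,`.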
+1  A: 

The best solution greatly depends on the way the database will be used. Try to identify a couple of the top use cases you expect and then decide on the design. For example, if there is no use case for fetching the value of a single cell from the database (the data is always loaded at row level, or even in groups of rows), then there is no need to store a 'cell' as such.

Aleris
It will be used as a web-based MS Excel clone. The user mostly edits cells, but adding and deleting rows and columns are also common operations. As a plus, the user can download the data as CSV at any time.
Guido
+1  A: 

Databases aren't designed for this. But you can try a couple of different ways.

The naive way to do it is to do a version of One Table To Rule Them All. That is, create a giant generic table, all types being (n)varchars, that has enough columns to cover any foreseeable spreadsheet. Then, you'll need a second table to store metadata about the first, such as what Column1's spreadsheet column name is, what type it stores (so you can cast in and out), etc. Then you'll need triggers to run against inserts that check the data coming in and the metadata to make sure the data isn't corrupt, etc etc etc. As you can see, this way is a complete and utter cluster. I'd run screaming from it.
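
To make "cast in and out" concrete, here is a small sketch of the metadata lookup done in application code rather than in database triggers (a deliberate swap for the sake of a runnable example). The type names, the `COLUMN_TYPES` mapping standing in for the metadata table, and the function names are all hypothetical.

```python
from datetime import date

# Hypothetical stand-in for the metadata table: generic column -> stored type.
COLUMN_TYPES = {"Column1": "integer", "Column2": "date", "Column3": "string"}

# For each type name: (parse stored text, format a Python value as text).
CASTS = {
    "integer": (int, str),
    "date": (date.fromisoformat, date.isoformat),
    "string": (str, str),
}

def cast_out(column, stored_text):
    """Turn the varchar stored in the generic table into a typed value."""
    parse, _ = CASTS[COLUMN_TYPES[column]]
    return parse(stored_text)

def cast_in(column, value):
    """Turn a typed value into the text that goes into the generic table."""
    _, fmt = CASTS[COLUMN_TYPES[column]]
    return fmt(value)

def is_valid(column, stored_text):
    """The check a trigger would perform: does the text parse as the type?"""
    try:
        cast_out(column, stored_text)
        return True
    except ValueError:
        return False
```

Every read and write has to go through a lookup like this, which is a fair picture of why the approach gets messy quickly.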

The second option is to store your data as XML. Most modern databases have XML data types and some support for XPath within queries. You can also use XSDs to provide some kind of data validation, and XSLT to transform that data into CSVs. I'm currently doing something similar with configuration files, and it's working out okay so far. No word on performance issues yet, but I'm trusting Knuth on that one.
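
As a rough idea of what the XML route could look like, the sketch below converts a stored document to CSV. It uses Python's `xml.etree.ElementTree` in place of XSLT, and the element names (`spreadsheet`, `columns`, `column`, `row`, `cell`) are an invented layout for illustration, not a schema anyone above proposed.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical document layout; a real one would be pinned down by an XSD.
SHEET_XML = """
<spreadsheet name="budget">
  <columns>
    <column name="item" type="string"/>
    <column name="cost" type="integer"/>
  </columns>
  <row><cell>rent</cell><cell>900</cell></row>
  <row><cell>food</cell><cell/></row>
</spreadsheet>
"""

def xml_to_csv(xml_text):
    """Write a header line from <column> names, then one line per <row>."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([col.get("name") for col in root.iter("column")])
    for row in root.iter("row"):
        # An empty <cell/> has text None, so map it to an empty field.
        writer.writerow([cell.text or "" for cell in row.findall("cell")])
    return out.getvalue()
```

Here the whole sheet travels as one value, so CSV export is a single read plus a transform, while editing one cell means rewriting the document.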

The first option is probably much easier to search and faster to retrieve data from, but the second is probably more stable and definitely easier to program against.

It's times like this I wish Celko had a SO account.

Will