I'm creating a data-entry application where users are allowed to create the entry schema.

My first version of this just created a single table per entry schema, with each field of an entry spanning one or more columns (for complex types) with the appropriate data type. This allowed for "fast" querying (on small datasets, as I didn't index all columns) and simple synchronization when the data entry was distributed across several databases.
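For illustration, a user schema with an "age" field and a complex "address" field would produce roughly a table like this (all names made up):

CREATE TABLE PERSON_ENTRIES (
    ENTRYID    INTEGER PRIMARY KEY,
    AGE        INTEGER,
    -- a complex type spans multiple columns:
    ADDRSTREET VARCHAR(100),
    ADDRCITY   VARCHAR(100)
);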

I'm not quite happy with this solution though; the only positive thing is its simplicity. I can only store a fixed number of columns, I need to create indexes on all columns, and I need to recreate the table on schema changes.

Some of my key design criteria are:

  • Very fast querying (using a simple domain-specific query language)
  • Writes don't have to be fast
  • Many concurrent users
  • Schemas will change often
  • Schemas might contain many thousands of columns
  • The data entries might be distributed and need synchronization.
  • Preferably MySQL and SQLite; databases like DB2 and Oracle are out of the question.
  • Using .Net/Mono

I've been thinking of a couple of possible designs, but none of them seems like a good choice.

Solution 1: A union-like table containing a Type column and one nullable column per type.

This avoids joins, but will definitely use a lot of space.
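Roughly what I have in mind (table and column names are just illustrative):

CREATE TABLE ENTRYVALUE (
    ENTRYID INTEGER,            -- which entry the value belongs to
    COLID   SMALLINT,           -- which schema column it is
    TYPE    SMALLINT,           -- discriminator: which value column is set
    INTVAL  INTEGER,            -- exactly one of these is non-NULL per row
    REALVAL REAL,
    TEXTVAL VARCHAR(255),
    DATEVAL DATETIME
);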

Solution 2: A key/value store. All values are stored as strings and converted when needed.

This also uses a lot of space, and of course I hate having to convert everything to strings.
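Again just a sketch with made-up names:

CREATE TABLE KEYVALUE (
    ENTRYID INTEGER,
    COLID   SMALLINT,
    VALUE   VARCHAR(255)        -- everything serialized to/from string
);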

Solution 3: Use an XML database, or store values as XML.

Without any experience with this, I would guess it's quite slow (at least in the relational model, unless there is some very good XPath support). I would also like to avoid an XML database, as other parts of the application fit better in a relational model, and being able to join the data is helpful.
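(For what it's worth, MySQL does have basic XPath support through ExtractValue, so a query against a hypothetical Entries table with an XML Data column could look like the line below; SQLite has nothing comparable built in.)

SELECT EntryId FROM Entries WHERE ExtractValue(Data, '/entry/age') = '42';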

I can't help thinking that someone has solved (some of) this already, but I'm unable to find anything. I'm not quite sure what to search for either...

I know market research firms do something like this for their questionnaires, but there are few open-source implementations, and the ones I've found don't quite fit the bill.

PSPP has much of the logic I'm thinking of: primitive column types, many columns, many rows, fast querying and merging. Too bad it doesn't work against a database. And of course, I don't need 99% of the functionality it provides, while a lot of what I do need is not included.

I'm not sure this is the right place to ask such a design-related question, but I hope someone here has some tips, knows of existing work, or can point me to a better place to ask.

Thanks in advance!

A: 

Have you already considered the most trivial solution: having one table for each of your data types and storing the schema of your datasets in the database as well? The simplest version:

-- DATASET: one row per virtual "table"
CREATE TABLE DATASET (
    ID   INTEGER PRIMARY KEY,
    Name VARCHAR(100)           -- name for the dataset/table
);

-- COLUMNSCHEMA: specifies the columns for one dataset
CREATE TABLE COLUMNSCHEMA (
    DATASETID INTEGER REFERENCES DATASET(ID),
    COLID     SMALLINT,         -- unique # of the column within its dataset
    Name      VARCHAR(100),
    DataType  VARCHAR(20)       -- "varchar", "int", whatever
);

-- DATAROW: one row per "row" of a dataset
-- (named DATAROW since ROW is a reserved word in newer MySQL)
CREATE TABLE DATAROW (
    ID        INTEGER PRIMARY KEY,
    DATASETID INTEGER REFERENCES DATASET(ID)
);

-- ColumnData: one such table for each datatype, e.g.:
CREATE TABLE COLUMNDATA_INT (
    ROWID INTEGER REFERENCES DATAROW(ID),
    COLID SMALLINT,
    DATA  INTEGER
);

CREATE TABLE COLUMNDATA_VARCHAR (
    ROWID INTEGER REFERENCES DATAROW(ID),
    COLID SMALLINT,
    DATA  VARCHAR(255)
);

To query a dataset (a virtual table), you must then dynamically construct a SQL statement using the schema information in the COLUMNSCHEMA table.
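For example, if COLUMNSCHEMA says dataset 1 has an int column "Age" (COLID 1) and a varchar column "Name" (COLID 2), the generated statement could look roughly like this (the dataset, column names and IDs are made up for illustration):

SELECT r.ID,
       a.DATA AS Age,
       n.DATA AS Name
FROM DATAROW r
LEFT JOIN COLUMNDATA_INT     a ON a.ROWID = r.ID AND a.COLID = 1
LEFT JOIN COLUMNDATA_VARCHAR n ON n.ROWID = r.ID AND n.COLID = 2
WHERE r.DATASETID = 1;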

Kyberias