tags:

views:

66

answers:

1

While working on a project that will store a whole bunch of (completely different) forms I'm facing a design issue on how to store the values while keeping the database usable.

Brief description: each 'document' contains a variable amount of questions (though a consistent amount per type of document) and matching answers.

The most usable approach I've come up with is the following, here I've grouped documents by 'type', which identifies which questions belong to the document, which in return has the answers to the matching questions.

  +---------------+ 1      n +-----------+
  | DocumentType  |----------| Questions |
  +---------------+ Has many +-----------+
         |1                      1|
         |n Is of type           n| Belongs to
  +---------------+ 1      n +-----------+
  | DocumentEntry |----------|  Answers  |
  +---------------+ Has many +-----------+

The drawback here is that queries on fetching the documents that have question A with answer B become rather complex and likely rather slow when the database grows larger, which it rapidly will.

I'm wondering if I've stumbled across the optimal approach to store the data or if there is some neat solution out there that I might've missed.

+1  A: 

You've faced a common problem: Trying to use something static (database with predefined structure) for something dynamic (bunch of individual data sets which only have one thing in common: they come from forms). What you want is doable with databases but would be significantly easier to do without, however since I assume you really do want to use a database for this, here's what I'd do:

  • You have a document (or questionnaire) which contains multiple questions. These both are generic enough and require their own tables, just as you've done so far.
  • Each question has a type which defines what kind of question it is (multiple select, freeform, select one...) and of course the question also has options. So that's two tables more. The reasoning here is that decoupling these from the actual question allows for certain level of abstraction exist and your queries will still be somewhat simple although possibly loooooong.

So, each document has 1..n to questions, each question has 1 type and 1..n options. Skipping a bit, here's what I'm thinking of with link tables etc.

Document
    bigint id
DocumentQuestions
    bigint document_id
    bigint question_id
Question
    bigint id
    varchar question
QuestionType
    bigint question_id
    bigint type_id
Type [pre-filled table with id:type pairs, such as 1=freeform, 2=select one etc.]
QuestionOptions
    bigint id
    bigint question_id
    varchar description
    varchar value

Answers
    bigint id
    bigint document_id
    [etc. such as user_id]
QuestionAnswers
    bigint answer_id
    bigint question_id
    bigint questionoptions_id

This sort of design permits several things:

  • Questions themselves are reusable, very handy if you're making a generic "answer these x random questions from a pool of y questions".
  • New types can be added easily without breaking existing ones.
  • You can always navigate through the structure quite easily, for example "What was the name of the document for this single question answer I have?" or "how many people have answered incorrectly to this one question?"
  • Because types are separated, you can create a (web) UI which reflects the state in the database easily - better yet, if the type changes you may don't even have to touch your UI code at all.
  • Since each possibly option for a question is its own row in the QuestionOptions table, you can get the actual value very easily.

The obvious problem with this is that it requires quite strict coupling between the tables for integrity and will be a pain to get running properly at start. Also since value in the QuestionOptions is varchar, you need to be able to parse stuff a lot and you may even want to introduce another field for conversion hints.

Hope this helps even though you wouldn't agree with my solution at all.

Esko
Thanks a bunch for your quick input and I'm happy to read your interpretation of my question, which should've been quite a bit more specific no my end.The decoupling the answers and 'type of answer' is indeed a great suggestion, sadly I should've mentioned it's safe to assume here that all answers are simple varchars in this case, no other types will exist. Upvoted for the obvious amount of thought you've given this issue, will accept it as solution unless someone else comes around - thanks a lot.
Leftblank
Thanks. Wait for a while before accepting though, that way others may still answer since I'm not entirely sure all the DB wizards of SO are awake at this hour. Oh and I've done something similar before and this is basically how I did it back then - that's also the source of the warning in the first paragraph, by scrapping the database with a completely proprietary solution we saw a **28 000%** performance increase in this system that used a similar thing (*not for question but something similar structurally*).
Esko
Probably not completely related indeed, but I'm curious what the solution was, would you mind sharing the name/type of proprietary solution used? Thanks!
Leftblank
Can't unfortunately tell what the end result was, but lets just say we all learned alot from this paper: http://labs.google.com/papers/mapreduce.html Yes, it's quite a leap from that paper to a solution that worked with a similar system but in the end everyone was pleased.
Esko