views:

295

answers:

3

Pardon me for the woolly question - I'm not really that familiar with OLAP & cubes. Let me explain my situation...

I'd like to build a database to store questionnaire results, where there might be a few dozen questions per questionnaire. Having gathered a few thousand completed questionnaires, I'd like to analyze the results, and that sounds like a good candidate for OLAP type stuff (of which I know very little). I need to be able to run queries on "all male respondents age 20-30 who own a dog" - i.e. combining the answers to "how old are you", "do you own a dog", etc.

I also want to be able to store the results of next month's survey, and the month after that, etc., and run queries showing this month versus last, etc. So far, so good, I assume.

Here's the nub of my question: whereas this month my questionnaire might have questions about sex, age & dog ownership, next month's questionnaire might include a question about (say) eye color. It might (or might not) also drop some questions. Is that do-able in the OLAP world, or do you need to know all the "dimensions" (if I'm using the right term) in advance when you design your cube?

Also, if I'm running several different surveys with different-but-overlapping questions, can I store them all in the same cube and run queries across surveys? Each survey might have a few dozen questions, with a couple of dozen overlapping with other surveys. Do OLAP systems cater for this sort of thing? I just don't know how rigid they are, and whether they are in fact appropriate for this kind of usage.

Any help greatly appreciated.

PS. Before someone suggests it, I did just buy Kimball's Data Warehouse Toolkit but haven't had a chance to read it yet. (I suspect it may not directly answer this question anyway).

+1  A: 

I'll start by saying that I'm an OLAP newbie too but I think I have a handle on what you are looking to achieve.

In effect your questions are one of your dimensions, the answer to that question being part of the fact table, i.e. the fact table has the answer and has dimensions associated with it for age, sex, locality (perhaps), questions. It may feel a bit back to front but that's something that I'm coming to terms with for OLAP.

You might also want another dimension related to question that groups them into questionnaires but that might just be a value in the question dimension itself, i.e. Question { QuestionnaireID = 1, QuestionNumber = 4, QuestionText = "Do you own a dog?" }.

Not sure if that helps but hopefully will give you some ideas if nothing else.

Lazarus
Thanks for your answer. I'm not 100% sure I follow, though. Are you saying that "question" is one dimension, or many? (I might have dozens of questions, and I read somewhere that it's not a good plan to have more than 12 dimensions in a cube). Maybe what you mean is that *some* questions are expressed as dimensions (i.e. those that I need to cut the data by), but then what happens to the others? Having a single dimension for questions also sounds fraught, as the answers to different questions can vary widely - some might have a numeric answer, some will be multiple choice, some dates...
Gary McGill
A: 

Another OLAP newbie here as well...

1) I only have experience creating OLAP cubes with Mondrian (Pentaho), which does allow you to revise the cube's schema, which is just an XML file, and rebuild them (or in Pentaho-speak, publish). So for that platform, anyway, there's no such requirements for knowing all your dimensions ahead of time

2) I agree with Lazurus' recommendation about creating a dimension of questions. It's not a requirement that each of your "facts" has a value that's present in all dimensions, so if you were to look across the dimension for "Question n", then I believe it should only give you data for the questionnaires where "Question n" is a relevant dimension.

Kyril
Thanks. I didn't mention that I'd like my users to have control of their questionnaires, so ideally they'd use a (web) UI to add questions - this "re-publish the cube" step sounds a bit clunky; it's good to know I need to worry about that.Sadly, there's no guarantee that a question will have the same question number in all questionnaires, but I guess all I really need is some ID that can be used to tie related questions together.Nobody so far is saying that the number of quesions (each of which might have dozens of possible answers) is a problem. That's encouraging.
Gary McGill
If you're have any experiance with relational modelling, try creating a relational model of your data first. After that (and after reading Kimball's book), you will have much better idea of how to translate relational design to dimensional design.For questions, I'd create one dimenshion with all questions and answers, in form [QuestionnaireID,QuestionnaireText,QuestionID,QuestionText,AnswerID,AnswerText]. Put user's answers into fact table, and you will be able to slice all answers of all users by questionnaire, question, and answer.
Sergey Volegov
+1  A: 

There is a white paper here which has a section covering modelling survey data. This maybe the sort of thing that you are looking for.

Darren Gosbell
This is perfect, thanks!
Gary McGill