Hello, guys
I am working on the Behavioral Risk Factor Surveillance System (BRFSS), a web query system for questionnaire data that arrives every year.
I'm having a hard time coming up with a suitable database design for it. Here is the problem: each questionnaire contains about 80 questions, split between demographic variables (e.g. age, education) and survey questions (e.g. smoking, health). Every year some questions change and some don't. The data source is an Excel file with 80+ columns. The system has to support queries like:
SELECT <question var>, <demo var>, COUNT(*)
FROM survey
WHERE age IN (...) AND educ IN (...) [etc.]
GROUP BY <question var>, <demo var>
The data is read-only, i.e. it never changes once imported, so it does not have to be heavily normalized. Intuitively, a flat, spreadsheet-like table would do a good job with respect to speed and space. The problem is that the questions change from year to year, so I can't keep every year's data in one such table, and I need to, because of cross-year queries.
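For reference, the flat design I had in mind is roughly this (the field names here are invented; the real file has 80+ such columns):

from django.db import models

class Survey(models.Model):
    # One row per respondent, one column per question -- mirrors the Excel file.
    year = models.IntegerField(db_index=True)
    age = models.IntegerField(null=True)
    educ = models.IntegerField(null=True)
    smoke100 = models.IntegerField(null=True)   # e.g. "Smoked at least 100 cigarettes?"
    genhealth = models.IntegerField(null=True)  # e.g. "General health, coded 1-5"
    # ...plus ~75 more question columns, which is exactly where the
    # year-to-year question changes break this single-table approach.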
I tried normalizing the responses into three tables: questions, responses, and response_values, which can accommodate question variations. But then the responses table spans 98 * 14,268 = 1,398,264 rows for a single year! That's huge, and queries are painfully slow.
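To be concrete, the normalized layout looks roughly like this (model and field names simplified for illustration):

from django.db import models

class Question(models.Model):
    year = models.IntegerField()
    var_name = models.CharField(max_length=32)  # e.g. "SMOKE100"
    text = models.TextField()

class ResponseValue(models.Model):
    # Decodes a question's answer codes, e.g. 1 -> "Yes", 2 -> "No".
    question = models.ForeignKey(Question, on_delete=models.CASCADE)
    code = models.IntegerField()
    label = models.CharField(max_length=200)

class Response(models.Model):
    # One row per (respondent, question) pair -- this is the table that
    # explodes to 98 * 14,268 = 1,398,264 rows for a single year.
    respondent_id = models.IntegerField(db_index=True)
    question = models.ForeignKey(Question, on_delete=models.CASCADE)
    code = models.IntegerField()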
How should I design the database? Any help is appreciated! Thanks in advance!
P.S. I am using Python + Django + SQLite.