views: 619

answers: 4

Hello, guys

I am working on a so-called Behavioral Risk Factor Surveillance System (BRFSS), a web query system for questionnaires that come in every year.

I've had a hard time coming up with a suitable database design for it. Here is the problem: each questionnaire contains about 80 questions, with demographic info (e.g. age, education, etc.) and survey questions (e.g. smoking, health, etc.). Every year some questions change and some don't. The data source is an Excel file with 80+ columns. The system has to support queries like:

SELECT [question var], [demo var], COUNT(*)
FROM survey
WHERE age IN (...) AND educ IN (...) [etc]
GROUP BY [question var]

The data is read-only, i.e. it never changes once imported, so it does not have to be heavily normalized. Intuitively, a single spreadsheet-like table would do a good job with respect to speed and space. This becomes a problem, though, because the questions change from year to year: then we can't keep all the data in that one table, which we would need for cross-year queries.

I tried normalizing the responses into three tables: questions, responses, and response_values, which can support question variations. But then the response table spans 98 * 14268 = 1,398,264 rows for a single year! That's really huge, and queries are painfully slow!
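Roughly, a three-table design like that could look as follows (a sketch only; the exact columns are an assumption on my part):

    CREATE TABLE questions (
        id       INTEGER PRIMARY KEY,
        year     INTEGER,
        varname  TEXT,    -- e.g. 'smoke'
        text     TEXT
    );

    CREATE TABLE responses (
        id  INTEGER PRIMARY KEY    -- one row per respondent per year
    );

    CREATE TABLE response_values (
        response_id  INTEGER REFERENCES responses(id),
        question_id  INTEGER REFERENCES questions(id),
        value        TEXT
    );

    -- 98 questions x 14,268 respondents gives the 1,398,264 rows per year.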

How should I design the database? Any help is appreciated! Thanks in advance!

P.S. I am using Python + Django + SQLite.

+6  A: 

Have you checked DatabaseAnswers to see if there is a schema you could use as a starting point?

Mitch Wheat
awesome resource, thanks
Jason
Thanks! I didn't know that site. But I doubt it will include such a specific case. Anyway, I will take a look.
peng
Well, there is a "Questionnaire - Complex" schema on it, but it uses the same design as the second one I posted, so it would cause a huge table and sluggish queries. Moreover, I'd prefer a lighter-weight design than that one, given the problem.
peng
+1  A: 

Sounds like a case for a star schema.

You would have a (huge) fact table like this:

question_id, survey_id, age_group_id, health_classifier_id, is_smoking ... , answer_value

and denormalised dimension tables:

age_group: group_name, min_age, max_age, age_group_id
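A rough sketch of that schema in SQL (table and column names are only illustrative):

    CREATE TABLE age_group (
        age_group_id  INTEGER PRIMARY KEY,
        group_name    TEXT,
        min_age       INTEGER,
        max_age       INTEGER
    );

    CREATE TABLE answer_fact (
        question_id           INTEGER,
        survey_id             INTEGER,
        age_group_id          INTEGER REFERENCES age_group(age_group_id),
        health_classifier_id  INTEGER,
        is_smoking            INTEGER,
        -- ... further low-cardinality dimension keys ...
        answer_value          TEXT
    );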

1.4 million rows doesn't sound like much for a system like this.

Some databases have special features to support querying on this kind of schema:

On Oracle those would be:

  • 'dimensions' for supporting aggregation along dimensions
  • bitmap indexes for filtering on low-cardinality attributes like age_group_id and is_smoking
  • bitmap join indexes for filtering on low-cardinality attributes in a joined table, i.e. selecting from the fact table but filtering on min_age in the age_group table
  • partitioning to handle large tables
  • materialized views for precalculating aggregation results
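For example, a bitmap index, a bitmap join index, and a materialized view over the fact table sketched above might look roughly like this in Oracle (illustrative names; check the documentation for your version):

    -- bitmap index on a low-cardinality fact column
    CREATE BITMAP INDEX idx_fact_age_group
        ON answer_fact (age_group_id);

    -- bitmap join index: filter the fact table on an attribute of a joined dimension
    CREATE BITMAP INDEX idx_fact_min_age
        ON answer_fact (age_group.min_age)
        FROM answer_fact, age_group
        WHERE answer_fact.age_group_id = age_group.age_group_id;

    -- materialized view precalculating an aggregation
    CREATE MATERIALIZED VIEW mv_answers_by_age AS
        SELECT question_id, age_group_id, COUNT(*) AS cnt
        FROM answer_fact
        GROUP BY question_id, age_group_id;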

There are also specialised database systems for this kind of data, called multidimensional databases.

Check whether there are similar constructs for your database, or consider switching the database engine.

Jens Schauder
+1: answers are the fact table. Questionnaire is a slowly changing dimension.
S.Lott
Thanks! But the star schema won't be able to handle the change in the questionnaire each year, if I'm not wrong. Or should we keep changing the fact table?
peng
+1  A: 

You need at least 3 tables:

1) Questions, which contains the text for each question, with an autoincrement id key.

e.g. (123, "What is the colour of your hair?")

2) Questionnaires, which map Q#'s onto questions.

e.g. question #10 on questionnaire #3 maps onto question #123.

3) Answers, which link each respondent with their questionnaire and the data.

e.g. Bob's response to question #10 on questionnaire #3 is "brown".

You should see how easy it is to add new questionnaires using existing questions, and to add new questions. Yes, there are going to be huge tables, but a good database engine should be able to handle 1M entries easily. You could use partitioning to make it more efficient, such as partitioning by year.

I'll leave converting this into SQL as an exercise.
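(If it helps, here is a minimal sketch of what that SQL might look like; the column names are only a guess:)

    CREATE TABLE questions (
        question_id  INTEGER PRIMARY KEY AUTOINCREMENT,
        text         TEXT
    );

    CREATE TABLE questionnaires (
        questionnaire_id  INTEGER,
        question_number   INTEGER,   -- the Q# within this questionnaire
        question_id       INTEGER REFERENCES questions(question_id)
    );

    CREATE TABLE answers (
        respondent_id     INTEGER,
        questionnaire_id  INTEGER,
        question_number   INTEGER,
        value             TEXT
    );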

dar7yl
This is pretty much what I've done in the normalized design. Before normalizing, a query returned in no time; but using this scheme, it took about 3-4 seconds to process the above query, without indexing and partitioning. I just think this is not very optimal, even though I can index and partition.
peng
A: 

I've also been thinking since my post on Stack Overflow. Here is how I can use the denormalized wide table (80+ columns) to support questions changing every year and also aggregate cross-tabulation. Please comment. Thanks.

  1. Create a new table for each year, with the questions placed as columns, e.g.

    id year age sex educ income ... smoke hiv drink ...

  2. Create two tables, Question and Question_Year, where Question_Year is a many-to-many mapping between questions and years. Then we can populate the list of questions that are available for a specified year, and vice versa (a sketch follows after this list).

  3. Queries within one year are easy. For queries across years, we can use a UNION operator. Since the questions should be compatible among the selected years, UNION is legitimate, e.g.

    SELECT [question var], [demo var], COUNT(*)
    FROM (
        SELECT * FROM survey_2001
        UNION ALL
        SELECT * FROM survey_2003
        UNION ALL
        SELECT * FROM survey_2004
        -- etc.
    )
    WHERE age IN (...) AND educ IN (...) -- etc.
    GROUP BY [question var], [demo var]
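A rough sketch of the tables from steps 1 and 2 (the column names are only placeholders):

    -- step 1: one wide table per year
    CREATE TABLE survey_2001 (
        id      INTEGER PRIMARY KEY,
        age     INTEGER,
        sex     TEXT,
        educ    INTEGER,
        income  INTEGER,
        -- ... other demographic columns ...
        smoke   TEXT,
        hiv     TEXT,
        drink   TEXT
        -- ... other question columns ...
    );

    -- step 2: metadata recording which questions are available in which year
    CREATE TABLE question (
        question_id  INTEGER PRIMARY KEY,
        varname      TEXT,   -- the column name used in the per-year tables
        text         TEXT
    );

    CREATE TABLE question_year (
        question_id  INTEGER REFERENCES question(question_id),
        year         INTEGER
    );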

I suppose UNION is a relational operator that should not decrease the efficiency of an RDBMS, so it does not hurt if I combine many tables with UNION. The engine can also do some query analysis to boost the speed.

I think this one is adequate and simple enough. Please comment. Thanks!

peng