Hello, guys
I am working on the Behavioral Risk Factor Surveillance System (BRFSS), a web query system for questionnaire data that arrives every year.
I'm having a hard time coming up with a suitable database design for it. Here is the problem: each questionnaire contains about 80 questions, split between demographic variables (e.g. age, education) and survey questions (e.g. smoking, health). Every year some questions change and some don't. The data source is an Excel file with 80+ columns. The system has to support queries like:
SELECT <question var>, <demo var>, COUNT(*)
FROM survey
WHERE age IN (...) AND educ IN (...) [etc.]
GROUP BY <question var>, <demo var>
The data is read-only, i.e. it never changes once imported, so it does not have to be heavily normalized. Intuitively, a flat, spreadsheet-like table would do a good job with respect to speed and space. The problem is that the questions change from year to year, so I can't keep every year's data in one such table, and I need to, because of cross-year queries.
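For reference, the flat design I had in mind is roughly this (the field names here are invented; the real file has 80+ such columns):

from django.db import models

class Survey(models.Model):
    # One row per respondent, one column per question -- mirrors the Excel file.
    year = models.IntegerField(db_index=True)
    age = models.IntegerField(null=True)
    educ = models.IntegerField(null=True)
    smoke100 = models.IntegerField(null=True)   # e.g. "Smoked at least 100 cigarettes?"
    genhealth = models.IntegerField(null=True)  # e.g. "General health, coded 1-5"
    # ...plus ~75 more question columns, which is exactly where the
    # year-to-year question changes break this single-table approach.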
I tried normalizing the responses into three tables: questions, responses, and response_values, which can accommodate question variations. But then the responses table spans 98 * 14,268 = 1,398,264 rows for a single year! That's huge, and queries are painfully slow.
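To be concrete, the normalized layout looks roughly like this (model and field names simplified for illustration):

from django.db import models

class Question(models.Model):
    year = models.IntegerField()
    var_name = models.CharField(max_length=32)  # e.g. "SMOKE100"
    text = models.TextField()

class ResponseValue(models.Model):
    # Decodes a question's answer codes, e.g. 1 -> "Yes", 2 -> "No".
    question = models.ForeignKey(Question, on_delete=models.CASCADE)
    code = models.IntegerField()
    label = models.CharField(max_length=200)

class Response(models.Model):
    # One row per (respondent, question) pair -- this is the table that
    # explodes to 98 * 14,268 = 1,398,264 rows for a single year.
    respondent_id = models.IntegerField(db_index=True)
    question = models.ForeignKey(Question, on_delete=models.CASCADE)
    code = models.IntegerField()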
How should I design the database? Any help is appreciated! Thanks in advance!
P.S. I am using Python + Django + SQLite.