views:

142

answers:

3

I'm running an ongoing survey, every quarter. We get people to sign up (where they give extensive demographic info).

Then we get them to answer six short questions with five possible values: much worse, worse, same, better, much better.

Of course, over time we will not get the same participants: some will drop out and some new ones will sign up. So I'm trying to decide how best to build a database and code (I hope to use Python, maybe NumPy?) to allow ongoing collection and analysis by the various categories defined by the initial demographic data. As of now we have 700 or so participants, so the dataset is not too big. For example: demographics (UID, North/South, residential/commercial), then the answers to the six questions for Q1, the same for Q2, and so on. I then need to be able to slice, dice, and average the values of the quarterly answers by the various demographics to see trends over time.

The averaging, grouping, and so forth is modestly complicated by having different participants each quarter.
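To make that concrete, here is a toy sketch (all names and values made up) of the kind of grouping and averaging I mean, with participants differing between quarters:

```python
from collections import defaultdict

# Toy data: (quarter, participant_uid, region, score) -- participants differ per quarter.
responses = [
    ("2010Q1", "u1", "North", 4),
    ("2010Q1", "u2", "South", 2),
    ("2010Q2", "u1", "North", 5),
    ("2010Q2", "u3", "South", 3),  # u2 dropped out, u3 signed up
]

# Accumulate sum and count per (quarter, region), then average.
totals = defaultdict(lambda: [0, 0])
for quarter, uid, region, score in responses:
    totals[(quarter, region)][0] += score
    totals[(quarter, region)][1] += 1

averages = {key: total / count for key, (total, count) in totals.items()}
print(averages)
# {('2010Q1', 'North'): 4.0, ('2010Q1', 'South'): 2.0,
#  ('2010Q2', 'North'): 5.0, ('2010Q2', 'South'): 3.0}
```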

Any pointers to design patterns for this sort of database and analysis? Is this a sparse matrix?

+1  A: 

This is a Data Warehouse. Small, but a data warehouse.

You have a Star Schema.

You have Facts:

  • response values are the measures

You have Dimensions:

  • time period. This has many attributes (year, quarter, month, day, week, etc.) This dimension allows you to accumulate unlimited responses to your survey.

  • question. This has some attributes. Typically your questions belong to categories, product lines, focus areas, or something else. You can have lots of question "category" columns in this dimension.

  • participant. Each participant has unique attributes and reference to a Demographic category. Your demographic category can -- very simply -- enumerate your demographic combinations. This dimension allows you to follow respondents or their demographic categories through time.
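A minimal sketch of that star schema, using SQLite from Python (the table and column names here are hypothetical, not a definitive design):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_period      (period_id INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
CREATE TABLE dim_question    (question_id INTEGER PRIMARY KEY, text TEXT, category TEXT);
CREATE TABLE dim_participant (participant_id INTEGER PRIMARY KEY, region TEXT, zone TEXT);
CREATE TABLE fact_response (
    period_id      INTEGER REFERENCES dim_period,
    question_id    INTEGER REFERENCES dim_question,
    participant_id INTEGER REFERENCES dim_participant,
    score          INTEGER  -- 1 = much worse .. 5 = much better
);
""")

# A few sample rows (made up).
conn.execute("INSERT INTO dim_period VALUES (1, 2010, 1)")
conn.execute("INSERT INTO dim_question VALUES (1, 'Overall outlook?', 'sentiment')")
conn.execute("INSERT INTO dim_participant VALUES (1, 'North', 'residential')")
conn.execute("INSERT INTO dim_participant VALUES (2, 'South', 'commercial')")
conn.executemany("INSERT INTO fact_response VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, 4), (1, 1, 2, 2)])

# Slice and dice: average score per region per quarter.
rows = conn.execute("""
    SELECT p.region, d.year, d.quarter, AVG(f.score)
    FROM fact_response f
    JOIN dim_participant p ON p.participant_id = f.participant_id
    JOIN dim_period      d ON d.period_id = f.period_id
    GROUP BY p.region, d.year, d.quarter
""").fetchall()
print(rows)
```

Every "slice and dice" query has the same shape: join the fact table to whichever dimensions you want, then GROUP BY their attributes.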

Read Ralph Kimball's The Data Warehouse Toolkit and follow those design patterns. http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimensional/dp/0471200247
Buy the book. It's absolutely essential that you fully understand it all before you start down a wrong path.

Also, since you're doing data warehousing, look at all the [data-warehousing] questions on Stack Overflow. Read every data warehousing blog you can find.

There's only one relevant design pattern -- the Star Schema. If you understand that, you understand everything.

S.Lott
This answer begins to get at the issue. I'm not familiar with Ralph Kimball. The last bullet point is what I'm hung up on: trying to figure out how to structure the data/tables for the easiest processing.
dartdog
Thanks for the pointers, but I'm a bit over on my reading commitments, and I know my problem is simple, so I guess I'll hit the blogs. I'll keep the book in mind for the future, and I'll check references to the Star Schema.
dartdog
@dartdog: Your problem is not simple. Rushing into a bad solution will -- in the long run -- be a poor investment of your time.
S.Lott
Hey, I agree; that's why I'm asking! But I don't want to read up on the whole subject for what is a fairly simple survey!
dartdog
@dartdog. (1) I gave you the entire design. (2) If you didn't understand it, you should read *something* on the topic. All that's left is for me to write the code for you, which seems inappropriate. (3) It isn't "fairly simple". It's serious business.
S.Lott
Well I think I now have enough to get started, thanks
dartdog
A: 

On the analysis, if your six questions have been posed in a way that would lead you to believe the answers will be correlated, consider conducting a factor analysis on the raw scores first. Often comparing the factors across regions or customer types has more statistical power than comparing across questions alone. Also, the factor scores are more likely to be normally distributed (they are the weighted sum of 6 observations) while the six questions alone would not be. This allows you to apply t-tests based on the normal distribution when comparing factor scores.

One watchout, though. If you assign numeric values to answers - 1 = much worse, 2 = worse, etc. - you are implying that the distance between "much worse" and "worse" is the same as the distance between "worse" and "same". This is generally not true: you might really have to screw up to get a vote of "much worse", while just being a passive screw-up might get you a "worse" score. So assigning cardinal values (numerics) to ordinal data (an ordering) introduces a bias of its own.
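The point above can be shown with a toy example: the same ordinal answers can rank two groups differently depending on which (hypothetical) numeric coding you choose.

```python
# Two alternative numeric codings for the same ordinal labels.
linear = {"much worse": 1, "worse": 2, "same": 3, "better": 4, "much better": 5}
# Hypothetical asymmetric coding where "much worse" is treated as far more severe.
asym = {"much worse": -4, "worse": -1, "same": 0, "better": 1, "much better": 2}

group_a = ["much worse", "better", "much better"]
group_b = ["same", "same", "same"]

def avg(labels, coding):
    """Average numeric score of a list of ordinal labels under a given coding."""
    return sum(coding[x] for x in labels) / len(labels)

# Under the linear coding group A looks better; under the asymmetric one, group B does.
print(avg(group_a, linear), avg(group_b, linear))  # ~3.33 vs 3.0
print(avg(group_a, asym), avg(group_b, asym))      # ~-0.33 vs 0.0
```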

The unequal number of participants per quarter isn't a problem: there are t-tests (e.g., Welch's t-test) that deal with unequal sample sizes.
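One such test is Welch's t-test, which assumes neither equal sample sizes nor equal variances. A minimal stdlib-only sketch (the data here is invented for illustration):

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom
    (Welch-Satterthwaite) for two samples of unequal size/variance."""
    va, vb = variance(a), variance(b)  # sample variances
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb            # squared standard error of the mean difference
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

scores_q1 = [3, 4, 4, 5, 3, 4]        # one quarter, 6 respondents
scores_q2 = [2, 3, 3, 2, 4, 3, 3, 2]  # next quarter, 8 (different) respondents
t, df = welch_t(scores_q1, scores_q2)
```

Compare `t` against the t distribution with `df` degrees of freedom to get a p-value (e.g., with `scipy.stats.t.sf`).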

Grembo
+1  A: 

Regarding the survey analysis portion of your question, I would strongly recommend looking at the survey package in R (which includes a number of useful vignettes, including "A survey analysis example"). You can read about it in detail on the webpage "survey analysis in R". In particular, you may want to have a look at the page entitled database-backed survey objects which covers the subject of dealing with very large survey data.

You can integrate this analysis into Python with RPy2 as needed.

Shane
This sounds interesting; I have been looking at R and may need to dive into it! The links look very helpful!
dartdog