+2  A: 

You should use `ddply` from plyr. Split on all of the columns if you want to take the different reasons into account; if you want to ignore them, leave those columns out of the split. You'll need to clean up some of the question marks and other extra stuff first, though.

library(plyr)
## Placeholder column names; swap n = ... for the stats you want per group.
x <- ddply(data, c("split_column1", "split_column3"),
           summarize, n = length(split_column1))
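As a rough illustration of the split-and-summarise idea, here is a minimal sketch on made-up data. The data frame `answers` and its columns (`instructor`, `response`, `score`) are assumptions for the example, not the real dataset.

```r
library(plyr)

# Toy data: the column names and values are invented for illustration.
answers <- data.frame(
  instructor = c("A", "A", "A", "B", "B"),
  response   = c("3a", "3a", "2b", "3a", "2b"),
  score      = c(1, 0, 1, 1, 0)
)

# Split on both columns, then compute statistics for each piece.
by_both <- ddply(answers, c("instructor", "response"), summarize,
                 n = length(score), mean_score = mean(score))

# Leave "response" out of the split to ignore it.
by_instructor <- ddply(answers, "instructor", summarize,
                       n = length(score), mean_score = mean(score))
```

Dropping a column from the split vector is all it takes to pool over that column.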
Dan
+5  A: 

Make it "long":

library(reshape)
dnow <- read.csv("~/Downloads/catsample20100504.csv")
dnow <- melt(dnow, id.vars=c("Student", "instructor"))
dnow$variable <- NULL ## since ordering does not matter
subset(dnow, Student%in%c(41,42)) ## see the results

What to do next will depend on the kind of analysis you would like to do. But the long format is the most useful one for irregular data such as yours.

Eduardo Leoni
The frequency of each response, `table(dnow$value)`, and "which students had a particular response?", `with(dnow, sort(Student[value == "3a"]))`, might be good starting questions for an analysis.
Richie Cotton
@Eduardo - Thanks so much! I've never seen the `%in%` syntax before. What does it do?
briandk
@Richie - Thanks for your comments, and your patience! This is my first big data analysis. While I have a sense of what research questions are emerging ("do the frequencies of particular answer patterns vary by instructor?"), I'm not always sure how to instantiate them in R syntax.
briandk
@briandk - `%in%` is basically a binary interface to `match()`, which you can actually look up with `?match` (unlike `?%in%`). You give it a vector of values on the left and a vector of values on the right, and it gives you a logical vector indicating which of the values in the left-hand vector are in the right-hand one. In Eduardo's code, it's giving `subset()` a vector of TRUE and FALSE to indicate which rows are students 41 and 42.
Matt Parker
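To make the `%in%` / `match()` relationship concrete, here is a small base R sketch. The `students` vector is made up for the example.

```r
# Hypothetical student IDs, with 41 appearing twice.
students <- c(40, 41, 42, 43, 41)

# One TRUE/FALSE per element of the left-hand vector.
keep <- students %in% c(41, 42)
print(keep)  # FALSE  TRUE  TRUE FALSE  TRUE

# match() returns positions in the right-hand vector (NA when absent);
# %in% is TRUE exactly where match() finds a position.
same <- !is.na(match(students, c(41, 42)))
print(identical(keep, same))  # TRUE

# The logical vector can index rows or elements directly.
students[keep]
```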
@Matt: You can get help on `%in%` by quoting it, like so `?"%in%"`, though in this case it takes you to the same page as `?match`.
Richie Cotton
@briandk: You'll want something like `with(dnow, tapply(value, instructor, table))`, or visually (using `ggplot2`) `ggplot(subset(dnow, value != ""), aes(value)) + geom_bar() + facet_grid(instructor ~ .)`.
Richie Cotton
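For a sense of what the per-instructor tabulation returns, here is a base R sketch on a made-up long-format data frame. The values in `dnow` below are invented; only the column names follow Eduardo's code.

```r
# Toy long-format data: one row per student response (values are invented).
dnow <- data.frame(
  Student    = c(41, 41, 42, 43, 43),
  instructor = c("X", "X", "X", "Y", "Y"),
  value      = c("3a", "2b", "3a", "3a", ""),
  stringsAsFactors = FALSE
)

# Drop blank responses before counting.
nonblank <- subset(dnow, value != "")

# A list with one frequency table per instructor.
per_instructor <- with(nonblank, tapply(value, instructor, table))

# The same counts as a single instructor-by-response cross-tabulation.
tab <- with(nonblank, table(instructor, value))
```

The cross-tabulation is often easier to read and to pass on to a test or plot, while the `tapply()` form keeps each instructor's table separate.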
+1  A: 

What's the (bigger picture) question you're attempting to answer? Why is this information interesting to you?

Are you just trying to find patterns such as 'if the student does this, then they also likely do this'?

Something I'd consider if that's the case - split the data set into smaller random samples for your analysis to reduce the risk of false positives.
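One possible way to do that in base R is sketched below. It assumes one row per student in a hypothetical data frame `dat`, and the seed and number of samples are arbitrary choices.

```r
set.seed(1)  # arbitrary seed so the split is reproducible

# Hypothetical data: one row per student.
dat <- data.frame(Student = 1:10, score = c(3, 5, 2, 4, 4, 1, 5, 3, 2, 4))

k <- 2  # number of random samples to split into

# Assign each row to one of k groups at random, keeping group sizes balanced.
fold <- sample(rep(seq_len(k), length.out = nrow(dat)))

# A list of k data frames: explore patterns in one, check them in another.
halves <- split(dat, fold)
```

With long-format data, it would probably make more sense to assign whole students to a group rather than individual rows.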

Interesting problem though!

Stray
@Stray - Thanks! Your suggestion's well-taken. My research questions are still in the formative stages, but one thing I plan to do is compare student performance by instructor, and answer patterns by instructor. Most of my early comparisons will involve [ANOVA](http://en.wikipedia.org/wiki/Anova), and sadly, splitting my dataset will kill the [power](http://en.wikipedia.org/wiki/Statistical_power) of those tests.
briandk