tags:

views:

269

answers:

5

I'm planning to write a program in Ruby to analyse some data which has come back from an online questionnaire. There are hundreds of thousands of responses, and each respondent answers about 200 questions. Each question is multiple-choice, so there are a fixed number of possible responses to each.

The intention is to use a piece of demographic data given by each respondent to train a system which can then guess that same piece of demographic data (age, for example) from a respondent who answers the same questionnaire, but doesn't specify the demographic data.

So I plan to use a vector (in the mathematical sense, not in the data structure sense) to represent the answers for a given respondent. This means each vector will be large (over 200 elements), and the total data set will be huge. I plan to store the data in a MySQL database.

So. 2 questions:

1) How should I store this in the database? One row per response to a single question, or one row per respondent? Or something else?

2) I'm planning to use something like the k-nearest neighbour algorithm, or a simple machine learning algorithm like a naive bayesian classifier to learn to classify new responses. Should I manipulate the data purely through SQL or should I load it into memory and store it in some kind of vast array?

Thanks!

+3  A: 

First thing that comes to mind: Storing it in Memory can be absolutely reasonable for processing purposes. Lets say you reserve one byte for each answer, you have a million responses and 200 questions, then you have a 200 MB array. Not small but definitely not memory exhausting on a modern desktop, even with a 32 bit OS.

As for the database I think you should have three tables. One for the respondent with the demographical data, one for the questions, and, since you have a n:m relation between these tables, a third one with the Respondent-ID, the Question-ID and the Answercode.

If you don't need additional data for the questions (like the question-text or something) you can even optimize away the question table.

TToni
+3  A: 

Use an array of arrays, in memory. I just created a 500000x200 array and it required about 500MB of RAM. Easily manageable on a 2GB machine, and many, many orders of magnitude faster than using SQL.

Personally, I wouldn't bother putting the data in MySQL at all. Just Marshal it in and out, and/or use JSON or CSV.

glenn mcdonald
A: 

I'm not a great database person, so I'll just answer #2:

If you'd really like to save on memory (or foresee a situation where there will be a lot more data) you could take the best of both worlds: Use ruby as essentially a data-mining tool. Have it pull some of the data from the DB, then write the results back to the DB (probably under a different table or database altogether). This has the benefit of only using as much memory as you want it to.

Atiaxi
+2  A: 

If you definitely need database storage, and the comments elsewhere about alternatives are worth considering, then I'd advise against storing 200-odd responses in 200-odd rows: you don't seem to have any obvious need for the flexibility that such a design would give and performance across hundreds of thousands of respondents is going to be dire.

Using a RDBMS gives you the ability to store very large amounts of data, access them in a variety of multi-dimensional ways and extend the structure of your data ad hoc over time. But what you gain in flexibility over a flat file (or Marshalled, or other) option you often lose in performance. I have to confess to reaching for third normal form far too early myself. I guess the questions are, how much flexibility in querying do you expect to need, and how much change do you think your data is likely to undergo? If you think you're at the low end of both, consider leaving the SQL on the shelf. If you abstract your data access into a separate layer then changing should be cheap later. Just a thought...

I'd expect you can encode an individual's response in such a way that it can easily be used in code and it's unlikely to take more than 200 characters, less if you use some sort of packing or bit-mapping. I rather like the idea of bit-mapping, come to think of it - it makes simple comparison using something like Hamming distance an absolute breeze.

Mike Woodhouse
A: 

Don't forget that Ruby is a dynamic object language, as such, a simple integer will probably take up more space than a simple int in C. It needs additional space to be able to characterise if it has been 'garnished' with any additional information, methods etc.

fatgeekuk