I've embarked on a project that is proving considerably more complicated than I'd first imagined. I'm trying to plan a system built around boolean (true/false) questions and answers. Users can answer any of a large set of these questions and are then presented with a list of the most similar users, ordered by how similar their answers are.
I've Googled far and wide but still haven't come up with much, so I was hoping somebody could point me in the right direction. I'd like to know:
What is the best data structure and method to store this kind of data? I'd originally assumed I could create two tables, "questions" and "answers", in an SQL database. However, I'm now wondering if it would be simpler to compare two sets of answers if each were stored as a numerical string, e.g. 0 = not answered, 1 = true, 2 = false. When comparing two strings, weights could be added per question ("not answered" = 0, "same answer" = 1, "opposite answer" = -1) and summed to produce a similarity score.
How would I go about comparing two sets of answers? To work out the "similarity" between two sets of answers I'm going to have to write a comparison function. Does anyone know what kind of comparison would best suit this problem? I've looked into sequence alignment and I think this could be the correct way to go, but I'm unsure, as it requires the data to be in a long sequence, and the questions aren't related to each other, so they don't naturally form a sequence.
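For comparison, this is the kind of order-independent function I could write instead, where each user's answers are keyed by question ID rather than by position. It is only a sketch: the name `compare_answers` and the dictionary layout are my own assumptions, and it reuses the +1 / -1 / 0 weighting from above.

```python
from typing import Dict


def compare_answers(a: Dict[int, bool], b: Dict[int, bool]) -> int:
    """Score two users on the questions both have answered.

    Keys are question IDs (hypothetical), values are the true/false answers.
    Questions answered by only one user are ignored, i.e. they contribute 0.
    """
    shared = a.keys() & b.keys()  # question IDs answered by both users
    return sum(1 if a[q] == b[q] else -1 for q in shared)


if __name__ == "__main__":
    user_a = {1: True, 2: False, 3: True}
    user_b = {4: True, 3: False, 2: False}  # different order, different coverage
    print(compare_answers(user_a, user_b))
```

Since the score only depends on matching question IDs, the order of the questions plays no part in it, which is why I'm not sure sequence alignment is the right fit.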
How do I apply this comparison function to a large set of data? Once I've written the comparison function I could just compare each user's answers to every other user's answers. However, this doesn't seem very efficient and probably wouldn't scale well, since the number of comparisons grows with the square of the number of users. I've been looking into cluster analysis methods to automatically group users according to similar answers. Do you think this could work, or does anyone know a better method I could look into?
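The brute-force version I'm worried about would look roughly like this. It is a sketch under my own assumptions: answers are held in memory as a dictionary per user, and the names `score` and `rank_similar_users` are hypothetical.

```python
from typing import Dict, List, Tuple

Answers = Dict[int, bool]  # question ID -> true/false answer


def score(a: Answers, b: Answers) -> int:
    """+1 per shared question answered the same way, -1 per opposite answer."""
    return sum(1 if a[q] == b[q] else -1 for q in a.keys() & b.keys())


def rank_similar_users(target: str, answers_by_user: Dict[str, Answers]) -> List[Tuple[str, int]]:
    """Compare one user against every other user, most similar first."""
    target_answers = answers_by_user[target]
    scored = [
        (user, score(target_answers, answers))
        for user, answers in answers_by_user.items()
        if user != target
    ]
    # Highest score first; ties keep their original order (sorted is stable).
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    users = {
        "alice": {1: True, 2: False, 3: True},
        "bob": {1: True, 2: False, 3: False},
        "carol": {1: False, 2: True},
        "dave": {1: True, 2: False, 3: True},
    }
    print(rank_similar_users("alice", users))
```

Each call touches every other user once, so producing the list for all users means comparing every pair, which is the part I don't expect to scale.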
I'd really appreciate any helpful pointers. Thanks!