I've embarked on a project that is proving considerably more complicated than I'd first imagined. I'm trying to plan a system based around boolean (true/false) questions and answers. Users can answer any questions from a large set and be presented with a list of the most similar users (in order of similarity) based on their answers.

I've Googled far and wide but still haven't come up with much, so I was hoping somebody could point me in the right direction. I'd like to know:

What is the best data structure and method to store this kind of data? I'd originally assumed I could create two tables, "questions" and "answers", in an SQL database. However, I'm now wondering if it would be simpler to compare two sets of answers if they were both stored as numerical strings, e.g. 0 = not answered, 1 = true, 2 = false. When comparing the strings, weights could be applied: "not answered" = 0, "same answer" = 1, "opposite answer" = -1, producing a similarity score.
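
A minimal sketch of that encoding in Python (the function and variable names are my own, just for illustration): each user's answers become a fixed-length string indexed by question position.

# Hypothetical sketch: encode one user's answers as a fixed-length numeric string.
# question_ids is the full, ordered list of question ids; answers maps
# question_id -> True/False for the questions this user actually answered.
def encode_answers(question_ids, answers):
    """Return a string of '0' (unanswered), '1' (true), '2' (false)."""
    digits = []
    for qid in question_ids:
        if qid not in answers:
            digits.append("0")
        elif answers[qid]:
            digits.append("1")
        else:
            digits.append("2")
    return "".join(digits)

# Example: five questions, of which the user answered three.
print(encode_answers([10, 11, 12, 13, 14], {10: True, 12: False, 14: True}))  # "10201"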

How would I go about comparing two sets of answers? To be able to work out the "similarity" between these sets of answers I'm going to have to write a comparison function. Does anyone know what kind of comparison would best suit this problem? I've looked into sequence alignment and I think this could be the right way to go, but I'm unsure, as it requires the data to be in a long sequence, and the questions aren't related so they don't form a natural sequence.
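
For the comparison itself, one straightforward option (assuming the encoded strings sketched above) is a position-wise weighted score; since answers already line up one-to-one by question, no alignment step is needed. A rough sketch:

def similarity(encoded_a, encoded_b):
    """Position-wise score: +1 for a matching answer, -1 for an opposite answer,
    0 whenever either user left the question unanswered."""
    score = 0
    for a, b in zip(encoded_a, encoded_b):
        if a == "0" or b == "0":      # at least one user didn't answer
            continue
        score += 1 if a == b else -1  # same answer vs. opposite answer
    return score

print(similarity("10201", "10102"))  # (+1) + (skip) + (-1) + (skip) + (-1) = -1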

How do I apply this comparison function to a large set of data? Once I've written the comparison function I could just compare each user's answers to every other user's answers, but this doesn't seem very efficient and probably wouldn't scale well. I've been looking into cluster analysis methods to automatically group users according to similar answers. Do you think this could work, or does anyone know a better method I could look into?
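
One way to avoid the all-pairs comparison is to bucket users first and only run the full comparison within a bucket. A hedged sketch, assuming scikit-learn is available and answers are held as numeric vectors (all names and values here are illustrative):

import numpy as np
from sklearn.cluster import KMeans

# Rows are users, columns are questions;
# values are +1 (true), -1 (false), 0 (not answered).
answer_matrix = np.array([
    [ 1, -1,  0,  1],
    [ 1, -1,  1,  1],
    [-1,  1,  0, -1],
    [-1,  1, -1, -1],
])

# Assign each user to a cluster of broadly similar respondents.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(answer_matrix)

# Only run the expensive pairwise comparison within a cluster.
for cluster in set(labels):
    members = np.where(labels == cluster)[0]
    print(f"cluster {cluster}: users {members.tolist()}")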

I'd really appreciate any helpful pointers. Thanks!

+1  A: 

If you were to set it up in SQL with tables for Users, Questions, and Answers, then I believe the following SQL could be used to get other users with similar responses. Simply add a TOP clause to get the number that you want.

I don't know how performance will be, but that would depend a lot on the size of your data.

SELECT
    U2.userid,
    SUM(CASE
            WHEN A1.answer = A2.answer THEN 1
            WHEN A1.answer <> A2.answer THEN -1
            WHEN A1.answer IS NULL OR A2.answer IS NULL THEN 0  -- A bit redundant, but I like to make it clear
            ELSE 0
        END) AS similarity_score
FROM
    Questions Q
LEFT OUTER JOIN Answers A1 ON
    A1.question_id = Q.question_id AND
    A1.userid = @userid
LEFT OUTER JOIN Answers A2 ON
    A2.question_id = A1.question_id AND
    A2.userid <> A1.userid
LEFT OUTER JOIN Users U2 ON
    U2.userid = A2.userid
GROUP BY
    U2.userid
ORDER BY
    similarity_score DESC
Tom H.
Thanks for the reply. I think this would work great on a smaller data set but wouldn't scale particularly well. If there were 500k users each with 100 answers then I think this would probably grind to a halt. I need something that will continue to work on a large scale and so for this to be feasible I imagine the data would need to be filtered somehow first.
gomezuk
I tried to think of a way to do this with a bitmap and came close, but you would need to be able to calculate the Hamming weight of a value, and since there's no easy and efficient way to do that in a set-based manner, it's a bit of a roadblock.
Tom H.
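
For what it's worth, the bitmap idea is straightforward to express in application code rather than SQL. A hedged sketch in Python (the two-bitmask representation is my own illustration): one mask marks which questions a user answered, the other holds the true/false values; XOR/AND plus a popcount give the same +1/-1 score.

def bitmap_score(answered_a, values_a, answered_b, values_b):
    """Bitmask comparison: +1 per matching answer, -1 per opposite answer.

    answered_*: bit i is set if the user answered question i.
    values_*:   bit i is set if the user answered question i as true.
    """
    both = answered_a & answered_b           # questions both users answered
    disagree = (values_a ^ values_b) & both  # answered, but differently
    agree = both & ~disagree                 # answered the same way
    return bin(agree).count("1") - bin(disagree).count("1")  # popcounts

# Questions 0-3: user A answered 0,1,2 as T,F,T; user B answered 0,1,3 as T,T,F.
print(bitmap_score(0b0111, 0b0101, 0b1011, 0b0011))  # 1 agreement - 1 disagreement = 0
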
+1  A: 

Data Storage: I would say a database is a good idea (it sounds like the potential for a rather large data set). I don't know how many questions you plan on having, but to simplify the analysis (including your SQL queries) a bit, you may want to group answers to similar questions in separate tables. And I would agree that using a numerical value (byte 0-2) would be a good route to take instead of a boolean or something else. You are computing a similarity score, so you might as well start with numbers.

Comparison: As far as the comparison itself, I would suggest creating a class SimilarQuestionAnswers that contains a list of bytes, and a class UserAnswers that contains a list of these SimilarQuestionAnswers. This sets up your clusters for the cluster analysis method you mentioned, and it lets you add weight to certain clusters (cluster A is an important cluster, so its score is multiplied by 20, whereas cluster B is not as important, so its score is only multiplied by 10). It also allows you to apply different comparisons for each cluster if that is needed.
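
A rough sketch of that structure in Python (the class and function names, weights, and data are my own illustration, adapting the SimilarQuestionAnswers/UserAnswers idea): each cluster of similar questions carries a weight, and a user-to-user comparison is the weighted sum of per-cluster scores.

class QuestionCluster:
    """A group of related questions with a weight applied to its score."""
    def __init__(self, name, weight):
        self.name = name
        self.weight = weight

def weighted_compare(clusters, answers_a, answers_b):
    """answers_*: dict mapping cluster name -> list of 0/1/2 answer codes."""
    total = 0
    for cluster in clusters:
        score = 0
        for a, b in zip(answers_a[cluster.name], answers_b[cluster.name]):
            if a == 0 or b == 0:          # unanswered by either user
                continue
            score += 1 if a == b else -1  # same vs. opposite answer
        total += cluster.weight * score   # important clusters count for more
    return total

clusters = [QuestionCluster("important", 20), QuestionCluster("minor", 10)]
user_a = {"important": [1, 2, 1], "minor": [1, 0]}
user_b = {"important": [1, 2, 2], "minor": [2, 1]}
print(weighted_compare(clusters, user_a, user_b))  # 20*(1+1-1) + 10*(-1) = 10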

I know you said the questions aren't related, but you can still at least group questions by their importance. I think the sequence analysis would still work; granted, your similarity matrix would be all 1s, which simplifies the problem a bit, but the rest of the math associated with it should be sufficient.

Comparison Applied: This is where having the database back end comes in handy. Use SQL queries to minimize the dataset you are dealing with. If you are comparing one person with everyone else, you can use SQL's SUM on their answers to get a quick and dirty comparison within each cluster and pull only those within a certain threshold. This may result in some overlap, but that can be eliminated easily.

Another thought is to also have a table with each user and a column for each cluster, holding a comparison to a fake user that has answered true to every question. Then you could just query that table for a range around the current user's scores for each cluster. This may be faster but less accurate.
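
A hedged sketch of that rough cut (the names, scores, and window size are all illustrative): precompute each user's score against a fixed control user, then only run the full comparison against users whose precomputed scores fall within a window of the current user's.

def prefilter_candidates(control_scores, target_user, window=5):
    """control_scores: dict of user id -> precomputed score vs. a fixed control user.

    Returns the user ids whose control score is within `window` of the target's,
    i.e. the reduced set that still needs the full pairwise comparison.
    """
    target_score = control_scores[target_user]
    return [uid for uid, s in control_scores.items()
            if uid != target_user and abs(s - target_score) <= window]

scores = {"alice": 42, "bob": 44, "carol": -3, "dave": 40}
print(prefilter_candidates(scores, "alice"))  # ['bob', 'dave'] - carol is ruled out early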

Either way, in the end you will still need to do the comparison against each of the users you get from that query, so the faster you can make that comparison the better. Try to stick to a formula that involves only +, -, *, /; most of the Math.Whatever() methods can add a lot of time over a large number of calls.

Sorry this was so long; most of the questions were pretty open-ended and I had to assume a few details. I hope this helps.

Jack
Thanks, some really useful ideas in there. I think there's potential in the idea of using a "fake user" or "control user" as a way of quickly comparing distance (similarity). However, two users might have the same value of d (distance from the control) yet have answered very differently. I think you might need to compare every user individually in order to build up a true comparison.
gomezuk
I agree that you still need to do a final comparison; I only meant for the control-user comparison to be a rough cut that makes the dataset you run the final comparison on smaller and more manageable. I assume no one user is really going to look at all n comparisons, probably just the top 5% if that.
Jack
+1  A: 

I would think you might want a per-question weight based on how all users responded. As an extreme case, if 1,000 people answered questions A & B, and the results were A (2Y, 998N) and B (500Y, 500N), the two Ys for A count for much more than any given pair of Ys from B. And any matching pair from B is somewhat more similar than any pair of Ns from A.

Check out Bayesian Probability
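
One hedged way to put numbers on that (the inverse-frequency scheme below is my own illustration, not something prescribed here): weight each agreement by how rare the shared answer is across all users.

import math

def rarity_weight(yes_count, total_count, answered_yes):
    """Weight a matching answer by how improbable it is: -log2 of its frequency."""
    p = (yes_count if answered_yes else total_count - yes_count) / total_count
    return -math.log2(p)

# Question A: 2 of 1000 said yes, so a shared "yes" is worth far more...
print(rarity_weight(2, 1000, answered_yes=True))    # ~8.97
# ...than a shared "yes" on question B, where 500 of 1000 said yes.
print(rarity_weight(500, 1000, answered_yes=True))  # 1.0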

Carl Manaster
I think you're absolutely right. In other words, for any given comparison of two users' answers, the less probable the matching answers are, the higher the similarity score should be. I could store a weight with each answer in the database that gets updated whenever the question is answered.
gomezuk
+1  A: 

Rather than cluster the users, you might also consider clustering the questions (e.g. OkCupid). Then instead of comparing users on all answers, you compare them on the categories.
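
A small sketch of what that might look like (the category names and scoring are invented for illustration): collapse each user's answers into a per-category score, then compare the much shorter category vectors instead of the raw answers.

def category_profile(answers, question_category):
    """Collapse raw answers (+1 true, -1 false) into per-category totals."""
    profile = {}
    for qid, value in answers.items():
        cat = question_category[qid]
        profile[cat] = profile.get(cat, 0) + value
    return profile

def profile_distance(p1, p2):
    """Manhattan distance between two category profiles (smaller = more similar)."""
    return sum(abs(p1.get(c, 0) - p2.get(c, 0)) for c in set(p1) | set(p2))

question_category = {1: "ethics", 2: "ethics", 3: "lifestyle"}
alice = category_profile({1: 1, 2: 1, 3: -1}, question_category)   # {'ethics': 2, 'lifestyle': -1}
bob   = category_profile({1: 1, 2: -1, 3: -1}, question_category)  # {'ethics': 0, 'lifestyle': -1}
print(profile_distance(alice, bob))  # 2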

Justin K
Could you explain a bit more about what you mean? I've had a look at OKCupid and it's very similar to what I'm planning to do. Do you know which classification system they use?
gomezuk
I imagine they manually classified questions by topic when the site was small and now have some automated way of doing it, but I don't have any inside knowledge.
Justin K