I've embarked on a project that is proving considerably more complicated than I'd first imagined. I'm trying to plan a system based around boolean (true/false) questions and answers. Users can answer any questions from a large set and be presented with a list of the most similar users (in order of similarity) based on their answers.

I've Googled far and wide but still haven't come up with much, so I was hoping somebody could point me in the right direction. I'd like to know:

What is the best data structure and method to store this kind of data? I'd originally assumed I could create two tables, "questions" and "answers", in an SQL database. However, I'm now wondering if it would be simpler to compare two sets of answers if they were both stored as numerical strings, e.g. 0 = not answered, 1 = true, 2 = false. When comparing the strings, weights could be applied: "not answered" = 0, "same answer" = 1, "opposite answer" = -1, producing a similarity score.
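
A minimal sketch of that encoding in Python (the function and variable names are my own, just for illustration): each user's answers become a fixed-length string indexed by question position.

# Hypothetical sketch: encode one user's answers as a fixed-length numeric string.
# question_ids is the full, ordered list of question ids; answers maps
# question_id -> True/False for the questions this user actually answered.
def encode_answers(question_ids, answers):
    """Return a string of '0' (unanswered), '1' (true), '2' (false)."""
    digits = []
    for qid in question_ids:
        if qid not in answers:
            digits.append("0")
        elif answers[qid]:
            digits.append("1")
        else:
            digits.append("2")
    return "".join(digits)

# Example: five questions, of which the user answered three.
print(encode_answers([10, 11, 12, 13, 14], {10: True, 12: False, 14: True}))  # "10201"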

How would I go about comparing two sets of answers? To be able to work out the "similarity" between these sets of answers I'm going to have to write a comparison function. Does anyone know what kind of comparison would best suit this problem? I've looked into sequence alignment and I think this could be the right way to go, but I'm unsure, as it requires the data to be in a long sequence, and the questions aren't related so they don't form a natural sequence.
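
For the comparison itself, one straightforward option (assuming the encoded strings sketched above) is a position-wise weighted score; since answers already line up one-to-one by question, no alignment step is needed. A rough sketch:

def similarity(encoded_a, encoded_b):
    """Position-wise score: +1 for a matching answer, -1 for an opposite answer,
    0 whenever either user left the question unanswered."""
    score = 0
    for a, b in zip(encoded_a, encoded_b):
        if a == "0" or b == "0":      # at least one user didn't answer
            continue
        score += 1 if a == b else -1  # same answer vs. opposite answer
    return score

print(similarity("10201", "10102"))  # (+1) + (skip) + (-1) + (skip) + (-1) = -1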

How do I apply this comparison function to a large set of data? Once I've written the comparison function I could just compare each user's answers to every other user's answers, but this doesn't seem very efficient and probably wouldn't scale well. I've been looking into cluster analysis methods to automatically group users according to similar answers. Do you think this could work, or does anyone know a better method I could look into?
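
One way to avoid the all-pairs comparison is to bucket users first and only run the full comparison within a bucket. A hedged sketch, assuming scikit-learn is available and answers are held as numeric vectors (all names and values here are illustrative):

import numpy as np
from sklearn.cluster import KMeans

# Rows are users, columns are questions;
# values are +1 (true), -1 (false), 0 (not answered).
answer_matrix = np.array([
    [ 1, -1,  0,  1],
    [ 1, -1,  1,  1],
    [-1,  1,  0, -1],
    [-1,  1, -1, -1],
])

# Assign each user to a cluster of broadly similar respondents.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(answer_matrix)

# Only run the expensive pairwise comparison within a cluster.
for cluster in set(labels):
    members = np.where(labels == cluster)[0]
    print(f"cluster {cluster}: users {members.tolist()}")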

I'd really appreciate any helpful pointers. Thanks!

+1  A: 

If you were to set it up in SQL with tables for Users, Questions, and Answers, then I believe the following SQL could be used to get other users with similar responses. Simply add a TOP clause to get the number that you want.

I don't know how performance will be, but that would depend a lot on the size of your data.

SELECT
    U2.userid,
    SUM(CASE
            WHEN A1.answer = A2.answer THEN 1
            WHEN A1.answer <> A2.answer THEN -1
            WHEN A1.answer IS NULL OR A2.answer IS NULL THEN 0  -- A bit redundant, but I like to make it clear
            ELSE 0
        END) AS similarity_score
FROM
    Questions Q
LEFT OUTER JOIN Answers A1 ON
    A1.question_id = Q.question_id AND
    A1.userid = @userid
LEFT OUTER JOIN Answers A2 ON
    A2.question_id = A1.question_id AND
    A2.userid <> A1.userid
LEFT OUTER JOIN Users U2 ON
    U2.userid = A2.userid
GROUP BY
    U2.userid
ORDER BY
    similarity_score DESC
Tom H.
Thanks for the reply. I think this would work great on a smaller data set but wouldn't scale particularly well. If there were 500k users each with 100 answers then I think this would probably grind to a halt. I need something that will continue to work on a large scale and so for this to be feasible I imagine the data would need to be filtered somehow first.
gomezuk
I tried to think of a way to do this with a bitmap and came close, but you would need to be able to calculate the Hamming weight of a value, and since there's no easy and efficient way to do that in a set-based manner, it's a bit of a roadblock.
Tom H.
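
For what it's worth, the bitmap idea is straightforward to express in application code rather than SQL. A hedged sketch in Python (the two-bitmask representation is my own illustration): one mask marks which questions a user answered, the other holds the true/false values; XOR/AND plus a popcount give the same +1/-1 score.

def bitmap_score(answered_a, values_a, answered_b, values_b):
    """Bitmask comparison: +1 per matching answer, -1 per opposite answer.

    answered_*: bit i is set if the user answered question i.
    values_*:   bit i is set if the user answered question i as true.
    """
    both = answered_a & answered_b           # questions both users answered
    disagree = (values_a ^ values_b) & both  # answered, but differently
    agree = both & ~disagree                 # answered the same way
    return bin(agree).count("1") - bin(disagree).count("1")  # popcounts

# Questions 0-3: user A answered 0,1,2 as T,F,T; user B answered 0,1,3 as T,T,F.
print(bitmap_score(0b0111, 0b0101, 0b1011, 0b0011))  # 1 agreement - 1 disagreement = 0
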
+1  A: 

Data Storage: I would say a database is a good idea (it sounds like the potential for a rather large data set). I don't know how many questions you plan on having, but to simplify the analysis (including your SQL queries) a bit, you may want to group answers to similar questions in separate tables. And I would agree that using a numerical value (byte 0-2) would be a good route to take instead of a boolean or something else. You are computing a similarity score, so you might as well start with numbers.

Comparison: As far as the comparison itself, I would suggest creating a class SimilarQuestionAnswers that contains a list of bytes, and a class UserAnswers that contains a list of these SimilarQuestionAnswers. This sets up your clusters for the cluster analysis method you mentioned, and it lets you add weight to certain clusters (cluster A is an important cluster, so its score is multiplied by 20, whereas cluster B is not as important, so its score is only multiplied by 10). It also allows you to apply different comparisons for each cluster if that is needed.
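
A rough sketch of that structure in Python (the class and function names, weights, and data are my own illustration, adapting the SimilarQuestionAnswers/UserAnswers idea): each cluster of similar questions carries a weight, and a user-to-user comparison is the weighted sum of per-cluster scores.

class QuestionCluster:
    """A group of related questions with a weight applied to its score."""
    def __init__(self, name, weight):
        self.name = name
        self.weight = weight

def weighted_compare(clusters, answers_a, answers_b):
    """answers_*: dict mapping cluster name -> list of 0/1/2 answer codes."""
    total = 0
    for cluster in clusters:
        score = 0
        for a, b in zip(answers_a[cluster.name], answers_b[cluster.name]):
            if a == 0 or b == 0:          # unanswered by either user
                continue
            score += 1 if a == b else -1  # same vs. opposite answer
        total += cluster.weight * score   # important clusters count for more
    return total

clusters = [QuestionCluster("important", 20), QuestionCluster("minor", 10)]
user_a = {"important": [1, 2, 1], "minor": [1, 0]}
user_b = {"important": [1, 2, 2], "minor": [2, 1]}
print(weighted_compare(clusters, user_a, user_b))  # 20*(1+1-1) + 10*(-1) = 10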

I know you said the questions aren't related, but you can still at least group questions by their importance. I think the sequence analysis would still work; granted, your similarity matrix would be all 1s, which simplifies the problem a bit, but the rest of the math associated with it should be sufficient.

Comparison Applied: This is where having the database back end comes in handy. Use SQL queries to minimize the dataset you are dealing with. If you are comparing one person with everyone else, you can use SQL's SUM on their answers to get a quick and dirty comparison within each cluster and pull only those within a certain threshold. This may result in some overlap, but that can be eliminated easily.

Another thought is to also have a table with each user and a column for each cluster, holding a comparison to a fake user that has answered true to every question. Then you could just query that table for a range around the current user's scores for each cluster. This may be faster but less accurate.
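
A hedged sketch of that rough cut (the names, scores, and window size are all illustrative): precompute each user's score against a fixed control user, then only run the full comparison against users whose precomputed scores fall within a window of the current user's.

def prefilter_candidates(control_scores, target_user, window=5):
    """control_scores: dict of user id -> precomputed score vs. a fixed control user.

    Returns the user ids whose control score is within `window` of the target's,
    i.e. the reduced set that still needs the full pairwise comparison.
    """
    target_score = control_scores[target_user]
    return [uid for uid, s in control_scores.items()
            if uid != target_user and abs(s - target_score) <= window]

scores = {"alice": 42, "bob": 44, "carol": -3, "dave": 40}
print(prefilter_candidates(scores, "alice"))  # ['bob', 'dave'] - carol is ruled out early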

Either way, in the end you will still need to do the comparison against each of the users you get from that query, so the faster you can make that comparison the better. Try to stick to a formula that involves only +, -, *, /; most of the Math.Whatever() methods can add a lot of time over a large number of calls.

Sorry this was so long; most of the questions were pretty open-ended and I had to assume a few details. I hope this helps.

Jack
Thanks, some really useful ideas in there. I think there's potential in the idea of using a "fake user" or "control user" as a way of quickly comparing distance (similarity). However, two users might have the same value of d (distance from the control) yet have answered very differently. I think you might need to compare every user individually in order to build up a true comparison.
gomezuk
I agree that you still need to do a final comparison; I only meant for the control-user comparison to be a rough cut that makes the dataset you run the final comparison on smaller and more manageable. I assume no one user is really going to look at all n comparisons, probably just the top 5% if that.
Jack
+1  A: 

I would think you might want a per-question weight based on how all users responded. As an extreme case, if 1,000 people answered questions A & B, and the results were A (2Y, 998N) and B (500Y, 500N), the two Ys for A count for much more than any given pair of Ys from B. And any matching pair from B is somewhat more similar than any pair of Ns from A.

Check out Bayesian Probability
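
One hedged way to put numbers on that (the inverse-frequency scheme below is my own illustration, not something prescribed here): weight each agreement by how rare the shared answer is across all users.

import math

def rarity_weight(yes_count, total_count, answered_yes):
    """Weight a matching answer by how improbable it is: -log2 of its frequency."""
    p = (yes_count if answered_yes else total_count - yes_count) / total_count
    return -math.log2(p)

# Question A: 2 of 1000 said yes, so a shared "yes" is worth far more...
print(rarity_weight(2, 1000, answered_yes=True))    # ~8.97
# ...than a shared "yes" on question B, where 500 of 1000 said yes.
print(rarity_weight(500, 1000, answered_yes=True))  # 1.0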

Carl Manaster
I think you're absolutely right. In other words, for any given comparison of two users' answers, the less probable the matching answers are, the higher the similarity score should be. I could store a weight with each answer in the database that gets updated whenever the question is answered.
gomezuk
+1  A: 

Rather than cluster the users, you might also consider clustering the questions (e.g. OkCupid). Then instead of comparing users on all answers, you compare them on the categories.
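
A small sketch of what that might look like (the category names and scoring are invented for illustration): collapse each user's answers into a per-category score, then compare the much shorter category vectors instead of the raw answers.

def category_profile(answers, question_category):
    """Collapse raw answers (+1 true, -1 false) into per-category totals."""
    profile = {}
    for qid, value in answers.items():
        cat = question_category[qid]
        profile[cat] = profile.get(cat, 0) + value
    return profile

def profile_distance(p1, p2):
    """Manhattan distance between two category profiles (smaller = more similar)."""
    return sum(abs(p1.get(c, 0) - p2.get(c, 0)) for c in set(p1) | set(p2))

question_category = {1: "ethics", 2: "ethics", 3: "lifestyle"}
alice = category_profile({1: 1, 2: 1, 3: -1}, question_category)   # {'ethics': 2, 'lifestyle': -1}
bob   = category_profile({1: 1, 2: -1, 3: -1}, question_category)  # {'ethics': 0, 'lifestyle': -1}
print(profile_distance(alice, bob))  # 2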

Justin K
Could you explain a bit more about what you mean? I've had a look at OKCupid and it's very similar to what I'm planning to do. Do you know which classification system they use?
gomezuk
I imagine they manually classified questions by topic when the site was small and now have some automated way of doing it, but I don't have any inside knowledge.
Justin K