views:

190

answers:

3

(This is no homework and no work issue. It's just my personal interest/occupation and completly fictional. But I am interested in a good algorithm or data structure.)

Let's assume, that I would run a dating site. And my special feature would be that the singles were matched by movie taste. (Why not?)

In that case I would need a way to store the movie ratings for each user. (So far no problem.) And I would need a data structure to find the best fitting user. The distance between two taste patterns would be the average distance between all ratings that both users made.

Example

movies   A B C D E F G H I J K L M ...
user Xm  9 5   1   1   5
user Ym      4 6 1         8
user Zf  9   6 4           7

Distance(X,Z) = avg( abs(9-9) + abs(1-4) ) = 1.5

Distance(Y,Z) = avg( abs(4-6) + abs(6-4) + abs(8-7) ) = 1.666

So Mr. X fits slightly better to Mrs. Z, than Mr. Y does.

I like soulution that ...

  • ... don't need many operations on the database
  • ... don't need to handle a lot of data
  • ... run fast
  • ... deliver the best matching
  • Ok, maybe I would consider good approximations too.

Try to keep in mind that this should also work with thousands of possible movies, users that rate only about 20-50 movies, and thousands of users.

(Because this is a mental puzzle and not a real problem, work-arrounds are not really helping.)

What would be your search algorithm or data structure?

A: 
CREATE TABLE data (user INTEGER, movie INTEGER, rate INTEGER);

SELECT  other.user, AVG(ABS(d1.rate - d2.rate)) AS distance
FROM    data me, data other
WHERE   me.user = :user
    AND other.user <> me.user
    AND other.movie = me.movie
GROUP BY
    other.user
ORDER BY
    distance

Complexity will be O(n1.5)) rather than O(n2), as there will be n comparisons to sqrt(n) movies (average of movies filled together by each pair).

Quassnoi
This looks a bit too easy. I think the complexity of this query would be about O(n²). Except the database does some magic, that I am asking for - right here. :) Thank you anyay for your suggestion. (small fix: "AND other.user <> me.user")
MrFox
There are three parameters, the number of people (say n), the number of movies (say m) and the probability that a person fills out a particular movie (say p). This algo is then O(n * m * p^2) expected time, assuming movies are filled out independently (or higher if some movies are more popular).
j_random_hacker
+3  A: 

Looks like you are looking for the nearest neighbor in the movie space. And your distance function is the L1 metric. You can probably use a spatial index of some kind. Maybe you can use techniques from collaborative filtering.

Yuval F
You are right. This is a kind of nearest neighbor with some L1 distance function. And I have already considered a spatial index like z-order or oct-tree. But this would imply a large table (thought) with almost empty cells. Any spacial index would perform bad on such an empty table.
MrFox
+3  A: 

Sounds a lot like the Netflix Prize challenge, more specifically the first half of the most popular approach. The possible implementations of what you are trying to do are numerous and varied. None of them are exceptionally efficient, and the L1 metric is not a particularly good option for reliable correlations.

Sparr