views:

194

answers:

2

I have a real question.

I have a database with the schema as follows:

item

  • id
  • description
  • other junk

tag

  • id
  • name

item2tag

  • item_id
  • tag_id
  • count

Basically, each item is tagged as up to 10 things, with varying counts. There are 50,000 items and 50,000 tags, and about 500,000 entries in items2tag. I'd like to find, given one item, the "most similar" item.

By "most similar" I mean the item that has the most similar combination of tags... if something is "cool" twice as much as it is "funny," I want to find all other things that are almost "cool" twice as much as they are "funny." Of course, this should apply to 10 tags, not just 2.

Any ideas?

+1  A: 

Well, you can look at linear algebra to give a n dimensional vector to each item, and then compute the distance between items to find the closest items, but that's pretty complex with even small data sets.

Which is why Google came up with Map Reduce. This will probably be your best bet, but even then it's non-trivial.

Adam Davis
Can MySQL compute distances between points easily?
A: 

Given your representation of item-tag relationship as vectors, What you have is an instance of nearest-neighbor search. You may find pointers in the field of Collaborative Filtering.

Yuval F