ansaurus

Question

Euclidean distance between posts based on tags

Answer 1

+1 A:

Okay, first off, your code looks incomplete: I see only one return from your function. I think you mean something like this:

def sim_distance(prefs, person1, person2):
    # Get the set of items shared by both entries
    p1, p2 = prefs[person1], prefs[person2]
    si = set(p1).intersection(set(p2))

    # Add up the squares of all the differences
    matches = (p1[item] - p2[item] for item in si)
    return sum(a * a for a in matches)

Next, your post needs a bit of editing for clarity. I don't know what this means: "this becomes 0 cause tags don't have weight same tags has ranking 1."

Lastly, it would help if you provided sample data for prefs[person1] and prefs[person2]. Then you could tell us what you are getting and what you expect to get.

Edit: based on my comment below, I would use code like this:

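def sim_distance(prefs, person1, person2):
    p1, p2 = prefs[person1], prefs[person2]
    s, t = set(p1), set(p2)
    union = s.union(t)
    # float() avoids Python 2 integer division; an empty union counts as no similarity
    return len(s.intersection(t)) / float(len(union)) if union else 0.0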

hughdbrown 2009-12-09 23:48:59

What I meant is this: assume two posts share one tag (tag1) as their only common tag. Then in (p1[item] - p2[item] for item in si), every difference will be 0, no? Tags are either 0 or 1, and in the shared case they are all 1, so 1 - 1 is 0.

Hamza Yerlikaya 2009-12-10 01:13:50

The Euclidean distance code is intended to calculate similarity between two things that share a numerical measure. You are applying it to something that has no numerical measure. I would use a variation on Aziz's idea: I'd compare the count of identical elements to the count of unique elements across both sets.
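To make the problem concrete, here is a minimal sketch showing why the sum of squared differences is always 0 for tags. The post names and the weight of 1 per tag are made-up sample data, not taken from the question:

```python
def sim_distance(prefs, person1, person2):
    # Sum of squared differences over the items both entries share
    p1, p2 = prefs[person1], prefs[person2]
    shared = set(p1) & set(p2)
    return sum((p1[item] - p2[item]) ** 2 for item in shared)

# Hypothetical data: every tag a post carries has weight 1
prefs = {
    "post_a": {"tag1": 1, "tag2": 1},
    "post_b": {"tag1": 1, "tag3": 1},  # shares only tag1 with post_a
    "post_c": {"tag1": 1, "tag2": 1},  # same tags as post_a
}

print(sim_distance(prefs, "post_a", "post_b"))  # 0
print(sim_distance(prefs, "post_a", "post_c"))  # 0
```

A pair with one shared tag and a pair with identical tags get the same score, so the measure cannot tell them apart.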

hughdbrown 2009-12-10 04:11:54

Answer 2

+1 A:

Basically, tags don't have weights and can't be represented by numerical values. So you can't define a distance between two tags.

If you want to find the similarity between two posts using their tags, I would suggest that you use the ratio of shared tags. For example, if you have

url1 -> tag1 tag2 tag3 tag4
url2 -> tag1 tag4 tag5 tag6

then the two posts share 2 tags, giving 2 (shared tags) / 4 (tags per post) = 0.5. I think this is a good measure of similarity, as long as you have more than 2 tags per post.
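The arithmetic for the example can be sketched in Python as follows. The function names are hypothetical, and dividing by the larger tag count is an assumption for posts with unequal numbers of tags, since the example only covers posts with 4 tags each. For comparison, the second function divides by the number of distinct tags across both posts, as the function in Answer 1 does, which gives a lower value for the same data:

```python
def shared_tag_ratio(tags_a, tags_b):
    # Shared tags over the larger tag count (assumed denominator)
    a, b = set(tags_a), set(tags_b)
    if not a or not b:
        return 0.0
    return len(a & b) / float(max(len(a), len(b)))

def jaccard_ratio(tags_a, tags_b):
    # Shared tags over all distinct tags in either post
    a, b = set(tags_a), set(tags_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / float(len(union))

url1 = ["tag1", "tag2", "tag3", "tag4"]
url2 = ["tag1", "tag4", "tag5", "tag6"]

print(shared_tag_ratio(url1, url2))  # 0.5, i.e. 2 / 4
print(jaccard_ratio(url1, url2))     # 2 / 6, roughly 0.33
```

Either ratio runs from 0 (no tags in common) to 1 (identical tag sets), so both can be used to rank posts by similarity.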

Aziz 2009-12-09 23:50:52
