views:

357

answers:

4

I am playing with the following code from programming collective intelligence, this is a function from the book that calculated eclidian distance between two movie critics.

This function sums the difference of the rankings in the dictionary, but euclidean distance in n dimensions also includes the square root of that sum.

AFAIK since we use the same function to rank everyone it does not matter we square root or not, but i was wondering is there a particular reason for that?


from math import sqrt 
# Returns a distance-based similarity score for person1 and person2 
def sim_distance(prefs,person1,person2): 
  # Get the list of shared_items 
  si={} 
  for item in prefs[person1]: 
    if item in prefs[person2]: 
       si[item]=1 
  # if they have no ratings in common, return 0 
  if len(si)==0: return 0 
  # Add up the squares of all the differences 
  sum_of_squares=sum([pow(prefs[person1][item]-prefs[person2][item],2) 
                      for item in prefs[person1] if item in prefs[person2]]) 
  return 1/(1+sum_of_squares) 
+1  A: 

Just for intercomparisons the square root is not necessary and you would get the squared euclidean distance... which is also a distance (mathematically speaking, see http://en.wikipedia.org/wiki/Metric%5F%28mathematics%29).

Pablo Rodriguez
+3  A: 

That's correct. While the square root is necessary for a quantitatively correct result, if all you care about is distance relative to others for sorting, then taking the square root is superfluous.

Milan Ramaiya
+7  A: 

The reason the square root is not used is because it is computationally expensive; it is monotonic (i.e., it preserves order) with the square function, so if all you're interested in is the order of the distances, the square root is unnecessary (and, as mentioned, very expensive computationally).

McWafflestix
+1  A: 

To compute a Cartesian distance, first you must compute the distance-squared, then you take its square root. But computing a square root is computationally expensive. If all you're really interested in is comparing distances, it works just as well to compare the distance-squared--and it's much faster.

For every two real numbers A and B, where A and B are >= zero, it's always true that A-squared and B-squared have the same relationship as A and B:

  • if A < B, then A-squared < B-squared.
  • if A == B, then A-squared == B-squared.
  • if A > B, then A-squared > B-squared.

Since distances are always >= 0 this relationship means comparing distance-squared gives you the same answer as comparing distance.

Larry Hastings