ansaurus

Question

Suppose I have 2 vectors. What algorithms can I use to compare them?

Answer 1

+2 A:

Martin Beckett 2009-11-26 22:44:21

Answer 2

A:

You could use the set_intersection algorithm. The 2 vectors must be sorted first (use sort call), then pass in 4 iterators and you'll get a collection back with the common elements inserted into it. There are a few others that operate similarly.

gbjbaanb 2009-11-26 22:48:09

set_intersection will not work as it will not pick up "similar" terms. Also, alex said that positioning and ordering matter. Therefore sorting the lists will be a BAD idea

inspectorG4dget 2009-11-26 23:21:26

Answer 3

+3 A:

I would suggest you a book called Programming Collective Intelligence.
It's a very nice book on how you can retrieve information from simple data like this one. There are code examples included (in Python :)

Edit: Just replying to gbjbaanb: This is Python!

a = ['books','video','photography','food','toothpaste','burgers']
b = ['video','processor','photography','LCD','power supply', 'books']
a = set(a)
b = set(b)

a.intersection(b)
    set(['photography', 'books', 'video'])

b.intersection(a)
    set(['photography', 'books', 'video'])

b.difference(a)
    set(['LCD', 'power supply', 'processor'])

a.difference(b)
    set(['food', 'toothpaste', 'burgers'])

ikkebr 2009-11-26 22:48:30

Thanks for the set/intersections. However, I'm sure some "Math" and "statistics" can be applied to them...with some numbers, right?

TIMEX 2009-11-26 22:55:14

And thanks for the book recommendation. I just purchased it :)

TIMEX 2009-11-26 22:55:45

Answer 4

A:

As mbg mentioned, the hamming distance is a good start. It's basically assigning a bitmask for every possible item whether it is contained in the companies value.

Eg. toothpaste is 1 for company A, but 0 for company B. You then count the bits which differ between the companies. The Jaccard coefficient is related to this.

Hamming distance will actually not be able to capture similarity between things like "video" and "photography". Obviously, a company that sells one does sell the other also with higher probability than a company that sells toothpaste.

For this, you can use stuff like LSI (it's also used for dimensionality reduction) or factorial codes (e.g. neural network stuff as Restricted Boltzman Machines, Autoencoders or Predictablity Minimization) to get more compact representations which you can then compare using the euclidean distance.

bayer 2009-11-26 22:59:01

Answer 5

+3 A:

Is position is very important, as you emphasize, then the crucial metric will be based on the difference of positions between the same items in the different vectors (you can, for example, sum the absolute values of the differences, or their squares). The big issue that needs to be solved is -- how much to weigh an item that's present (say it's the N-th one) in one vector, and completely absent in the other. Is that a relatively minor issue -- as if the missing item was actually present right after the actual ones, for example -- or a really, really big deal? That's impossible to say without more understanding of the actual application area. You can try various ways to deal with this issue and see what results they give on example cases you care about!

For example, suppose "a missing item is roughly the same as if it were present, right after the actual ones". Then, you can preprocess each input vector into a dict mapping item to position (crucial optimization if you have to compare many pairs of input vectors!):

def makedict(avector):
  return dict((item, i) for i, item in enumerate(avector))

and then, to compare two such dicts:

def comparedicts(d1, d2):
  allitems = set(d1) | set(d2)      
  distances = [d1.get(x, len(d1)) - d2.get(x, len(d2)) for x in allitems]
  return sum(d * d for d in distances)

(or, abs(d) instead of the squaring in the last statement). To make missing items weigh more (make dicts, i.e. vectors, be considered further away), you could use twice the lengths instead of just the lengths, or some large constant such as 100, in an otherwise similarly structured program.

Alex Martelli 2009-11-27 00:18:47

Alex Martelli-I know you are a well known author in the Python world. Do you know any good libraries that can deal with "collective data" like this? (besides NLTK, numpy)

TIMEX 2009-11-27 09:58:02

@alex, besides what's already been mentioned, an important resource is R -- it's not a library, it's a language of its own (specialized in advanced statistical manipulation), but there's a decent "bridge" to interface it to/from Python.

Alex Martelli 2009-11-27 16:22:12

Answer 6

A:

pick the rank of each entry (higher rank is better) and make sum of geometric means between matches

for two vectors

sum(sqrt(vector_multiply(x,y)))  //multiply matches

Sum of ranks for each value over vector should be same for each vector (preferrebly 1) That way you can make compares between more than 2 vectors.

If you apply ikkebr's metod you can find how a is simmilar to b

in that case just use

sum( b( b.intersection(a) ))

ralu 2009-11-27 00:32:07

ansaurus

tags:

views:

answers:

Suppose I have 2 vectors. What algorithms can I use to compare them?

related questions