views:

222

answers:

7

I have a large collection of objects and I need to figure out the similarities between them.

To be exact: given two objects I can compute their dissimilarity as a number, a metric - higher values mean less similarity and 0 means the objects have identical contents. The cost of computing this number is proportional to the size of the smaller object (each object has a given size).

I need the ability to quickly find, given an object, the set of objects similar to it.

To be exact: I need to produce a data structure that maps any object o to the set of objects no more dissimilar to o than d, for some dissimilarity value d, such that listing the objects in the set takes no more time than if they were in an array or linked list (and perhaps they actually are). Typically, the set will be very much smaller than the total number of objects, so it is really worthwhile to perform this computation. It's good enough if the data structure assumes a fixed d, but if it works for an arbitrary d, even better.

Have you seen this problem before, or something similar to it? What is a good solution?

To be exact: a straightforward solution involves computing the dissimilarities between all pairs of objects, but this is slow - O(n2) where n is the number of objects. Is there a general solution with lower complexity?

+1  A: 

If your similarity measure is transitive, you don't have to compute the similarity for all pairs of objects since for objects a, b, c:

similarity(a,c) = similarity(a,b) op similarity(b,c)

where op is a binary operator e.g. multiplication or addition.

Sebastian
The op will have to clarify, but when he said "metric" I was thinking http://en.wikipedia.org/wiki/Metric_%28mathematics%29which is, in general, not transitive because of the triangle inequality.
Dan Hook
As per stated, (Objects, similarity) is a metric space, so all you can say about similarity is similarity(a,c) <= (similarity(a,b) + similarity(b,c))
Tordek
@Dan: yes, my "metric" is in fact a link to the same URL.
reinierpost
@Dan: ... er ... well it was in the edit box, but not in the text, due to some crazy bug - fixed
reinierpost
A: 

Can we assume that similarity is transitive, ie. diff(a,c) == diff(a,b) + diff(b,c)? If so, you can try the following:

  1. Sort the collection of objects. If the object similarity metric doesn't have a decent absolute value, you can arbitrarily select one object as "zero" and sort all other objects by their similarity to that object.
  2. To find the objects with similarity s to o, find o in the sorted list, and search to the left and to the right until the diff grows larger than s.

The advantage of this is that the sorting can be done once, and subsequent set building is proportional to the number of members that will be in the set.

JSBangs
No. Metrics aren't transitive.
Tordek
It isn't transitive. Consider what happens if a and c are identical. Your formula would yield 2*diff(a,b) when the value should be zero.
Jerry Coffin
Whether this work depends on transitivity, and the question doesn't give enough information to say. If the "difference" is, for example, the signed difference in height between pairs of people, then it would be transitive. If it's more like, number of features that two products share selected from a list of relevant features, then it would not be transitive at all.
Jay
@Jay:Yes, the question does provide enough to say that it's not transitive:"given two objects I can compute their dissimilarity as a number, a metric - higher values mean less similarity and 0 means the objects have identical contents."
Jerry Coffin
It's good to ask for any additional properties of the problem, especially when they may be essential to a good solution. But no, my differences can't be added in this way. (Consider what happens if you swap b and c.)
reinierpost
+2  A: 

I need to produce a data structure that maps any object o to the set of objects no more dissimilar to o than d, for some dissimilarity value d.

It might be fastest to just abandon the similarity computation when the subtotal becomes larger than d. For example, if your similarities are based on cosine or hausdorff distances this can easily be done.

 

PS: if this cannot be done, your problem might be related to the k-nearest neighbors problem (or more precise a nearest neighbor problem with a threshold neighborhood). You should look for algorithms that find close-by members without computing all distances (maybe something using triangle inequality). Wikipedia should help you to explore suitable algorithms.

Adrian
I may be missing something, but I don't see how the k-nearest neighbors algorithm applies. It appears to be a classification algorithm that assumes distances are known, not a fast way to calculate those distances.
Dan Hook
There is a class of knn algorithms that find the nearest neighbors *without* computing all pairwise distances. Depends on your metric space though, and how many assumption you can take.
Adrian
@Adrian: please provide a link for clarity
Misha
See for example kd-tree, however whether this is applicable or not depends on the space of OP's problem. So it's good that you asked OP to provide examples.
Adrian
Thanks for mentioning kd-trees and the k-nearest neighbors problem. My problem is not about distances in 3D space or any other space that I can think of.
reinierpost
If you have a metric, you have a space. There are many weird spaces that are nothing like what you would traditionally think of as "space" http://en.wikipedia.org/wiki/Metric_space
Dan Hook
A: 

Without knowing more details of the metric, it's hard to say. I don't have any ideas for eliminating the O(n^2) aspect, but there may be a way to reduce some of the constants involved. For example, if you had a Euclidean metric d(p,q) = sqrt( (p_1-q_1)^2 + ..+ (p_n-q_n)^2), you could square your distance d and compare it to the partial sums of (p_i-q_i)^2 and stop when you exceed d^2.

Whether this will actually save you time depends on how expensive the compare is to just calculating the summands and how many summand calculations you could expect to avoid by doing this (obviously, the smaller d is, the better).

Dan Hook
Good idea. In fact I have some ideas for "approximating" the node values in ways that roughly respect the distance metric while making its computation much faster, and these can be used to speed up the calculation, but I thought the question complex enough as it is.
reinierpost
+1  A: 

I think the solution depends on a lot more detail about the nature of your problem.

  1. Do you need to find the similar objects for the same object many times, or only once? If it's many times, then creating a data structure where you compute the difference once for each pair and then connect objects to similar objects so that you can retrieve the list quickly without recalculation might be a very useful performance enhancement.

  2. What is the nature of the calculation? At one extreme, if the nature of the difference is that it is, for example, the difference in height between two people, then maintaining the list sorted by height would let you find the similar objects very quickly. I'm assuming the real problem is more complicated than that, but following on that logic, if the difference is the sum of several linear quantities, you could create a multi-dimenstional array, and then conceptually imagine the set of similar objects as those within an n-dimensional sphere (i.e. circle, sphere, hypersphere, etc) centered around the reference object, and again find them directly. Actually it occurs to me that if the radius calculations are too complicated or take too much run-time, a good approximation would be to create an n-dimensional cube (i.e. square, cube, tesseract, etc) around the reference object, retrieve all objects which lie within that cube as "candidates", and then just do the actual computation on the candidates.

For example, suppose the "difference" is the sum of the absolute values of the differences of three attributes, say a1, a2, and a3. You could create a 3-dimensional array and set the value of each node of the array to the object with those values, if any. Then if you want to find all objects with difference less than d from object o, you could write:

for (x1=o.a1-d;x1<o.a1+d;++x1)
{
  for (x2=o.a2-d;x1<o.a2+d;++x2)
  {
    for (x3=o.a3-d;x1<o.a3+d;++x3)
    {
      if (array[x1][x2][x3]!=null
        && (abs(x1-o.a1)+abs(x2-o.a2)+abs(x3-o.a3)<=d)
        {
          ... found a match ...
        }
    }
  }
}

I suspect that the difference rules are more complicated than that, but fine, just add sophistication to the alrorithm to match the complexity of the rules. The point is to use the array to limit the set of objects that you have to examine.

  1. Again on the nature of the calculation: If one of the elements making up the difference, or some small subset, tends to be more significant than others, then create a data structure that allows you to quickly compare for this within range. If it is in range, do the full compare. If not, then you don't even look at it.
Jay
@1: Yes, I need to look up the neighbors more than once.@2: Yes, such assumptions would simplify the problem and no, the ones you suggest here do not apply. I will post a followup question with a more specific form of my question.
reinierpost
+1  A: 

Is it not possible to use a kd-tree?

It may be necessary (if possible) to normalize the dimensions. Afterwards, you just need to populate the tree, and use a "nearest N neighbors" search, and try to find any object within some range.

Tordek
kd-tree requires a metric space with axes (and the ability split it up), alas OP did not tell us whether has problem has that property.
Adrian
It doesn't, it's one of the things that makes it hard.
reinierpost
+1  A: 

Example of objects: Images, Documents. Of course working with the raw representation of these objects is mostly not useful. usually one would pre-process the raw form and turn it into some normalized form (for documents, say a vector for which each entry represents the number/percent of times a certain word appeared, for images it could be a representation of visual features found in the image).

if d is fixed and a n^2 pre-computation is feasible, you could just use a graph representation using a linked list for each object for example. You can have more efficient solutions on the expense of accuracy using approximate nearest neighbors algorithms.

liza
This is the best approach I've found thus far. Thanks.
reinierpost