
My program uses clustering to produce subsets of similar items and then uses cosine similarity to measure how similar the clusters are. For instance, if user 1 has 3 clusters and user 2 has 3 clusters, then every cluster of user 1 is compared against every cluster of user 2, producing 9 cosine similarity values, e.g. [0.3, 0.1, 0.4, 0.12, 0.0, 0.6, 0.8, 1.0, 0.22]

My problem is: based on these results, how can I turn these values into a single tangible result that shows how similar the two users are?

A simple method I came up with was to divide each value by the number of comparisons and add them together to get one value (i.e. the plain average), but this is quite a simplistic approach.
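
For reference, that simple approach amounts to an unweighted mean. A minimal sketch using the nine example values above (the variable names are only illustrative):

```python
# The nine pairwise cosine similarities from the example above.
scores = [0.3, 0.1, 0.4, 0.12, 0.0, 0.6, 0.8, 1.0, 0.22]

# Dividing each value by the number of comparisons and summing
# is the same as taking the arithmetic mean.
average = sum(score / len(scores) for score in scores)

print(round(average, 3))
```

Every cluster pair counts equally here, regardless of how large or important either cluster is.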

Thanks,

AS

+1  A: 

The problem is poorly defined... With more details it may be possible to comment on the validity of the approach in general (the use of Cosine Similarity, the way it is calculated, etc.) as well as on the way the final result is aggregated.

Essentially, you are averaging the Cosine Similarity values computed for each pair of clusters (Ca, Cb) where Ca is a cluster which user A "has" and Cb a cluster which B "has".

I'm guessing this could be greatly improved by using a weighted average which takes into account the amount of "having" of a cluster that a user can exhibit.
Maybe this "having" relationship is purely Boolean: either a user has a particular cluster or doesn't. Odds are good, though, that his/her "having" can be qualified with an [ordered] categorical attribute or even a numerical value, be it relative (say, the share a given cluster represents among all the clusters the user has) or absolute.
Because each Cosine Similarity is based on a cluster which user A has and a cluster which user B has, the product of the two corresponding "having" measures (provided they are properly normalized) could be used as the coefficient applied to that Cosine Similarity term in the average computation. In this fashion, if two users are effectively similar but one of them happens to have an extra cluster or two with very low "having" factors, the aggregate result won't suffer much from this.
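
To make the idea concrete, here is a minimal sketch of such a weighted average. Everything in it is an assumption made for illustration: the function name, the arrangement of the nine example values as a 3x3 matrix (rows for user A's clusters, columns for user B's), and the "having" weights, which are made up and assumed to be non-negative.

```python
def weighted_user_similarity(sims, weights_a, weights_b):
    """Weighted average of pairwise cluster similarities.

    sims[i][j] is the cosine similarity between cluster i of user A
    and cluster j of user B. weights_a and weights_b are hypothetical
    non-negative "having" measures, one per cluster.
    """
    total = 0.0
    weight_sum = 0.0
    for i, wa in enumerate(weights_a):
        for j, wb in enumerate(weights_b):
            coeff = wa * wb  # product of the two "having" measures
            total += coeff * sims[i][j]
            weight_sum += coeff
    # Dividing by the sum of coefficients keeps the result in the same
    # range as the inputs even if the weights do not sum to 1.
    return total / weight_sum if weight_sum else 0.0


# The nine example values, assumed to be in row-major order.
sims = [
    [0.3, 0.1, 0.4],
    [0.12, 0.0, 0.6],
    [0.8, 1.0, 0.22],
]

# Made-up weights: each user has one dominant cluster and two minor ones.
weights_a = [0.1, 0.1, 0.8]
weights_b = [0.2, 0.7, 0.1]

print(weighted_user_similarity(sims, weights_a, weights_b))
```

With equal weights for every cluster this reduces to the plain average; with skewed weights, the pairs involving each user's dominant clusters drive the result.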

Generally, distance computations (such as Cosine Similarity) as well as aggregation formulas (such as the average or weighted average) are very sensitive to the scale of the individual dimensions (and to their relative "importance"). For this reason it is often hard to provide anything but generic advice such as the above. Theory matters very much with classification problems, but one needs to be mindful of not applying formulas "blindly": it's easy to lose sight of the forest for the trees ;-)


To help improve the question, here's what I generally understand; please complement and correct the question to provide a better "feel" for what you are trying to achieve and what the characteristics of the system are, so that you may receive better suggestions.
We have items which we assume are vector-like objects and which are assigned to clusters. The word "subsets" hints that each item probably belongs to one and only one cluster (or possibly to no cluster at all), but it would be good to confirm that this is the case.
It would also be good to know whether the dimensions of the vectors are normalized in some way (lest a relatively unimportant characteristic of the items, but one with a relatively big range of values, skew the Cosine Similarity or other distance measurements).
We have users which can "have" several clusters. It would be good to know (in broad terms) how a given user comes to "have" clusters, and whether having a cluster is only a Boolean property (to have or not to have) or whether there is some categorical or even numerical measure of the "having" (User X has cluster 1 with a coef of .3 and cluster 8 with a coef of .2, etc.).
The way the Cosine Similarity between two clusters is measured could also be better defined (is it the similarity between the two "centers" of the clusters, or is it something else?).

mjv
A: 

Thank you for replying mjv, and I apologise for the lack of context in my question.

The basic description of what I am trying to achieve is to determine how similar two users of the social bookmarking web service Delicious.com are, based on their bookmarks and tags.

Thus far I have created clusters from the tags of a user's bookmarks and the co-occurrences of each tag. For instance, one cluster could be:

fruit: (apple, 15), (orange, 9), (kiwi, 2)

and another user may have a similar cluster produced from their tags:

fruit: (apple, 12), (strawberry, 7), (orange, 3)

Each number represents how many times that tag co-occurred, in a saved bookmark, with the cluster's tag ("fruit" in this example).
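
As a sketch of the comparison for these two example clusters, each cluster can be treated as a sparse vector of co-occurrence counts keyed by tag, with tags missing from a cluster counting as 0. The function name and the dictionary representation are assumptions made for illustration, not necessarily how the program does it.

```python
import math


def cosine_similarity(cluster_a, cluster_b):
    """Cosine similarity between two tag -> co-occurrence count mappings."""
    # Only tags present in both clusters contribute to the dot product.
    shared_tags = set(cluster_a) & set(cluster_b)
    dot = sum(cluster_a[tag] * cluster_b[tag] for tag in shared_tags)
    norm_a = math.sqrt(sum(count ** 2 for count in cluster_a.values()))
    norm_b = math.sqrt(sum(count ** 2 for count in cluster_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty cluster has no direction to compare
    return dot / (norm_a * norm_b)


# The two "fruit" clusters from the example above.
user1_fruit = {"apple": 15, "orange": 9, "kiwi": 2}
user2_fruit = {"apple": 12, "strawberry": 7, "orange": 3}

similarity = cosine_similarity(user1_fruit, user2_fruit)
print(round(similarity, 3))  # roughly 0.827
```

Because cosine similarity only looks at the direction of the vectors, the absolute sizes of the counts drop out; how much each cluster matters to its user has to be handled separately when aggregating.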

I have used the cosine similarity measure to compare these clusters and determine how similar they are. As in my initial question, once I have many cluster comparison results (from comparing every one of a user's clusters against every one of another user's clusters), I am unsure how to aggregate them into a meaningful result.

I hope that helps you understand the problem better.

It's very possible that I have been using the cosine similarity improperly.

Thanks,

AS

anotherstat