ansaurus

Question

Cosine Similarity of Vectors, with < O(n^2) complexity

Answer 1

+2 A:

If you store the vector elements in a hashtable, lookup is O(1) on average (and O(log n) even with a tree-based map), no? Loop over all keys in the smaller document and see if they exist in the larger one.
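A minimal sketch of this hashtable approach, assuming each document is a java.util.HashMap from term to weight. The class name HashCosine, its method names, and the choice to return 0 for an all-zero vector are illustrative assumptions, not something given in the thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class; the name and methods are illustrative only.
final class HashCosine {

    // Dot product: iterate over the smaller map and look each key up in the larger one.
    static double dot(Map<String, Double> a, Map<String, Double> b) {
        Map<String, Double> small = a.size() <= b.size() ? a : b;
        Map<String, Double> large = (small == a) ? b : a;
        double sum = 0.0;
        for (Map.Entry<String, Double> e : small.entrySet()) {
            Double other = large.get(e.getKey());
            if (other != null) {
                sum += e.getValue() * other;
            }
        }
        return sum;
    }

    // Euclidean length of a sparse vector.
    static double norm(Map<String, Double> v) {
        double sumSq = 0.0;
        for (double x : v.values()) {
            sumSq += x * x;
        }
        return Math.sqrt(sumSq);
    }

    // Cosine similarity; returns 0 for an empty or all-zero vector (a convention assumed here).
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double na = norm(a);
        double nb = norm(b);
        if (na == 0.0 || nb == 0.0) {
            return 0.0;
        }
        return dot(a, b) / (na * nb);
    }

    public static void main(String[] args) {
        Map<String, Double> d1 = new HashMap<>();
        d1.put("cat", 1.0);
        d1.put("dog", 2.0);
        Map<String, Double> d2 = new HashMap<>();
        d2.put("dog", 3.0);
        d2.put("fish", 4.0);
        System.out.println(cosine(d1, d2));
    }
}
```

With average constant-time lookups, one comparison costs time proportional to the size of the smaller document plus the two norm computations. The norms can be cached per document if the same document is compared many times.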

Nicolas78 2010-07-27 18:13:05

Any class you would recommend? I figure this one is pretty good, if memory is an issue: http://www.java2s.com/Code/Java/Collections-Data-Structure/Amemoryefficienthashmap.htm

Ash 2010-07-27 18:18:17

I can't judge that one so quickly, but you can always go with a normal java.util.HashMap to begin with. By the way, since you're saying it's an effect of document collection size: if you compare each document to every other document, you have another quadratic term (now in the number of documents) wrapped around the per-pair cost (at most n*log n). For me, this part has often been far more problematic than the actual cosine computation. Could this be the case for you as well?

Nicolas78 2010-07-27 18:32:55

I trim the cluster set to get the number of comparisons down to a constant, but for something like GAHC you're completely correct: you have an n^2 problem, where n is the number of clusters to be compared.

Ash 2010-07-27 21:34:19

Answer 2

+1 A:

A hash map is good, but it might take a lot of memory.

If your vectors are stored as key-value pairs sorted by key, then the dot product can be computed in O(n), where n is the total number of stored entries: you just have to iterate in parallel over both vectors (the same iteration is used, e.g., in the merge step of merge sort). The pseudocode for the multiplication:

i = 0
j = 0
result = 0
while i < length(vec1) && j < length(vec2):
  if vec1[i].key == vec2[j].key:
    result = result + vec1[i].value * vec2[j].value
    # advance both indices after a match, otherwise the loop never ends
    i = i + 1
    j = j + 1
  else if vec1[i].key < vec2[j].key:
    i = i + 1
  else:
    j = j + 1
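The same merge-style iteration as a Java sketch, assuming each vector is held as two parallel arrays with integer term ids in strictly increasing order. The class SortedSparseVector and its methods are hypothetical names used for illustration, and returning 0 for a zero-length vector is an assumed convention.

```java
// Hypothetical sparse vector type; keys must be sorted in strictly increasing order.
final class SortedSparseVector {
    final int[] keys;      // term ids, sorted ascending, no duplicates
    final double[] values; // values[i] is the weight for keys[i]

    SortedSparseVector(int[] keys, double[] values) {
        if (keys.length != values.length) {
            throw new IllegalArgumentException("keys and values must have the same length");
        }
        this.keys = keys;
        this.values = values;
    }

    // Dot product in O(n + m): walk both sorted key arrays in parallel.
    static double dot(SortedSparseVector a, SortedSparseVector b) {
        int i = 0;
        int j = 0;
        double result = 0.0;
        while (i < a.keys.length && j < b.keys.length) {
            if (a.keys[i] == b.keys[j]) {
                result += a.values[i] * b.values[j];
                i++; // advance both sides after a match
                j++;
            } else if (a.keys[i] < b.keys[j]) {
                i++;
            } else {
                j++;
            }
        }
        return result;
    }

    // Euclidean length of this vector.
    double norm() {
        double sumSq = 0.0;
        for (double v : values) {
            sumSq += v * v;
        }
        return Math.sqrt(sumSq);
    }

    // Cosine similarity; returns 0 if either vector has zero length (assumed convention).
    static double cosine(SortedSparseVector a, SortedSparseVector b) {
        double na = a.norm();
        double nb = b.norm();
        if (na == 0.0 || nb == 0.0) {
            return 0.0;
        }
        return dot(a, b) / (na * nb);
    }
}
```

Parallel primitive arrays avoid the per-entry object overhead of a hash map, which addresses the memory concern above. The cost is that the entries must be sorted once when the vector is built.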

dmitry_vk 2010-07-27 19:22:05

I like this idea, thanks. Is there a Java library which uses this principle?

Ash 2010-07-27 21:31:05

I don't know, but Lucene (http://lucene.apache.org/java/docs/index.html) might contain such an algorithm.

dmitry_vk 2010-07-28 18:09:06

Thanks dmitry_vk, it seems a sorted map would probably be best: http://java.sun.com/j2se/1.4.2/docs/api/java/util/SortedMap.html
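A rough sketch of how the parallel iteration might look on top of java.util.SortedMap, assuming both maps use the natural ordering of their String keys (no custom comparator). The class name SortedMapDot is hypothetical.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.SortedMap;

// Hypothetical helper; assumes both maps are ordered by String's natural ordering.
final class SortedMapDot {

    // Dot product by advancing two entry iterators in key order.
    static double dot(SortedMap<String, Double> a, SortedMap<String, Double> b) {
        Iterator<Map.Entry<String, Double>> ia = a.entrySet().iterator();
        Iterator<Map.Entry<String, Double>> ib = b.entrySet().iterator();
        if (!ia.hasNext() || !ib.hasNext()) {
            return 0.0; // an empty vector shares no keys with anything
        }
        Map.Entry<String, Double> ea = ia.next();
        Map.Entry<String, Double> eb = ib.next();
        double result = 0.0;
        while (true) {
            int cmp = ea.getKey().compareTo(eb.getKey());
            if (cmp == 0) {
                result += ea.getValue() * eb.getValue();
                if (!ia.hasNext() || !ib.hasNext()) {
                    break; // one side is exhausted, no further matches possible
                }
                ea = ia.next();
                eb = ib.next();
            } else if (cmp < 0) {
                if (!ia.hasNext()) {
                    break;
                }
                ea = ia.next();
            } else {
                if (!ib.hasNext()) {
                    break;
                }
                eb = ib.next();
            }
        }
        return result;
    }
}
```

A java.util.TreeMap is the usual SortedMap implementation to pass in. It keeps keys ordered on insertion, but it stores one node object per entry, so it is not necessarily lighter on memory than a HashMap.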

Ash 2010-07-29 03:56:14
