views:

63

answers:

2

Hi,

Having looked around this site for similar issues, I found this: http://math.nist.gov/javanumerics/jama/ and this: http://sujitpal.blogspot.com/2008/09/ir-math-with-java-similarity-measures.html

However, it seems these run in O(n^2). I've been doing some document clustering and noticed this level of complexity wasn't feasible when dealing with even small document sets. Given, for the dot product, we only need the vector terms contained in both vectors it should be possible to put the vectors in a tree and thus compute the dot product with n log n complexity, where n is the lowest number of unique terms in 1 of the 2 documents.

Am I missing something? Is there a java library which does this?

thanks

+2  A: 

If you store the vector elements in a hashtable, lookup is only log n anyway, no? Loop over all keys in the smaller document and see if they exist in the larger one..?

Nicolas78
Any class you would recommend? I figure this one is pretty good, if memory is an issue: http://www.java2s.com/Code/Java/Collections-Data-Structure/Amemoryefficienthashmap.htm
Ash
Wow can't judge this so quickly, but you can always go with a normal java.util.HashMap to begin with. Btw since you're saying it's an effect of document collection size: If you compare each document to each document, you have another quadratic term (now in the number of documents) wrapped around the (n*log n) term. For me, this part has often been far more problematic than the actual cosine computation. Could this be the case for you as well?
Nicolas78
I do trimming on the cluster set to get the comparison down to a constant, but for something like GAHC you're completely correct, you have an n^2 problem, where n is the number of clusters to be compared.
Ash
+1  A: 

Hashmap is good, but it might take a lot of memory.

If your vectors are stored as key-value pairs sorted by key then vector multiplication can be done in O(n): you just have to iterate in parallel over both vectors (the same iteration is used e.g. in merge sort algorithm). The pseudocode for multiplication:

i = 0
j = 0
result = 0
while i < length(vec1) && j < length(vec2):
  if vec1[i].key == vec2[j].key:
    result = result + vec1[i].value * vec2[j].value
  else if vec1[i].key < vec2[j].key:
    i = i + 1
  else
    j = j + 1
dmitry_vk
I like this idea, thanks. Is there a java library which uses this principle?
Ash
I don't know; but lucene (http://lucene.apache.org/java/docs/index.html) might contain such algorithm.
dmitry_vk
Thanks dmitry-vk, it seems a sorted map would probably be best: http://java.sun.com/j2se/1.4.2/docs/api/java/util/SortedMap.html
Ash