To run a simple clustering algorithm on the results I get from Lucene, I need to calculate the cosine similarity between two documents. I also need to be able to build a centroid document to represent each cluster.
All I can think of is building my own vector space model with tf-idf weighting, using the TermFreqVectors and the overall term frequencies to populate it.
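For illustration, the vector space approach described above could look roughly like the sketch below. It deliberately uses plain Java maps rather than any Lucene API. The class and method names (`TfIdfVectors`, `tfIdf`, `cosine`, `centroid`) are hypothetical, and the per-document term counts and the document frequencies are assumed to have been read out of the index already. The idf formula is one common variant, not necessarily the one Lucene uses for scoring.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: sparse tf-idf vectors stored as term -> weight maps.
final class TfIdfVectors {

    // Builds a tf-idf vector for one document from its raw term counts.
    // docFreq maps a term to the number of documents containing it;
    // numDocs is the total number of documents in the collection.
    public static Map<String, Double> tfIdf(Map<String, Integer> termFreq,
                                            Map<String, Integer> docFreq,
                                            int numDocs) {
        Map<String, Double> vec = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreq.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 1);
            // Assumed idf variant: log(N / df) + 1.
            double idf = Math.log((double) numDocs / df) + 1.0;
            vec.put(e.getKey(), e.getValue() * idf);
        }
        return vec;
    }

    // Cosine similarity: dot(a, b) / (|a| * |b|), or 0 if either vector is empty.
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        // Iterate over the smaller map; only shared terms contribute to the dot product.
        Map<String, Double> small = a.size() <= b.size() ? a : b;
        Map<String, Double> large = (small == a) ? b : a;
        double dot = 0.0;
        for (Map.Entry<String, Double> e : small.entrySet()) {
            Double other = large.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double na = norm(a);
        double nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    // Euclidean length of a sparse vector.
    public static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }

    // Centroid of a cluster: the component-wise mean of its document vectors.
    public static Map<String, Double> centroid(List<Map<String, Double>> docs) {
        Map<String, Double> c = new HashMap<>();
        if (docs.isEmpty()) {
            return c;
        }
        for (Map<String, Double> d : docs) {
            for (Map.Entry<String, Double> e : d.entrySet()) {
                c.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        c.replaceAll((term, sum) -> sum / docs.size());
        return c;
    }
}
```

Because the centroid is itself just a term-to-weight map, it can be compared against any document with the same `cosine` method, which is what a k-means-style assignment step would need.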
My question is: this does not seem like an efficient approach. Is there a better way to do it?
This feels a little unclear, so any suggestions on how I can improve my question are also appreciated.