views:

233

answers:

2

Hi all,
First of all, thanks for reading my question.

I computed TF/IDF weights for my documents, and on those values I calculated cosine similarity to see which documents are most similar. You can see the resulting matrix below: the column names are doc1, doc2, doc3, etc., and the row names are the same, doc1, doc2, doc3, etc. With the help of this matrix I can see, for example, that doc1 and doc4 have 72% similarity (0.722711142). That is correct: when I read both documents manually, they are indeed similar.

I have 1000 documents, and the matrix gives me each pairwise value, so I can see which documents are similar. I used different clustering methods, like k-means and AGNES (hierarchical), to group them. They did produce clusters; for example, cluster 1 contains (doc4, doc5, doc3) because their values (0.722711142, 0.602301766, 0.69912109) are close to each other. But when I check those 3 documents manually, they are NOT really similar. :( What am I doing wrong, or should I use something other than clustering?

            doc1          doc2          doc3          doc4          doc5
    doc1    1             0.067305859  -0.027552299   0.602301766   0.722711142
    doc2    0.067305859   1             0.048492904   0.029151952  -0.034714695
    doc3   -0.027552299   0.048492904   1             0.610617214   0.69912109
    doc4    0.602301766   0.029151952   0.610617214   1             0.034410392
    doc5    0.722711142  -0.034714695   0.69912109    0.034410392   1

P.S.: The values may be wrong; they are just to give you an idea. If you have any questions, please do ask. Thanks.
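For reference, a pipeline like the one described above, tf-idf weighting followed by pairwise cosine similarity, can be sketched in plain Python. The tiny corpus and all names below are invented for illustration:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build tf-idf vectors: tf is the relative term frequency per
    document, idf is log(N / document frequency) over the corpus."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # document frequency: count each term once per doc
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens)
        vectors.append({t: (c / total) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["the cat sat on the mat",
        "the cat sat on the hat",
        "stock prices fell sharply today"]
vecs = tfidf_vectors(docs)
sim = [[cosine(a, b) for b in vecs] for a in vecs]
```

With a real corpus you would typically reach for a library implementation, but the result is exactly the kind of symmetric matrix shown above: 1 on the diagonal, and near-zero entries for documents that share no vocabulary.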

A: 

I'm not familiar with TF/IDF, but generally the process can go wrong at many stages:

1. Did you remove stop words?

2. Did you apply stemming? The Porter stemmer, for example.

3. Did you normalize frequencies for document length? (Maybe TF/IDF already handles that, I don't know.)

4. Clustering is a discovery method, not a holy grail. The documents it retrieves as a group may be more or less related, depending on the data, the tuning, the clustering algorithm, etc.

What do you want to achieve? What is your setup? Good luck!

ron
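The preprocessing checklist above (stop-word removal, stemming, length normalization) can be sketched roughly as follows. The stop-word list is a toy one, and the suffix stripper is only a crude stand-in for a real Porter stemmer:

```python
import re
from collections import Counter

# Toy stand-ins for the real components: a tiny stop-word list and a
# crude suffix stripper instead of a full Porter stemmer.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "on"}

def crude_stem(word):
    """Very rough suffix stripping; a real system would use a Porter stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenize, drop stop words, stem, and return length-normalized frequencies."""
    tokens = [crude_stem(t) for t in re.findall(r"[a-z]+", text.lower())
              if t not in STOPWORDS]
    counts = Counter(tokens)
    total = sum(counts.values())  # normalize by document length
    return {t: c / total for t, c in counts.items()}

freqs = preprocess("The cats are sitting on the mats")
```

Each of these steps changes which documents end up looking similar, so it is worth checking them one at a time.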
Hi Ron, thanks for your reply. Yes, I did all the things you mentioned above. I have a big similarity matrix, and now I want to group all the similar documents. For example, if 10 documents are similar to document 15, there should be one cluster containing those 11 docs (docs 1 to 10 plus doc 15). But clustering works on distance, and it groups the documents whose values are closest, say 70%, while the documents in that cluster are all different. :( Is there any other technique you can mention?
The right clustering method depends a lot on the distribution of your document space. You could try the CURE algorithm, or DENCLUE. There are also clusterings that work on a graph representation of the connectivity data, like Markov Clustering (http://www.micans.org/mcl/).
ron
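One simple graph-style option in this spirit (a sketch, not MCL itself): threshold the similarity matrix into a graph and take its connected components as document groups. The threshold value here is an arbitrary assumption:

```python
def threshold_components(sim, threshold=0.5):
    """Treat docs i and j as connected when sim[i][j] >= threshold,
    and return the connected components as document groups."""
    n = len(sim)
    seen = set()
    groups = []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:  # depth-first traversal of the thresholded graph
            i = stack.pop()
            comp.append(i)
            for j in range(n):
                if j not in seen and sim[i][j] >= threshold:
                    seen.add(j)
                    stack.append(j)
        groups.append(sorted(comp))
    return groups

sim = [[1.0, 0.7, 0.1],
       [0.7, 1.0, 0.2],
       [0.1, 0.2, 1.0]]
# docs 0 and 1 are linked at threshold 0.5; doc 2 stands alone
```

This matches the "10 docs similar to doc 15 form one group" behavior asked for above, at the cost of having to pick a sensible threshold.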
+1  A: 

My approach would be not to use pre-calculated similarity values at all, because the similarity between docs should be found by the clustering algorithm itself. I would simply set up a feature space with one column per term in the corpus, so that the number of columns equals the size of the vocabulary (minus stop words, if you want). Each feature value is the relative frequency of the respective term in that document. I guess you could use tf*idf values as well, although I wouldn't expect that to help much. Depending on the clustering algorithm you use, the discriminating power of a particular term should be found automatically: if a term appears in all documents with a similar relative frequency, then that term does not discriminate well between the classes, and the algorithm should detect that.

ferdystschenko
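A sketch of the feature space described above, with one column per vocabulary term and relative term frequencies as values (the toy corpus is invented):

```python
from collections import Counter

def feature_matrix(docs):
    """One row per document, one column per vocabulary term;
    each value is the term's relative frequency in that document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    index = {t: i for i, t in enumerate(vocab)}
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        row = [0.0] * len(vocab)
        for t, c in counts.items():
            row[index[t]] = c / total
        matrix.append(row)
    return vocab, matrix

docs = ["apples and oranges", "apples and pears"]
vocab, X = feature_matrix(docs)
```

Each row of `X` is then a point in the feature space, and any standard clustering algorithm can be run on it directly.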
I am sorry, but I couldn't understand. For my calculation, I compute similarity based on LSI and VSM. Clustering can't help with similarity; if it can, please explain which method and how?
In clustering, a doc is a point in a feature space, and the algorithm groups data points that are close to one another. If the features are word frequencies, then docs that contain the same words, i.e. *similar* docs, will end up in the same group(s). That is all clustering is about: grouping similar data points (here: documents), where similarity depends on the features employed. I think chapters 16 to 18 of this book will guide you well (it's freely available online): http://nlp.stanford.edu/IR-book/information-retrieval-book.html
ferdystschenko