I have to create a dataset from some text files, writing each file as a vector of features.

Something like this:

doc1: 1,0.45 6,0.001 94,0.1 ...

doc2: 3,0.5 98,0.2 ...

...

Each position of the vector represents a word, and the score is given by something like TF-IDF.

Do you know of a library or tool for this? (Java preferred.)
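
For reference, the output format described above could be produced by something like the following. This is only a minimal sketch, not taken from any of the libraries mentioned in this thread: the class name `TfIdfSketch`, the naive tokenizer, and the particular TF-IDF variant (relative term frequency times `ln(N / df)`) are all assumptions made for illustration. Words that occur in every document score 0 and are left out of the sparse vector.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, for illustration only.
public class TfIdfSketch {

    /** Splits text into lowercase word tokens (a deliberately naive tokenizer). */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Returns one line per document: space-separated "index,score" pairs. */
    public static List<String> vectorize(List<String> docs) {
        Map<String, Integer> vocab = new HashMap<>();   // word -> 1-based vector position
        Map<String, Integer> docFreq = new HashMap<>(); // word -> number of docs containing it
        List<Map<String, Integer>> counts = new ArrayList<>();

        // First pass: count terms per document and build the vocabulary.
        for (String doc : docs) {
            Map<String, Integer> tf = new HashMap<>();
            for (String w : tokenize(doc)) {
                vocab.putIfAbsent(w, vocab.size() + 1);
                tf.merge(w, 1, Integer::sum);
            }
            for (String w : tf.keySet()) {
                docFreq.merge(w, 1, Integer::sum);
            }
            counts.add(tf);
        }

        // Second pass: score each term and emit a sparse vector sorted by position.
        List<String> lines = new ArrayList<>();
        for (Map<String, Integer> tf : counts) {
            int total = 0;
            for (int c : tf.values()) {
                total += c;
            }
            TreeMap<Integer, Double> vec = new TreeMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double termFreq = (double) e.getValue() / total;
                double idf = Math.log((double) docs.size() / docFreq.get(e.getKey()));
                double score = termFreq * idf;
                if (score > 0) {
                    vec.put(vocab.get(e.getKey()), score);
                }
            }
            StringBuilder sb = new StringBuilder();
            for (Map.Entry<Integer, Double> e : vec.entrySet()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(e.getKey()).append(',')
                  .append(String.format(Locale.ROOT, "%.4f", e.getValue()));
            }
            lines.add(sb.toString());
        }
        return lines;
    }
}
```

Calling `TfIdfSketch.vectorize(...)` with the contents of each file as one string returns one line per document, in the same order as the input. A real tool would also handle stop words, stemming, and smoothing of the IDF term, which this sketch skips.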

A: 

MALLET. It includes TF-IDF, POS tagging, and classification.

Yin Zhu
A: 

Sure, there are many, e.g. Lucene: http://en.wikipedia.org/wiki/Lucene

However, I recommend that you write a basic IR system from scratch. Looking under the hood is always a great learning experience.

Darknight
I know, but my time is finite, and TF-IDF looks pretty easy to implement.
BigG
I didn't mean just the TF-IDF algorithm. I meant end to end: from file parsing and indexing to searching, ranking, etc.
Darknight
+1  A: 

After a few days I found the "perfect tool" for this: Word Vector Tool. http://sourceforge.net/projects/wvtool/

BigG