I have 100 GB of documents, and I would like to characterize the collection and get a general sense of which topics are prevalent.

The documents are plain text.

I have considered using a tool like Google Desktop to search, but the set is too large to guess what to search for, and running enough searches to cover it all would be too time consuming.

Are there any freely available tools that will cluster a large dataset of documents?

Are there any such tools that can visualize such clusters?

A: 

You need to look into tools that do natural language processing. Using statistical techniques, you can quite reliably determine a document's language (see http://en.wikipedia.org/wiki/N-gram) and its domain of discourse (see http://en.wikipedia.org/wiki/Support_vector_machine). Starting from the Wikipedia articles should lead you to available tools.
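
A minimal sketch of the character n-gram idea in Python, using only the standard library; the toy reference texts are placeholders for real per-language profiles, and the SVM step for domain classification is not shown:

    from collections import Counter

    def char_ngrams(text, n=3):
        """Frequency profile of character n-grams for a text."""
        text = text.lower()
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    def overlap(profile_a, profile_b, top=100):
        """Shared most-frequent n-grams: a crude language-closeness score."""
        a = {g for g, _ in profile_a.most_common(top)}
        b = {g for g, _ in profile_b.most_common(top)}
        return len(a & b)

    # Toy reference profiles; real ones are built from large per-language corpora.
    profiles = {
        "english": char_ngrams("the quick brown fox jumps over the lazy dog "
                               "and this is the way it is written in english"),
        "german": char_ngrams("der schnelle braune fuchs springt ueber den "
                              "faulen hund und das ist auf deutsch geschrieben"),
    }

    doc = char_ngrams("this document is written in english and it is about text")
    print(max(profiles, key=lambda lang: overlap(doc, profiles[lang])))  # english

With realistic profiles built from a few megabytes of text per language, this kind of n-gram comparison is what makes language identification reliable in practice.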

Toader Mihai Claudiu
A: 

For a basic NLP approach, you could represent each document as a vector of word frequencies, then cluster the document vectors using Bayesian or other methods (SVM, k-means, etc.), as sketched below.
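
A minimal sketch of that pipeline, assuming scikit-learn (a library choice not named in the answer) and a toy corpus; for 100 GB you would stream documents in batches, e.g. with MiniBatchKMeans, rather than load everything at once:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    # Toy corpus standing in for the real document set.
    docs = [
        "machine learning and statistical models",
        "neural networks learn statistical representations",
        "stock market trading and financial risk",
        "bond yields interest rates and inflation",
    ]

    # Represent each document as a (TF-IDF weighted) word-frequency vector.
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(docs)

    # Cluster the vectors; the number of clusters is a guess to be tuned.
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

    # Top-weighted terms near each centroid give a rough topic label.
    terms = vectorizer.get_feature_names_out()
    for i, center in enumerate(km.cluster_centers_):
        top_terms = [terms[j] for j in center.argsort()[::-1][:3]]
        print(f"cluster {i}: {top_terms}")

Printing the highest-weighted terms per centroid is a cheap way to get the "general sense of what topics are prevalent" the question asks for, without reading the documents.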

For related answers, see this somewhat similar SO question.

bubaker