Hi, I wanted to know whether it is possible to use decision trees for document classification and, if so, how the data should be represented. I know how to use the R package party for decision trees.
I doubt it -- at least as typically defined, a decision tree uses a single criterion to specify a sub-branch. In classifying documents, you can rarely base much of anything on a single criterion -- you need multiple criteria, and even then you don't get a clear-cut tree-like decision, but a "this is a bit closer to that than the other thing" kind of result.
Hi Neo,
One way is to have a huge matrix where each row is a document and each column is a word. The value in each cell is the number of times that word appeared in that document.
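To make the layout concrete, here is a minimal sketch of such a matrix. It is written in plain Python rather than R just to keep it self-contained; the three documents are made up, and the whitespace tokenizer is a simplifying assumption (a real pipeline would also deal with punctuation and stop words).

```python
from collections import Counter

# Toy corpus; the documents are made up for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as markets closed",
]

# Naive whitespace tokenization (an assumption, not a recommendation).
tokenized = [doc.lower().split() for doc in docs]

# The columns are the sorted vocabulary across all documents.
vocab = sorted({word for tokens in tokenized for word in tokens})

# One row per document, one column per word, each cell a count.
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)  # missing words count as 0
    matrix.append([counts[word] for word in vocab])

for doc, row in zip(docs, matrix):
    print(row, "<-", doc)
```

Most cells end up as zero, which is why real document-term matrices are usually stored in a sparse format.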
Then, if you are dealing with a supervised learning case, you should have another column holding the class label of each document, and from there you can use a function like rpart (from the rpart package) to create your classification tree. You pass rpart a formula, much as you would for a linear model (lm).
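To illustrate what a classification tree does with word counts and labels, the sketch below searches for the single best split: one word and one count threshold, scored by Gini impurity. This is only an illustration in plain Python on made-up data, not rpart itself; the helper names gini and best_split are hypothetical, and a real tree would repeat the search recursively on each side of the split.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (column, threshold, impurity) of the best single split.

    A row goes left when rows[i][column] <= threshold, right otherwise.
    """
    n = len(rows)
    best = None
    for col in range(len(rows[0])):
        # Every observed value except the largest is a candidate threshold.
        for threshold in sorted({row[col] for row in rows})[:-1]:
            left = [lab for row, lab in zip(rows, labels) if row[col] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[col] > threshold]
            # Weighted average impurity of the two child nodes.
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or impurity < best[2]:
                best = (col, threshold, impurity)
    return best

# Made-up word counts; the columns are the words in vocab.
vocab = ["goal", "match", "election"]
rows = [
    [3, 1, 0],
    [2, 2, 0],
    [0, 0, 4],
    [0, 1, 2],
]
labels = ["sports", "sports", "politics", "politics"]

col, threshold, impurity = best_split(rows, labels)
print(f"split on count of '{vocab[col]}' <= {threshold} (impurity {impurity:.2f})")
```

Each node of the tree tests one word count like this, so a deep tree over a large vocabulary can combine many such tests.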
If you want, you could also first collect your words into "groups of words", and then have each column correspond to a different group, with a number indicating how many words in the document belong to that group. For that I would have a look at the "tm" package. (If you end up doing something with that, please consider posting about it here, so we can learn from it.)
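A rough sketch of the grouped version, again in plain Python and not using tm: the word groups here are hand-made and purely hypothetical, and in practice they might come from stemming, a thesaurus, or clustering.

```python
from collections import Counter

# Hypothetical hand-made word groups, for illustration only.
word_groups = {
    "animal": {"cat", "dog", "horse"},
    "finance": {"stocks", "markets", "bank"},
}

# Invert the mapping so each word points at its group.
group_of = {word: group for group, words in word_groups.items() for word in words}

def group_counts(text):
    """Count how many words of the text fall into each group."""
    counts = Counter(group_of[w] for w in text.lower().split() if w in group_of)
    # Fixed column order (the order of word_groups), zero when a group is absent.
    return [counts[group] for group in word_groups]

docs = ["the cat chased the dog", "stocks fell as markets closed"]
for doc in docs:
    print(group_counts(doc), "<-", doc)
```

This shrinks the matrix from one column per word to one column per group, which gives the tree far fewer candidate splits to consider.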
Best, Tal
This paper gives a survey of different text categorization techniques and their accuracies. In short, you can categorize text with decision trees, but there are other algorithms that are much better.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1-47. Available from: http://arxiv.org/abs/cs.IR/0110053v1.