Hi, I wanted to know whether it is possible to use decision trees for document classification, and if so, how the data should be represented. I know how to use the R package party for decision trees.

A: 

I doubt it -- at least as typically defined, a decision tree uses a single criterion to specify a sub-branch. In classifying documents, you can rarely base much of anything on a single criterion -- you need multiple criteria, and even then you don't get a clear-cut tree-like decision, but a "this is a bit closer to that than the other thing" kind of result.

Jerry Coffin
I think the OP is referring more to classification trees than decision trees. Some ambiguity in the terminology here.
Matt Parker
... as outlined by the second paragraph of this Wikipedia page: http://en.wikipedia.org/wiki/Decision_tree_learning
Matt Parker
Yes, I agree with Matt; there is slight ambiguity on my part, sorry for that. I meant classifying the documents using decision trees.
Neo_Me
+1  A: 

Hi Neo,

One way is to have a huge matrix where each row is a document and each column is a word, with each cell holding the number of times that word appeared in that document.

Then, if you are dealing with a "supervised learning" case, you should have another column for the class label, and from there you can use a command like "rpart" (from the rpart package) to create your classification tree. You would pass rpart a formula, in a similar fashion to fitting a linear model (lm).
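Something along these lines should work (an untested sketch with made-up counts; the column and class names are just for illustration):

```r
# A tiny document-term data frame: rows are documents, columns are
# word counts, plus a factor column with the class label.
library(rpart)

docs <- data.frame(
  apple  = c(3, 0, 1, 4, 0, 2),
  stock  = c(0, 5, 2, 0, 6, 1),
  market = c(1, 4, 3, 0, 5, 0),
  class  = factor(c("fruit", "finance", "finance",
                    "fruit", "finance", "fruit"))
)

# Same formula interface as lm(): model the class on all word columns.
# minsplit is lowered only because this toy data set is so small.
fit <- rpart(class ~ ., data = docs, method = "class",
             control = rpart.control(minsplit = 2))

# Classify a new document from its word counts
predict(fit, newdata = data.frame(apple = 2, stock = 0, market = 1),
        type = "class")
```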

If you want, you could also first group your words into "groups of words", and then have each column belong to a different group, with a number indicating how many words in the document belonged to that group. For that I would have a look at the "tm" package. (If you end up doing something with that, please consider posting about it here, so we could learn from it.)
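For the matrix-building step, tm can do most of the work (an untested sketch with made-up texts):

```r
# Build a document-term matrix with the tm package
library(tm)

texts <- c("the stock market fell",
           "apples and oranges are fruit",
           "the market rallied on stock news")

corpus <- VCorpus(VectorSource(texts))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

dtm <- DocumentTermMatrix(corpus)  # rows = documents, columns = terms
m <- as.matrix(dtm)                # plain matrix you can cbind() with a class label
```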

Best, Tal

Tal Galili
Hi Tal, thanks for the pointers. I actually did compute a word-document matrix and an association matrix of the most commonly co-occurring words. I still have to compute the rpart trees, but I am heading in the direction you pointed. The functions in the tm package were also a great help. I will post code here once I get some results. -Neo
Neo_Me
My pleasure Neo :)
Tal Galili
There is one problem with decision trees -- they are prone to overfitting. I would suggest you try the random forest method (available in the randomForest package), which is free from this drawback.
mbq
Hi mbq, from what I read here: http://en.wikipedia.org/wiki/Random_forest#Disadvantages I see that random forests can also overfit (and I think this case might be one of them, since I expect there to be many variables that are just noise). What do you think? Tal
Tal Galili
That statement is based on a very specific attempt to break RF, and it shows behavior that is shared with other classifiers. Even more, in the case of a high level of noise, RF's attribute-importance measure works pretty well and may be used to clean the feature set and improve accuracy.
mbq
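To illustrate the point about importance measures, something like this should work (an untested sketch on simulated count data, not on the OP's documents):

```r
# Random forest on document-term-style count data, with variable
# importance used to spot the noise columns.
library(randomForest)

set.seed(1)
# 200 "documents" x 10 "terms" of Poisson counts; only the first two
# columns actually drive the (made-up) class label, the rest are noise.
X <- data.frame(matrix(rpois(200 * 10, lambda = 1), ncol = 10))
y <- factor(ifelse(X[[1]] + X[[2]] > 2, "a", "b"))

rf <- randomForest(X, y, ntree = 200, importance = TRUE)
importance(rf)  # the eight noise columns should score low here
```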
+1  A: 

This paper gives a survey of different text categorization techniques and their accuracies. In short, you can categorize text with decision trees, but there are other algorithms that are much better.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1-47. Available from: http://arxiv.org/abs/cs.IR/0110053v1.

Ken Bloom