Hi, I wanted to know whether it is possible to use decision trees for document classification, and if so, how the data should be represented. I know how to use the R package party for decision trees.

A: 

I doubt it -- at least as typically defined, a decision tree uses a single criterion to specify a sub-branch. In classifying documents, you can rarely base much of anything on a single criterion -- you need multiple criteria, and even then you don't get a clear-cut tree-like decision, but a "this is a bit closer to that than the other thing" kind of result.

Jerry Coffin
I think the OP is referring more to classification trees than decision trees. Some ambiguity in the terminology here.
Matt Parker
... as outlined by the second paragraph of this Wikipedia page: http://en.wikipedia.org/wiki/Decision_tree_learning
Matt Parker
Yes, I agree with Matt; there is slight ambiguity on my part, sorry for that. I meant classifying the documents using decision trees.
Neo_Me
+1  A: 

Hi Neo,

One way is to have a huge matrix where each row is a document and each column is a word, with each cell holding the number of times that word appeared in that document.

Then, if you are dealing with a "supervised learning" case, you should have another column for the class label, and from there you can use a command like "rpart" (from the rpart package) to create your classification tree. You would pass rpart a formula, in a similar fashion to fitting a linear model (lm).
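Something along these lines should work (an untested sketch with made-up counts; the column and class names are just for illustration):

```r
# A tiny document-term data frame: rows are documents, columns are
# word counts, plus a factor column with the class label.
library(rpart)

docs <- data.frame(
  apple  = c(3, 0, 1, 4, 0, 2),
  stock  = c(0, 5, 2, 0, 6, 1),
  market = c(1, 4, 3, 0, 5, 0),
  class  = factor(c("fruit", "finance", "finance",
                    "fruit", "finance", "fruit"))
)

# Same formula interface as lm(): model the class on all word columns.
# minsplit is lowered only because this toy data set is so small.
fit <- rpart(class ~ ., data = docs, method = "class",
             control = rpart.control(minsplit = 2))

# Classify a new document from its word counts
predict(fit, newdata = data.frame(apple = 2, stock = 0, market = 1),
        type = "class")
```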

If you want, you could also first group your words into "groups of words", and then have each column belong to a different group, with a number indicating how many words in the document belonged to that group. For that I would have a look at the "tm" package. (If you end up doing something with that, please consider posting about it here, so we could learn from it.)
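For the matrix-building step, tm can do most of the work (an untested sketch with made-up texts):

```r
# Build a document-term matrix with the tm package
library(tm)

texts <- c("the stock market fell",
           "apples and oranges are fruit",
           "the market rallied on stock news")

corpus <- VCorpus(VectorSource(texts))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

dtm <- DocumentTermMatrix(corpus)  # rows = documents, columns = terms
m <- as.matrix(dtm)                # plain matrix you can cbind() with a class label
```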

Best, Tal

Tal Galili
Hi Tal, thanks for the pointers. I actually did compute a word-document matrix and an association matrix of the most commonly co-occurring words. I still have to compute the rpart trees, but I am heading in the direction you pointed. The functions in the tm package were also a great help. I will post code here once I get some results. -Neo
Neo_Me
My pleasure Neo :)
Tal Galili
There is one problem with decision trees -- they are prone to overfitting. I would suggest you try the random forest method (available in the randomForest package), which is free from this drawback.
mbq
Hi mbq, from what I read here: http://en.wikipedia.org/wiki/Random_forest#Disadvantages I see that random forests can also overfit (and I think this case might be one of them, since I expect there to be many variables that are just noise). What do you think? Tal
Tal Galili
That statement is based on a very specific attempt to break RF, and it shows behavior that is shared with other classifiers. Even more, in the case of a high level of noise, RF's attribute-importance measure works pretty well and may be used to clean the feature set and improve accuracy.
mbq
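To illustrate the point about importance measures, something like this should work (an untested sketch on simulated count data, not on the OP's documents):

```r
# Random forest on document-term-style count data, with variable
# importance used to spot the noise columns.
library(randomForest)

set.seed(1)
# 200 "documents" x 10 "terms" of Poisson counts; only the first two
# columns actually drive the (made-up) class label, the rest are noise.
X <- data.frame(matrix(rpois(200 * 10, lambda = 1), ncol = 10))
y <- factor(ifelse(X[[1]] + X[[2]] > 2, "a", "b"))

rf <- randomForest(X, y, ntree = 200, importance = TRUE)
importance(rf)  # the eight noise columns should score low here
```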
+1  A: 

This paper gives a survey of different text categorization techniques and their accuracies. In short, you can categorize text with decision trees, but there are other algorithms that are much better.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1-47. Available from: http://arxiv.org/abs/cs.IR/0110053v1.

Ken Bloom