Before reinventing the wheel, I need to know whether such a method already exists. Stripping words according to a given list does not sound challenging, but there are linguistic aspects to consider: which words are the most important to strip, and what about context?

+2  A: 

It sounds like you are trying to remove the "stop words" from the text. You can find a list of English stop words at the link. Depending on how many stop words you use, it may be more efficient to create a HashSet of words, so that you can tell whether a word is a stop word in constant time (by using the contains() function). Filtering the entire text would then take time linear in the number of words. This is such a simple operation that I doubt you will find a library dedicated to it, but it shouldn't take long to write.
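A minimal sketch of the HashSet approach, assuming Java. The class name StopWordFilter, the tiny word list, and the naive tokenizer (split on anything that is not a letter) are all placeholders for illustration; you would load a fuller list and probably tokenize more carefully.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StopWordFilter {
    // Tiny illustrative list (an assumption); a real application would load a fuller one.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "of", "and", "is", "on"));

    // Keeps every word that is not in the stop-word set, preserving order.
    // Each contains() lookup is expected constant time, so the whole pass
    // is linear in the number of words.
    static List<String> filter(String text, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        // Naive tokenization: split on runs of non-letter characters.
        for (String token : text.split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue; // split can yield a leading empty string
            }
            // Compare in lower case so "The" matches "the".
            if (!stopWords.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // prints [cat, sat, mat]
        System.out.println(filter("The cat sat on the mat.", STOP_WORDS));
    }
}
```

Passing the set in as a parameter, rather than hard-coding it, makes it easy to swap in different word lists later.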

As for choosing which words to use, it really depends on what you are trying to do. If you are running some sort of machine learning algorithm on a bag-of-words model, then you really have to try different selections of words and see which ones lead to the lowest validation error. As for context, a lot of words really aren't needed. Anyone who speaks English well can tell when a "the", "a", or "an" has been dropped, so little is lost by removing them. There may be common words that matter for certain kinds of disambiguation, but depending on your application, they may or may not be necessary. For example, if you want to know who did something, then eliminating "he", "she", etc. might be a problem, but if you only care about whether such-and-such an action occurred and you don't really care who did it, then eliminating pronouns would be just fine.
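To make the pronoun example concrete, here is a small sketch, again assuming Java. The class StopWordChoice and both word lists are hypothetical and deliberately tiny. It strips the same sentence with two different stop-word sets, one for an application that needs to know who acted and one that does not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class StopWordChoice {
    // Illustrative lists only (assumptions), not complete stop-word lists.
    static final Set<String> ARTICLES = new HashSet<>(Arrays.asList("a", "an", "the"));
    static final Set<String> PRONOUNS = new HashSet<>(Arrays.asList("he", "she", "it", "they"));

    // Returns a new set containing the words of both inputs.
    static Set<String> union(Set<String> first, Set<String> second) {
        Set<String> result = new HashSet<>(first);
        result.addAll(second);
        return result;
    }

    // Drops every word found in stopWords; expects lower-case input words.
    static List<String> strip(List<String> words, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String word : words) {
            if (!stopWords.contains(word)) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("she", "opened", "the", "door");
        // "Who did it?" matters: keep pronouns. prints [she, opened, door]
        System.out.println(strip(words, ARTICLES));
        // Only the action matters: drop pronouns too. prints [opened, door]
        System.out.println(strip(words, union(ARTICLES, PRONOUNS)));
    }
}
```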

Michael Aaron Safyan