Java: remove-common-words-method in the API?

It sounds like you are trying to remove the "stop words" from the text. You can find a list of English stop words at the link. Depending on how many stop words you use, it may be more efficient to put them in a HashSet, so that you can test whether a word is a stop word in (expected) constant time using the contains() method. Filtering the entire text then takes time linear in the number of words. This is such a simple operation that I doubt you will find a library for it, but it shouldn't take long to write yourself.
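To make the HashSet idea concrete, here is a minimal sketch. The class name `StopWordFilter`, the method `removeStopWords`, and the tiny word list are all made up for illustration, and the tokenizer is deliberately naive (it splits on anything that is not a letter, so "don't" becomes "don" and "t"). Treat it as a starting point rather than a finished implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopWordFilter {
    // Illustrative list only (an assumption); a real stop-word list is much longer.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "of", "and", "to", "in", "is"));

    // Returns the words of the text, in their original order, with stop words dropped.
    static List<String> removeStopWords(String text, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        // Naive tokenizer: split on runs of non-letter characters.
        for (String word : text.split("[^\\p{L}]+")) {
            if (word.isEmpty()) {
                continue; // a leading delimiter produces an empty first token
            }
            // Lower-case only for the lookup, so the kept word retains its original case.
            if (!stopWords.contains(word.toLowerCase(Locale.ROOT))) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // prints [cat, sat, hat]
        System.out.println(removeStopWords("The cat sat in the hat.", STOP_WORDS));
    }
}
```

Passing the stop-word set in as a parameter makes it easy to try different word selections against the same text.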

As for choosing which words to use... it really depends on what you are trying to do. If you are running some sort of machine learning algorithm on a bag-of-words model, then you really have to try different selections of words and see which ones lead to the least validation error. As far as context goes, a lot of words really aren't needed: anyone who speaks English well can tell where a "the", "a", or "an" has been dropped, so removing them loses little. Some common words do matter for disambiguation, but whether you need them depends on your application. For example, if you want to know who did something, then eliminating "he", "she", etc. might be a problem, but if you only care about whether such-and-such an action occurred and don't care who did it, then eliminating pronouns would be just fine.
