Typically, these words appear across documents with the highest frequency.
Assuming you have a global list of words:
{ word: count }
With that list, if you order the words from highest count to lowest and plot count (y-axis) against rank (x-axis), you get a curve resembling an inverse log function (a Zipf-like distribution). The stop words cluster at the left, and the cutoff point for the "stop words" is where the magnitude of the first derivative is greatest, i.e. the sharpest drop in the curve.
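As a rough sketch of this idea (the sample documents and the "largest drop between adjacent counts" heuristic are my own illustration, not a prescribed algorithm):

```python
from collections import Counter

def find_stop_words(documents):
    """Rank words by total frequency and cut the list at the largest
    drop between consecutive counts -- a crude 'elbow' heuristic."""
    counts = Counter(word for doc in documents for word in doc.lower().split())
    ranked = counts.most_common()  # [(word, count), ...] highest first
    if len(ranked) < 2:
        return []
    # The biggest drop between adjacent counts marks the cutoff rank.
    drops = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    cutoff = drops.index(max(drops)) + 1
    return [word for word, _ in ranked[:cutoff]]

docs = [
    "the cat sat on the mat",
    "the dog and the cat",
    "a dog chased the cat",
]
print(find_stop_words(docs))  # -> ['the']
```

On a real corpus you would want many more documents, and possibly a smoothed derivative rather than the single largest drop, but the shape of the computation is the same.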
This approach is better than a dictionary-based approach in several ways:
- It is universal: it is not bound to any particular language
- It learns which words are actually behaving as "stop words" in your data
- It produces better results for collections that are very similar, surfacing the words that are unique to individual items in the collection
- The stop words can be recalculated later (so you can cache them and statistically detect when the stop words have drifted since they were last computed)
- It can also eliminate time-sensitive or informal words and names (such as slang, or a company name that appears as a header in a batch of documents)
The dictionary-based approach is better in that:
- Lookup time is much faster
- The results are precomputed
- It's simple
- Someone else has already compiled the stop words
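For contrast, the dictionary-based alternative amounts to a membership test against a precompiled set, which is O(1) on average per word (the word list here is just a tiny example, not a real stop word dictionary):

```python
# Example stop word list; a real one would be far longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}

def strip_stop_words(text):
    """Remove any word found in the precompiled stop word set."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

print(strip_stop_words("The cat sat on a mat"))  # -> ['cat', 'sat', 'on', 'mat']
```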