Hello, I would like to know how to implement a solution to the following task:
I have a 500 MB file of plain English text.
I'd like to collect statistics on word frequency,
but I also want each word (or at least the majority of words) to be classified correctly by part of speech.
For example, 'cry' in the sentence "She gave a loud cry" should be counted as a noun, while 'cry' in "Do not cry" should be counted as a verb.
It would also be good to filter out proper names so that they go into a separate dictionary.
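To make the first task concrete, here is a rough sketch of the counting step as I imagine it. It assumes some part-of-speech tagger (not shown) has already turned each sentence into tokens of the form word_TAG with Penn Treebank-style tags (NN for noun, VB for verb, NNP for proper noun). The class name TaggedWordCounter, the word_TAG input format, and the hand-tagged sample sentences are my own assumptions for illustration, not the output of any particular tool.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: counts (word, tag) pairs from pre-tagged text.
final class TaggedWordCounter {
    // Keys look like "cry/NN"; values are occurrence counts.
    final Map<String, Integer> wordCounts = new TreeMap<>();
    // Proper names (tags starting with NNP) go into their own dictionary.
    final Map<String, Integer> properNameCounts = new TreeMap<>();

    // Assumption: tokens are whitespace-separated and shaped like word_TAG.
    void addTaggedSentence(String tagged) {
        for (String token : tagged.trim().split("\\s+")) {
            int sep = token.lastIndexOf('_');
            if (sep <= 0 || sep == token.length() - 1) {
                continue; // skip tokens without both a word and a tag
            }
            String word = token.substring(0, sep);
            String tag = token.substring(sep + 1);
            if (!Character.isLetter(word.charAt(0))) {
                continue; // skip punctuation and numbers
            }
            if (tag.startsWith("NNP")) {
                properNameCounts.merge(word, 1, Integer::sum);
            } else {
                String key = word.toLowerCase(Locale.ROOT) + "/" + tag;
                wordCounts.merge(key, 1, Integer::sum);
            }
        }
    }

    public static void main(String[] args) {
        TaggedWordCounter counter = new TaggedWordCounter();
        // Hand-tagged sample input, for illustration only.
        counter.addTaggedSentence("Mary_NNP gave_VBD a_DT loud_JJ cry_NN ._.");
        counter.addTaggedSentence("Do_VB not_RB cry_VB ._.");
        System.out.println("Words: " + counter.wordCounts);
        System.out.println("Proper names: " + counter.properNameCounts);
    }
}
```

With this, 'cry' ends up under two separate keys (one per tag), which is the behaviour I am after. The part I do not know how to do is the tagging itself.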
The second task is more difficult. I would like to find words that frequently occur together and build a list of such combinations,
for example "green grass", "beautiful girl", "handle carefully", "you are right", so that we could say exactly which word sequences are commonly used together in the language.
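For the second task, the simplest thing I can think of is counting adjacent word pairs. Below is a minimal sketch under naive assumptions: sentences end at '.', '!' or '?', a word is a run of ASCII letters or apostrophes, and only two-word sequences are counted. The class name BigramCounter is hypothetical. Raw counts like these would presumably be dominated by pairs of very common words, so I assume a real solution needs some better measure of association.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: counts adjacent word pairs within each sentence.
final class BigramCounter {

    // Returns a map from "word1 word2" to the number of times the pair occurs.
    static Map<String, Integer> countBigrams(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Assumption: sentences end at '.', '!' or '?'.
        for (String sentence : text.split("[.!?]+")) {
            // Assumption: a word is a run of ASCII letters or apostrophes.
            String[] words = sentence.toLowerCase(Locale.ROOT).trim().split("[^a-z']+");
            for (int i = 0; i + 1 < words.length; i++) {
                if (words[i].isEmpty() || words[i + 1].isEmpty()) {
                    continue; // split can yield an empty leading element
                }
                counts.merge(words[i] + " " + words[i + 1], 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String sample = "The green grass grows. You are right, the green grass is wet.";
        // Print only the pairs that were seen more than once.
        countBigrams(sample).forEach((pair, n) -> {
            if (n > 1) {
                System.out.println(pair + " -> " + n);
            }
        });
    }
}
```

Pairs are not counted across sentence boundaries here, so "grows you" in the sample is ignored. Longer sequences such as "you are right" would need the same loop over three words.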
How would I start? Are there open-source Java tools and good books on the subject?