
Hello, I would like to know how to approach the following task:

There's a 500 MB file of plain English text.

I'd like to collect statistics on word frequency, and additionally to be sure that each word (or at least the majority of words) is classified correctly.

That is, 'cry' in the sentence "she gave a loud CRY" would be counted as a noun, while in "Do not cry" it would be counted as a verb.

Also, it would be good to filter out proper names, so that they form a separate dictionary.

The other task would be more difficult. I would like to find the occurrences of words that frequently appear together, and to build a list of such co-occurrences.

Let's say, "green grass", "beautiful girl", "handle carefully", "you are right" — so that we could say exactly which word sequences are often used together in the language.

HOW WOULD I START? Are there open-source Java tools and good books on the subject?

A: 

Take a look at GROK.

Adamski
+3  A: 

An excellent introduction to these topics is Foundations of Statistical Natural Language Processing.

On the software side, you could look at things like the Stanford Part-Of-Speech Tagger or LingPipe.

Fabian Steeg
"Foundations.." is a huge book, isn't it?
EugeneP
@EugeneP: True, but it covers the topics you're interested in very well and you asked for books specifically :-). Also you can focus on the chapters you're interested in, no need to read it cover to cover.
Fabian Steeg
@Fabian Steeg. thanks
EugeneP
A: 

Your "other task", which "would be more difficult", is far simpler than the original task of differentiating cry (v.) from cry (n.). What you are trying to do is generate a concordance (a handy search term). Tools already exist to do this for you, and given the popularity of English, I'd be surprised if you couldn't find one that even handles inflections for you, without you having to do any of the hard work.

Paul Butcher
@Paul Butcher Can you give me some links? Even if it's not in Java.
EugeneP
A: 

Your "other task" seems to be just a Markov chain problem. If you are interested in combinations of two words, you just need to read through your text one word at a time, building a dictionary (hash, table, whatever) where the key is the pair (previous word, current word) and the value is the count.

So for input text "home is where the home is", you get

nil, home: 1   (ignore this)
home, is: 2
is, where: 1
where, the: 1
the, home: 1
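The counting step above can be sketched in Java like this (a minimal sketch: tokenization here is naive whitespace splitting on lowercased text, and for a 500 MB file you would stream the input line by line rather than hold it all in memory):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BigramCount {

    // Count adjacent word pairs; each key is "previous current".
    static Map<String, Integer> bigrams(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        String[] words = text.toLowerCase().split("\\s+");
        for (int i = 1; i < words.length; i++) {
            String key = words[i - 1] + " " + words[i];
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Reproduces the worked example above.
        System.out.println(bigrams("home is where the home is"));
        // prints {home is=2, is where=1, where the=1, the home=1}
    }
}
```

Extending this to triples (or longer runs) is just a matter of widening the key to three words; the memory cost grows with the number of distinct n-grams, not with n itself.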
Shadowfirebird
No, the combination may contain extra words that we must filter out, such as in "I've seen her already" = have..already; "I've already seen her" = have..already.
EugeneP
+1  A: 

You might want to take a look at http://l2r.cs.uiuc.edu/~cogcomp/asoftware.php?skey=FLBJPOS as well.

reprogrammer
Thanks a lot. I'm looking for something already implemented, to see how things work in real life. Any implemented solution is welcome.
EugeneP
It is implemented. You can download the POS Tagger from the link I provided above.
reprogrammer
A: 

You might be interested in Introduction to Linguistic Annotation and Text Analytics, a book which is focused very heavily on software tools for text annotation and text analysis. It has no focus whatsoever on natural language processing theory, but can serve as a good introduction to current NLP software tools.

(Be forewarned that because of this focus, it will probably become obsolete very quickly. If you can borrow it from a library, you probably should do that instead of buying it.)

Ken Bloom