I love to eat chicken.
Today I went running, swimming and played basketball.

My objective is to return FOOD and SPORTS just by analyzing these two sentences. How can I do that?

I am familiar with NLP and WordNet. But is there some more high-level, practical, modern technology?

Is there anything that automatically categorizes words for you into "levels"?

More importantly, what is the technical term for this process?

+1  A: 

Google Sets does some of this, and there is some discussion that mentions supersets. However, I have not really seen any technical details in there, just ideas and discussion.

Maybe this could at least help your research...

Doug L.
I entered the items on my wife's bedside table and it came up with 'terrorism'
Pete Kirkham
We're watching you, Kirkham.
Jonathan Feinberg
+2  A: 

That problem is difficult to solve procedurally, but much progress has been made in the area lately.

Most natural language processing begins with a grammar (which may or may not be context-free). It's a set of construction rules stating how more general things are built out of more specific ones.

Example context-free grammar:

Sentence ::= NounPhrase VerbPhrase
NounPhrase ::= ["The"] [Adjective] Noun
Adjective ::= "big" | "small" | "red" | "green"
Noun ::= "cat" | "man" | "house"
VerbPhrase ::= "fell over"

This is obviously oversimplified; the task of writing a complete grammar for all of English is enormous, and most real systems define only the subset applicable to their problem domain.

Once a grammar has been defined (or learned, using complicated algorithms known only to the likes of Google), a string called an "exemplar" is parsed according to the grammar, which tags each word with a part of speech. A very complex grammar would have not just the parts of speech you learned in school, but categories such as "websites", "names of old people", and "ingredients".
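To make the parsing step concrete, here is a minimal sketch of the toy grammar above using NLTK's chart parser (assuming a recent NLTK; its grammar notation differs slightly from the BNF used earlier):

import nltk

# the toy grammar from above, rewritten in NLTK's CFG notation
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det Adj N | Det N | Adj N | N
Det -> 'the'
Adj -> 'big' | 'small' | 'red' | 'green'
N -> 'cat' | 'man' | 'house'
VP -> 'fell' 'over'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the big cat fell over".split()):
    print(tree)  # (S (NP (Det the) (Adj big) (N cat)) (VP fell over))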

These categories can be laboriously built into the grammar by humans, or inferred using techniques like Analogical Modeling or Support Vector Machines. In either approach, things like "chicken", "football", "BBQ", and "cricket" would be defined as points in a very high-dimensional space, along with millions of other points, and clustering algorithms would then define groups based purely on the positions of those points relative to each other. One might then try to infer names for the groups from example text.
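A tiny sketch of that clustering idea, using scikit-learn's KMeans. The three-dimensional vectors below are invented purely for illustration; a real system would derive high-dimensional vectors from corpus co-occurrence counts or learned embeddings:

from sklearn.cluster import KMeans

words = ["chicken", "BBQ", "pizza", "football", "cricket", "swimming"]
vectors = [
    [0.9, 0.1, 0.0],  # chicken  (made-up "food-like" coordinates)
    [0.8, 0.2, 0.1],  # BBQ
    [0.9, 0.0, 0.1],  # pizza
    [0.1, 0.9, 0.8],  # football (made-up "sport-like" coordinates)
    [0.0, 0.8, 0.9],  # cricket
    [0.2, 0.9, 0.7],  # swimming
]

# group the points purely by their relative positions
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for word, label in zip(words, labels):
    print(word, "-> cluster", label)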

This Google search lists several techniques used in NLP, and you could learn a whole lot from them.

EDIT: To solve just this problem, one might crawl the web for sentences of the form "_ is a _" to build up a database of item-category relationships, then parse a string like the ones above and look for words that are known items in the database.
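A rough sketch of that idea in Python, with a hard-coded string standing in for crawled web text:

import re

corpus = """Chicken is a food. Basketball is a sport.
Swimming is a sport. Pizza is a food."""

# harvest item-category pairs from "X is a Y" sentences
is_a = re.compile(r"(\w+) is an? (\w+)", re.IGNORECASE)
categories = {item.lower(): cat.lower() for item, cat in is_a.findall(corpus)}

# tag words in a new sentence against the harvested database
sentence = "Today I went swimming and ate chicken."
found = {categories[w] for w in re.findall(r"\w+", sentence.lower())
         if w in categories}
print(found)  # {'sport', 'food'}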

Nathan
+1  A: 

The question you ask about is a whole area of research called topical text categorization. A great overview of techniques is "Machine learning in automated text categorization" by Fabrizio Sebastiani, in ACM Computing Surveys. One of the simplest techniques (though not necessarily the best performing) is to gather numerous (hundreds of) example sentences in each category and then train a Naive Bayesian classifier on those sample sentences. NLTK contains a Naive Bayesian classifier in the module nltk.classify.naivebayes.
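For illustration, a minimal sketch of that approach with NLTK's Naive Bayesian classifier, using bag-of-words features. Two example sentences per category are far too few for real use; they are only here to show the shape of the API:

import nltk

def features(sentence):
    # bag-of-words features: presence of each lowercased token
    return {word.lower(): True for word in sentence.split()}

train = [
    (features("I love to eat chicken"), "FOOD"),
    (features("We had pizza and salad for dinner"), "FOOD"),
    (features("Today I went running swimming and played basketball"), "SPORTS"),
    (features("He scored a goal in the football match"), "SPORTS"),
]

classifier = nltk.classify.NaiveBayesClassifier.train(train)
print(classifier.classify(features("She cooked rice and chicken")))  # FOOD, on this toy data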

Ken Bloom
A: 

You might take a look at the WordNet Domains resource from FBK. It is an extension of WordNet designed for text categorization and word sense disambiguation, and it allows varying degrees of granularity.

http://wndomains.fbk.eu/

One possible way to apply it to your task: extract NP chunks from your sentences, take their head words, and look up those head words' categories in WordNet Domains.
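A rough sketch of that pipeline with NLTK. The domains dictionary below is a hypothetical stand-in for a real lookup against the WordNet Domains data files, which map WordNet synsets to domain labels:

import nltk

# hypothetical stand-in for the real WordNet Domains lookup
domains = {"chicken": "gastronomy", "basketball": "sport", "swimming": "sport"}

# simple regex chunker for noun phrases; requires NLTK's tokenizer and
# tagger models (nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'))
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")

tagged = nltk.pos_tag(nltk.word_tokenize("I love to eat chicken."))
for subtree in chunker.parse(tagged).subtrees(filter=lambda t: t.label() == "NP"):
    head = subtree.leaves()[-1][0].lower()  # take the last word as the head noun
    print(head, "->", domains.get(head, "unknown"))  # chicken -> gastronomy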

Aliaksandr Autayeu