Hi,

Is there a partition of English words into high-level categories, such as sports, basketball, etc.? I need it for my project.

Is this data available somewhere? I am fine with words overlapping across categories.

Thank you, Bala

+4  A: 

WordNet

The hierarchy of hypernyms/hyponyms in WordNet will give you categorizations of words at varying degrees of specificity.

Borrowing the example from Wikipedia, you can extract the following set of categories for the word "dog".

dog, domestic dog, Canis familiaris
   => canine, canid
      => carnivore
         => placental, placental mammal, eutherian, eutherian mammal
            => mammal
               => vertebrate, craniate
                  => chordate
                     => animal, animate being, beast, brute, creature, fauna
                        => ...
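
As a rough illustration of walking such a chain, here is a minimal sketch in plain Python. The `HYPERNYM` map and the `hypernym_chain` helper are hypothetical names made up for this example; the map is a hand-copied, abbreviated version of the "dog" listing above, not data read from WordNet itself. A real project would look the parents up through a WordNet interface instead.

```python
# Toy child -> parent links, hand-copied (first name only) from the "dog"
# listing above. This is an assumption for illustration, not WordNet data.
HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "carnivore": "placental",
    "placental": "mammal",
    "mammal": "vertebrate",
    "vertebrate": "chordate",
    "chordate": "animal",
}


def hypernym_chain(word, hypernym=HYPERNYM):
    """Return [word, parent, grandparent, ...] up to the top of the map."""
    chain = [word]
    seen = {word}
    while chain[-1] in hypernym:
        parent = hypernym[chain[-1]]
        if parent in seen:  # guard against cycles in hand-built data
            break
        chain.append(parent)
        seen.add(parent)
    return chain


chain = hypernym_chain("dog")
print(" => ".join(chain))
```

Every element of the chain is a candidate category for "dog", from the most specific (canine) to the most general one in the toy map (animal).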

Latent Dirichlet Allocation

If the words you want to categorize are not covered by WordNet, you could use latent Dirichlet allocation (LDA) to automatically induce semantic categories for the words.
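
To give a feel for what LDA does, here is a small sketch of a collapsed Gibbs sampler in plain Python. It is a teaching toy under stated assumptions, not a replacement for the packages mentioned in this answer: the function name `lda_gibbs`, the hyperparameter defaults, and the tiny documents are all made up for illustration.

```python
import random
from collections import defaultdict


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs is a list of token lists. Returns one {word: count} dict per topic,
    counting how many tokens of each word ended up assigned to that topic.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]           # n(d, k)
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # n(k, w)
    topic_total = [0] * num_topics                         # n(k)
    assign = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample from the conditional distribution over topics.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    return [{w: c for w, c in tw.items() if c > 0} for tw in topic_word]


# Made-up toy corpus: each inner list is one "document".
docs = [
    ["ball", "goal", "team", "score", "ball"],
    ["team", "coach", "goal", "score"],
    ["dog", "cat", "pet", "animal"],
    ["cat", "animal", "dog", "pet", "dog"],
]
topics = lda_gibbs(docs, num_topics=2)
for k, words in enumerate(topics):
    print(k, sorted(words, key=words.get, reverse=True))
```

Each induced topic can then be treated as a category, with a word belonging to every topic in which it has a nonzero count. On a corpus this small the result is not meaningful; the sketch only shows the mechanics.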

Packages

WordNet is available for download from Princeton University's WordNet project site.

For LDA, you can use either the Stanford Topic Modeling Toolbox (Java) or David Blei's lda-c (C).

dmcer
WordNet is definitely going to be the best available choice for this. The difficulty will be in deciding what you consider to be "high-level categories". For the example above, dog has 11 hypernym levels, and some of its hyponyms extend down a level or two as well. In contrast, the sport synset has 3 hypernym levels, and some of its hyponyms extend down farther (e.g. contact sport->football->soccer).
msbmsb
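
One possible way around the uneven depths, sketched below in plain Python, is to pick the category nodes by hand and map each word to the first of them found while climbing its hypernym links. Everything here is an assumption for illustration: `HYPERNYM`, `CATEGORIES`, and `category_of` are hypothetical names, and the toy links are abbreviated from the two chains mentioned in this thread rather than read from WordNet.

```python
# Toy child -> parent links, abbreviated from the chains in this thread.
# Hand-built for illustration; not loaded from WordNet.
HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "carnivore": "placental",
    "placental": "mammal",
    "mammal": "vertebrate",
    "vertebrate": "chordate",
    "chordate": "animal",
    "soccer": "football",
    "football": "contact sport",
    "contact sport": "sport",
}

# Hypothetical choice of what counts as "high level" for a given project.
CATEGORIES = {"animal", "sport"}


def category_of(word, hypernym=HYPERNYM, categories=CATEGORIES):
    """Climb the hypernym links until a chosen category node is reached."""
    node = word
    seen = set()
    while node is not None and node not in seen:
        if node in categories:
            return node
        seen.add(node)
        node = hypernym.get(node)
    return None  # no chosen category lies above this word


for w in ["dog", "soccer", "football", "teapot"]:
    print(w, "->", category_of(w))
```

Because the cut is defined by named nodes rather than by a fixed depth, it does not matter that "dog" sits many levels below "animal" while "soccer" sits only a few below "sport".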
+1  A: 

Here are many automatically induced word clusters:
http://metaoptimize.com/projects/wordreprs/

Enjoy!

In particular, you could try using the Brown clusters, which will group words into a tree.
Joseph Turian
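
A common way to use a Brown cluster tree is to identify each word by its bit-string path from the root and truncate that path to get coarser groups. The sketch below shows the idea in plain Python. The `PATHS` data and the `group_by_prefix` helper are hypothetical, made up for illustration; real paths would come from the output of a Brown clustering run.

```python
from collections import defaultdict

# Hypothetical word -> bit-string paths in a cluster tree (made-up values).
PATHS = {
    "soccer": "0010",
    "football": "0011",
    "tennis": "000",
    "dog": "110",
    "cat": "111",
    "apple": "10",
}


def group_by_prefix(paths, length):
    """Group words whose paths share the same first `length` bits."""
    groups = defaultdict(list)
    for word, bits in sorted(paths.items()):
        groups[bits[:length]].append(word)
    return dict(groups)


coarse = group_by_prefix(PATHS, 1)  # two very broad groups
fine = group_by_prefix(PATHS, 2)    # one level further down the tree
print(coarse)
print(fine)
```

Shorter prefixes give fewer, broader categories and longer prefixes give more specific ones, so the prefix length plays the role of the "level" in the category hierarchy.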