Hi,
Is there a partition of English words into high-level categories such as sports, basketball, etc.? It's required for my project.
Is this data available somewhere? I am okay with words overlapping across categories.
Thank you, Bala
WordNet
The hierarchy of hypernyms/hyponyms in WordNet will give you categorizations of words at varying degrees of specificity.
Borrowing the example from Wikipedia, you can extract the following set of categories for the word "dog".
dog, domestic dog, Canis familiaris
  => canine, canid
    => carnivore
      => placental, placental mammal, eutherian, eutherian mammal
        => mammal
          => vertebrate, craniate
            => chordate
              => animal, animate being, beast, brute, creature, fauna
                => ...
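To make the idea concrete, here is a minimal Python sketch of how such a chain can be turned into categories of varying specificity. It does not query WordNet: the `HYPERNYM` table is a hand-copied toy version of the "dog" chain above (first name of each line only), and the function names `hypernym_chain` and `category_at` are made up for illustration. With a real WordNet interface you would look up the hypernyms instead of hard-coding them.

```python
# Toy hypernym table copied from the "dog" chain above (assumption: a real
# project would look these links up in WordNet instead of hard-coding them).
HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "carnivore": "placental",
    "placental": "mammal",
    "mammal": "vertebrate",
    "vertebrate": "chordate",
    "chordate": "animal",
}


def hypernym_chain(word, table=HYPERNYM):
    """Return the increasingly general categories above `word`, nearest first."""
    chain = []
    seen = {word}
    while word in table:
        word = table[word]
        if word in seen:  # guard against accidental cycles in the table
            break
        seen.add(word)
        chain.append(word)
    return chain


def category_at(word, levels_up, table=HYPERNYM):
    """Pick the category `levels_up` steps above `word`, clamped to the chain."""
    chain = hypernym_chain(word, table)
    if not chain:
        return None  # the word has no known hypernym
    index = max(1, min(levels_up, len(chain))) - 1
    return chain[index]


print(hypernym_chain("dog"))
print(category_at("dog", 4))
```

Choosing a small `levels_up` gives fine-grained categories, and a larger one gives coarse categories, which is one way to get the "high-level" partition asked about.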
Latent Dirichlet Allocation
If the words you want to categorize are not covered by WordNet, you could use latent Dirichlet allocation (LDA) to automatically induce semantic categories for the words.
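As a rough illustration of what LDA does, here is a small, dependency-free Python sketch. It uses collapsed Gibbs sampling, which is one common way to fit LDA and is not necessarily what any particular toolbox implements. The function names (`lda_gibbs`, `top_words`), the hyperparameter defaults, and the toy documents are all assumptions made for this example; for real data you would want an established package.

```python
import random
from collections import defaultdict


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on a list of tokenized documents.

    Returns (topic_word, assignments): topic_word[k][w] is how many tokens of
    word w are assigned to topic k; assignments[d][i] is the topic of token i
    in document d.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [defaultdict(int) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return topic_word, assignments


def top_words(topic_word, n=3):
    """List the n most frequent words of each topic (ties broken by spelling)."""
    return [
        sorted((w for w in counts if counts[w] > 0),
               key=lambda w: (-counts[w], w))[:n]
        for counts in topic_word
    ]


# Toy corpus (made up): two sports-like and two politics-like documents.
toy_docs = [
    ["ball", "goal", "team"],
    ["vote", "party", "law"],
    ["team", "ball", "score"],
    ["law", "vote", "court"],
]
toy_topics, _ = lda_gibbs(toy_docs, num_topics=2, iters=50)
print(top_words(toy_topics))
```

Each induced topic acts as a category, and a word can appear in several topics, which fits the requirement that overlap across categories is acceptable. The topics come out unlabeled, so names like "sports" would still have to be assigned by hand.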
Packages
WordNet is available for download from the Princeton University WordNet site.
For LDA, you can use either the Stanford Topic Modeling Toolbox (Java) or David Blei's lda-c (C).
Here are many automatically induced word clusters:
http://metaoptimize.com/projects/wordreprs/
Enjoy!