For example...

Chicken is an animal.
Burrito is a food.

WordNet lets you do "is-a" lookups... the hierarchy feature.

However, how do I know when to stop travelling up the tree? I want a LEVEL that is consistent.

For example, if presented with a bunch of words, I want WordNet to categorize all of them, but at a certain level, so it doesn't go too far up. Categorizing "burrito" as a "thing" is too broad, yet "mexican wrapped food" is too specific. I want to go up or down the hierarchy until I reach the right LEVEL.

+5  A: 

WordNet is a lexicon rather than an ontology, so 'levels' don't really apply.

There is SUMO, an upper ontology that relates to WordNet, if you want a directed lattice instead of a network.

For some domains, SUMO's mid-level ontology is probably where you want to look, but I'm not sure it has 'mexican wrapped food', as most of its topics are scientific or engineering.

WordNet's hierarchy is

beef burrito < burrito < dish/2 < victuals < food < substance < entity.

Entity is a top-level concept, so if you stop one below substance you'll get burrito isa food. You can calculate a level based on that, though it won't necessarily be as consistent as SUMO, or you can define your own set of useful mid-level concepts to terminate at. There is no 'mexican wrapped food' step in WordNet.
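To make the "stop at a fixed distance below the root" idea concrete, here is a minimal Python sketch. It does not call WordNet at all: the `HYPERNYM` table is just the chain quoted above typed in by hand, and `path_from_root` and `category_at_depth` are names made up for this illustration. With a real WordNet binding you would build the path from the library's hypernym lookup instead.

```python
# Toy hypernym map copied from the chain in this answer (assumption: one
# parent per word; real WordNet synsets can have several hypernyms).
HYPERNYM = {
    "beef burrito": "burrito",
    "burrito": "dish",
    "dish": "victuals",
    "victuals": "food",
    "food": "substance",
    "substance": "entity",
}


def path_from_root(word):
    """Return the chain from the root ('entity') down to `word`."""
    path = [word]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    path.reverse()
    return path


def category_at_depth(word, depth):
    """Pick the ancestor `depth` steps below the root.

    If the word sits closer to the root than `depth`, the word itself
    is returned, since there is nothing more specific to report.
    """
    path = path_from_root(word)
    return path[min(depth, len(path) - 1)]


if __name__ == "__main__":
    for w in ("beef burrito", "burrito", "substance"):
        print(w, "->", category_at_depth(w, 2))
```

Depth 2 here means "one below substance", which is what gives burrito isa food. Whether a single fixed depth is meaningful across unrelated branches is exactly the consistency caveat mentioned above.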

Pete Kirkham
Most of SUMO is science or engineering? It does not contain every-day words like foods, people, cars, jobs, etc?
TIMEX
SUMO is an upper ontology. The mid-level ontologies (where you would find concepts between 'thing' and 'beef burrito') listed on the page don't include food, but reflect the sorts of organisations which fund the project. There is a mid-level ontology for people. There's also one for industries (and hence jobs), including food suppliers, but no mention of burritos if you grep it.
Pete Kirkham
Thanks, Pete.
TIMEX
+1  A: 

In order to get levels, you need to predefine the content of each level. An ontology often defines these as the immediate IS_A children of a specific concept, but if that is absent, you need to develop a method for that yourself.

The next step is to put a priority on each concept, in case you want to present only one category for each word. The priority can be set in multiple ways, for instance as the count of IS_A relations between the category and the word, or as a manually selected priority for each category. For each word, you can then pick the category with the highest priority. For instance, you may want meat to be "food" rather than "chemical substance".

You may also want to pick some words that change the priority if they are in the path. For instance, you may want some chemicals that are also food to be reported as chemicals, while others should still be food.
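A rough Python sketch of the priority scheme described above, using manually selected priorities. Everything in it is hypothetical illustration data: the `IS_A` edges, the `PRIORITY` numbers, the `OVERRIDES` table and the function names are invented for the example and are not taken from WordNet or any ontology.

```python
# Hypothetical IS_A edges; a word may have more than one parent.
IS_A = {
    "beef": ["meat"],
    "meat": ["food", "chemical substance"],
    "table salt": ["salt"],
    "salt": ["food", "chemical substance"],
}

# Manually selected priority per category (higher wins).
PRIORITY = {"food": 2, "chemical substance": 1}

# Words that force a category whenever they appear in the path.
OVERRIDES = {"salt": "chemical substance"}


def ancestors(word):
    """Collect every concept reachable from `word` via IS_A edges."""
    seen = []
    stack = [word]
    while stack:
        node = stack.pop()
        for parent in IS_A.get(node, []):
            if parent not in seen:
                seen.append(parent)
                stack.append(parent)
    return seen


def categorize(word):
    """Return the single category to present for `word`, or None."""
    path = [word] + ancestors(word)
    # An override word in the path changes the outcome outright.
    for node in path:
        if node in OVERRIDES:
            return OVERRIDES[node]
    candidates = [c for c in path if c in PRIORITY]
    if not candidates:
        return None
    return max(candidates, key=PRIORITY.get)


if __name__ == "__main__":
    for w in ("beef", "table salt", "burrito"):
        print(w, "->", categorize(w))
```

Here beef comes out as "food" because food outranks chemical substance, while table salt is reported as "chemical substance" because salt is listed as an override. A word with no known category returns None.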

Lars D
+3  A: 

[Please credit Pete Kirkham, who was first to bring up SUMO, which may well answer the question asked by Alex, the OP]

(I'm just providing some complementary information here; I started in a comment field but soon ran out of space and layout capabilities...)

Alex: Most of SUMO is science or engineering? It does not contain every-day words like foods, people, cars, jobs, etc?
Pete K: SUMO is an upper ontology. The mid-level ontologies (where you would find concepts between 'thing' and 'beef burrito') listed on the page don't include food, but reflect the sorts of organisations which fund the project. There is a mid-level ontology for people. There's also one for industries (and hence jobs), including food suppliers, but no mention of burritos if you grep it.

My two cents
100% of WordNet (3.0, i.e. the latest, as well as older versions) is mapped to SUMO, and that may be just what Alex needs. The mid-level ontologies associated with SUMO (or rather with MILO) do cover only specific domains, and do not, at this time, include Foodstuff. But since WordNet does include all (well, many) of these everyday things, you do not need to leverage any formal ontology "under" SUMO. You can instead use SUMO's WordNet mapping, possibly in addition to WordNet itself, which, again, is not an ontology but may also help with its informal and loose "hierarchy".

Some difficulties may arise, however, in two areas (and then some ;-) ):

  • the SUMO ontology's "level" may not be the level you'd have in mind for your particular application. For example, while "Burrito" maps to "Food", a top-level entity in SUMO, "Chicken" maps to, well, "Chicken", which only reaches "Animal" through a long chain (specifically: Chicken->Poultry->Bird->Warm_Blooded_Vertebrate->Vertebrate->Animal).
  • WordNet's coverage and metadata are impressive, but with regard to the mid-level concepts they can be a bit inconsistent. For example, "our" Burrito's hypernym is appropriately "Dish", which groups it with circa 140 food dishes, including generics such as "Soup" or "Casserole" as well as "Chicken Marengo" (but omitting, say, "Chicken Cacciatore").

My point in bringing up these issues is not to criticize WordNet or SUMO and its related ontologies, but simply to illustrate some of the challenges associated with building ontologies, particularly at the mid-level.

Regardless of some possible flaws and gaps in a solution based on SUMO and WordNet, a pragmatic use of these frameworks may well "fit the bill" (85% of the time).

mjv
Thank you for clarification. If my objective was to scan a document and see what food, jobs, hobbies, interests that person has...how would you advise that I go about this? Would it be best to find a word-list of "food" and a word-list of "hobbies" and "sports"? What's the most Practical way of doing this?
TIMEX
@Alex: Because you are targeting relatively few domains, I'd consider developing your own lexicons. You could "prime" these by extracting them from the SUMO WordNet map or similar sources. You'll probably also need to build a list of named entities (such as artists, athletes, cities, particular venues etc.). Although building such lists isn't inexpensive, you'll find that the resulting reduced domain allows much sloppier logic/heuristics for similar (or typically better) precision and recall in the tagging.
mjv
A: 

WordNet's hypernym tree ends with a single root synset for the word "entity". If you are using WordNet's C library, then you can get a whole recursive structure for a synset's ancestors using traceptrs_ds, and you can get the whole synset tree by recursively following nextss and ptrlist pointers until you hit null pointers.
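The traversal pattern itself is easy to sketch. The following is Python, not the C API: the `Synset` class below is a hypothetical stand-in with two link fields named after the pointers mentioned above, and `walk` is an invented helper. It only shows the shape of the recursion ("follow the sibling pointer at this level, recurse through the other pointer for the next level, stop at null"), under the assumption that the real structure is linked this way.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Synset:
    """Toy stand-in for a linked synset record (not the real C struct)."""
    word: str
    nextss: Optional["Synset"] = None   # next synset at the same level
    ptrlist: Optional["Synset"] = None  # first synset one level further on


def walk(node: Optional[Synset], depth: int = 0) -> List[Tuple[int, str]]:
    """Visit every synset reachable from `node`, recording its depth."""
    visited = []
    while node is not None:              # stop on the null pointer
        visited.append((depth, node.word))
        visited.extend(walk(node.ptrlist, depth + 1))
        node = node.nextss
    return visited


if __name__ == "__main__":
    # Hand-built ancestor chain: food -> substance -> entity.
    entity = Synset("entity")
    substance = Synset("substance", ptrlist=entity)
    food = Synset("food", ptrlist=substance)
    for depth, word in walk(food):
        print("  " * depth + word)
```

The depth recorded for each visited node is what you would use to decide where to cut off the walk.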

Ken Bloom