I am looking for a method to build a hierarchy of words.

Background: I am an "amateur" natural language processing enthusiast, and one of the problems I am currently interested in is determining the semantic hierarchy among a group of words.

For example, suppose I have a set in which one word is a "super" representation of the others, i.e.

[cat, dog, monkey, animal, bird, ... ]

I am interested in any technique that would let me extract the word 'animal', which is the most meaningful and accurate representation of the other words in this set.

Note: they are NOT the same in meaning. cat != dog != monkey != animal BUT cat is a subset of animal and dog is a subset of animal.

I know by now a lot of you will be telling me to use WordNet. I will try it, but I am actually working in a very domain-specific area where WordNet doesn't apply because: 1) most of the words are not found in WordNet; 2) all the words are in another language; translation is possible, but only to limited effect.

Another example would be:

[ noise reduction, focal length, flash, functionality, .. ]

So 'functionality' includes everything else in this set.

I have also tried crawling Wikipedia pages and applying techniques such as tf-idf, but the Wikipedia pages don't really help much either.

Can someone possibly enlighten me as to what direction my research should take? (I could use anything.)

+1  A: 

The opinion mining and sentiment analysis folks might be doing related things, in terms of deciding what words represent features of products, without knowing anything about the products.

A quick sketch of an idea for how you might do this, which I've totally made up on the spot: Parse a bunch of sentences in the relevant domain; find the noun phrases and adjectives. Figure out which noun phrases are associated with which adjectives. Cluster the noun phrases together based on the set of adjectives used to describe them. Animals will tend to cluster together because they're going to be described by adjectives like "furry" or "cute", etc. (In particular, hierarchical clustering would probably be most appropriate.)
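To make the clustering step a bit more concrete, here is a minimal Python sketch. It assumes the parsing has already been done and that you have (noun, adjective) pairs in hand; the pairs in `PAIRS`, the function names, and the choice of cosine similarity with average linkage are all my own assumptions for illustration, not a tested recipe.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

# Hypothetical (noun, adjective) pairs, as if produced by a parser.
PAIRS = [
    ("cat", "furry"), ("cat", "cute"),
    ("dog", "furry"), ("dog", "loyal"), ("dog", "cute"),
    ("monkey", "furry"), ("monkey", "clever"),
    ("lens", "sharp"), ("lens", "expensive"),
    ("flash", "bright"), ("flash", "expensive"),
    ("sensor", "sharp"), ("sensor", "expensive"),
]


def adjective_profiles(pairs):
    """Map each noun to a Counter of the adjectives seen with it."""
    profiles = {}
    for noun, adjective in pairs:
        profiles.setdefault(noun, Counter())[adjective] += 1
    return profiles


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster(profiles, n_clusters):
    """Average-linkage agglomerative clustering, stopped at n_clusters."""
    clusters = [frozenset([noun]) for noun in sorted(profiles)]

    def linkage(c1, c2):
        sims = [cosine(profiles[a], profiles[b]) for a in c1 for b in c2]
        return sum(sims) / len(sims)

    while len(clusters) > n_clusters:
        # Merge the two most similar clusters.
        c1, c2 = max(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return clusters


profiles = adjective_profiles(PAIRS)
for group in cluster(profiles, 2):
    print(sorted(group))
```

On this toy data the animal nouns and the camera nouns end up in separate groups. With real text you would want far more pairs and probably a library implementation of hierarchical clustering instead of this quadratic loop.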

If you try this, and it works, let me know. :)

Jay Kominek
The OP said that the sets already exist and that the task is to find the most representative element of a set. What you suggest is not an answer to the question. That being said, adjectives alone will not help in clustering semantically similar nouns because most are just too widely applicable, e.g. 'cute' could be applied to girls, pieces of music, movies, social situations, etc. You need a lot more context such as typical nouns and verbs to make your idea work with at least some accuracy.
ferdystschenko
It certainly does appear that I misread the question. That said, I don't think that broadly applicable adjectives would be a very big deal. If everything is close in the 'cute' dimension, then 'cute' will just end up not having much effect on the clusters.
Jay Kominek
What I was trying to say was that I doubt there are adjectives discriminative for most kinds of concept clusters. Plus I don't see a reason why you should limit your features to adjectives when there are other word classes potentially even more descriptive. E.g. animals may co-occur with nouns and verbs like 'forest', 'zoo', 'prey', 'hunt', etc. For a start, I wouldn't even parse the sentences but use a simple n-gram (perhaps even unigram) approach.
ferdystschenko
+3  A: 

It looks like you want to use something like the hypernym/hyponym relationships in WordNet, but without actually using WordNet due to language and domain specific coverage issues? That is, if you had the domain specific hypernym relationships, you could get the "super" representation by just looking for the nearest parent that subsumed all of the words in the list, or the nearest node that was equal to one of the list words and subsumed all of the others.
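As a rough illustration of that lookup, here is a minimal Python sketch. It assumes you already have the hypernym relationships as a simple child-to-parent mapping with a single parent per word and no cycles; the `HYPERNYM` table and the function names are made up for the example.

```python
# Hypothetical domain-specific hypernym links: child -> parent.
# Assumed to form a tree (one parent per word, no cycles).
HYPERNYM = {
    "cat": "animal",
    "dog": "animal",
    "monkey": "animal",
    "bird": "animal",
    "animal": "organism",
    "tree": "plant",
    "plant": "organism",
}


def ancestors(word, hypernym):
    """Return [word, parent, grandparent, ...] by following hypernym links."""
    chain = [word]
    while chain[-1] in hypernym:
        chain.append(hypernym[chain[-1]])
    return chain


def super_representation(words, hypernym):
    """Return the nearest node that subsumes every word, or None.

    A word counts as subsuming itself, so the result may be one of
    the input words (e.g. 'animal' in the question's example).
    """
    chains = [ancestors(word, hypernym) for word in words]
    common = set(chains[0]).intersection(*map(set, chains[1:]))
    # Walk the first chain from most specific to most general;
    # the first shared node is the nearest common subsumer.
    for node in chains[0]:
        if node in common:
            return node
    return None


print(super_representation(["cat", "dog", "monkey", "animal", "bird"], HYPERNYM))
print(super_representation(["cat", "tree"], HYPERNYM))
```

If a word can have several hypernyms, the mapping becomes a graph and you would need to compare ancestor sets by depth rather than walking a single chain.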

To start, I would first point out that WordNets are actually available for many of the world's major languages; see the list at Global WordNet.

To get domain-specific hypernym relationships, you could use the technique presented in Snow et al.'s Learning syntactic patterns for automatic hypernym discovery. That is, you could start off with a small list of seed hypernyms, and then use them to train a classifier to detect hypernyms in a corpus. You would then run this classifier over data from your domain in order to build a list of domain-specific hypernym pairs.
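To give a feel for the seed-and-bootstrap idea, here is a heavily simplified Python sketch. It is not the method from the paper: instead of training a classifier on syntactic features, it just records the literal word sequence found between known seed pairs and reuses those strings as exact-match patterns. The seed pairs, the corpus sentences, and the function names are all invented for illustration, and it only handles single-word terms.

```python
import re
from collections import Counter

# Hypothetical seed pairs: (hyponym, hypernym).
SEEDS = {("cat", "animal"), ("dog", "animal")}

# Hypothetical corpus: seed-bearing sentences plus domain sentences.
CORPUS = [
    "a cat is a kind of animal",
    "an animal such as a dog",
    "flash is a kind of functionality",
    "a functionality such as a zoom",
]


def learn_patterns(corpus, seeds):
    """Collect the word sequences that appear between seed pairs.

    Each pattern is (text_between, hyponym_comes_first).
    """
    patterns = Counter()
    for sentence in corpus:
        tokens = sentence.lower().split()
        for hypo, hyper in seeds:
            if hypo in tokens and hyper in tokens:
                i, j = tokens.index(hypo), tokens.index(hyper)
                lo, hi = sorted((i, j))
                between = " ".join(tokens[lo + 1:hi])
                if between:
                    patterns[(between, i < j)] += 1
    return patterns


def apply_patterns(corpus, patterns):
    """Return (hyponym, hypernym) pairs matched by the learned patterns."""
    pairs = set()
    for sentence in corpus:
        text = sentence.lower()
        for between, hypo_first in patterns:
            regex = r"\b(\w+) " + re.escape(between) + r" (\w+)\b"
            for left, right in re.findall(regex, text):
                pairs.add((left, right) if hypo_first else (right, left))
    return pairs


patterns = learn_patterns(CORPUS, SEEDS)
for hypo, hyper in sorted(apply_patterns(CORPUS, patterns)):
    print(hypo, "->", hyper)
```

Exact string patterns like these are brittle and will produce false positives on real text, which is the reason for training a classifier over many pattern features instead.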

dmcer