I am trying to find words (specifically physical objects) related to a single word. For example:

Tennis: tennis racket, tennis ball, tennis shoe

Snooker: snooker cue, snooker ball, chalk

Chess: chessboard, chess piece

Bookcase: book

I have tried to use WordNet, specifically the meronym semantic relationship; however, this method is not consistent as the results below show:

Tennis: serve, volley, foot-fault, set point, return, advantage

Snooker: nothing

Chess: chess move, checkerboard (whose own meronym relationships show 'square' and 'diagonal')

Bookcase: shelve

Weighting of terms will eventually be required, but that is not really a concern now.

Anyone have any suggestions on how to do this?


Just an update: Ended up using a mixture of both Jeff's and StompChicken's answers.

The quality of information retrieved from Wikipedia is excellent. Unsurprisingly, it contains far more relevant information than some corpora, in which terms such as 'blog' and 'ipod' do not exist.

The range of results from Wikipedia is the best part. The software is able to match terms such as (lists cut for brevity):

  • golf: [ball, iron, tee, bag, club]
  • photography: [camera, film, photograph, art, image]
  • fishing: [fish, net, hook, trap, bait, lure, rod]

The biggest problem is classifying certain words as physical artefacts; default WordNet is not a reliable resource as many terms (such as 'ipod', and even 'trampolining') do not exist in it.

+3  A: 

In the first case, you probably are looking for n-grams where n = 2. You can get them from places like Google or create your own from all of Wikipedia.
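
A minimal sketch of the bigram idea, assuming you already have some plain-text corpus to count over. The `tokenize` and `following_words` helpers and the toy `corpus` are made up for illustration; a real run would use Wikipedia text or a published n-gram data set instead.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def following_words(target, documents, min_count=2):
    """Count the words that directly follow `target` (bigrams, n = 2).

    Pairs seen fewer than `min_count` times are dropped as likely noise.
    """
    counts = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        # zip(tokens, tokens[1:]) yields every adjacent word pair.
        for first, second in zip(tokens, tokens[1:]):
            if first == target:
                counts[second] += 1
    return [(word, n) for word, n in counts.most_common() if n >= min_count]


# Hypothetical toy corpus, only here to make the example runnable.
corpus = [
    "She bought a tennis racket and a tennis ball.",
    "The tennis ball bounced off the tennis racket.",
    "He plays tennis on Sundays.",
]

print(following_words("tennis", corpus))
```

On a real corpus the `min_count` threshold would need to be much higher, and the raw counts could later serve as the term weights mentioned in the question.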

For more information, check out this related Stack Overflow question.

Jeff Moser
n-grams of 2 are simply all the word pairs that appear together commonly. I think what the poster was asking was about *semantic* relationships.
Avi
The idea is that if two words appear together often enough, there is probably some semantic relationship between them: "tennis racket" has a semantic relationship just like "play tennis" does.
Jeff Moser
That is a correct idea. However, the question was about using WordNet to find semantic relationships, not about using n-grams.
Avi
Sorry if I was unclear, the question is not WordNet-specific. The n-gram method sounds interesting, but I do not see how it can work for my problem, as there is just the single word (such as 'tennis') from which to find relationships.
S0rin
The idea with n-grams is that you could see common words that are near "tennis." The fact that they appear near each other shows there probably is some relationship.
Jeff Moser
I've done something similar to your answer using a combination of word frequencies (from Wikipedia) and semantics (WordNet). Many thanks.
S0rin
Cool! How is it working out for you? What's the quality of the suggestions?
Jeff Moser
(response is in the question section as there isn't enough room here)
S0rin
+5  A: 

I think what you are asking for is a source of semantic relationships between concepts. For that, I can think of a number of ways to go:

  1. Semantic similarity algorithms. These algorithms usually perform a tree walk over the relationships in WordNet to come up with a real-valued score of how related two terms are. These will be limited by how well WordNet models the concepts that you are interested in. WordNet::Similarity (written in Perl) is pretty good.
  2. Try using OpenCyc as a knowledge base. OpenCyc is an open-source version of Cyc, a very large knowledge base of 'real-world' facts. It should have a much richer set of semantic relationships than WordNet does. However, I have never used OpenCyc so I can't speak to how complete it is, or how easy it is to use.
  3. n-gram frequency analysis. As mentioned by Jeff Moser. A data-driven approach that can 'discover' relationships from large amounts of data, but can often produce noisy results.
  4. Latent Semantic Analysis. A data-driven approach similar to n-gram frequency analysis that finds sets of semantically related words.

[...]

Judging by what you say you want to do, I think the last two options are more likely to be successful. If the relationships are not in WordNet then semantic similarity won't work, and OpenCyc doesn't seem to know much about snooker other than the fact that it exists.

I think a combination of both n-grams and LSA (or something like it) would be a good idea. N-gram frequencies will find concepts tightly bound to your target concept (e.g. tennis ball) and LSA would find related concepts mentioned in the same sentence/document (e.g. net, serve). Also, if you are only interested in nouns, filtering your output to contain only nouns or noun phrases (by using a part-of-speech tagger) might improve results.
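
To illustrate the "mentioned in the same document" half of that combination, here is a rough sketch. Note that it is not LSA proper: LSA would apply a singular value decomposition to a term-document matrix, whereas this simpler stand-in scores document-level co-occurrence with pointwise mutual information (PMI). The `related_by_pmi` helper and the toy `docs` list are assumptions made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Return the set of distinct lowercase words in one document."""
    return set(re.findall(r"[a-z]+", text.lower()))


def related_by_pmi(target, documents, min_docs=2):
    """Rank words by PMI with `target`, using document-level co-occurrence.

    PMI = log( P(target, w) / (P(target) * P(w)) ), with probabilities
    estimated as fractions of documents. Words sharing fewer than
    `min_docs` documents with the target are skipped as unreliable.
    """
    doc_sets = [tokenize(d) for d in documents]
    total = len(doc_sets)
    doc_freq = Counter(w for words in doc_sets for w in words)
    if doc_freq[target] == 0:
        return []
    joint_freq = Counter(
        w for words in doc_sets if target in words for w in words if w != target
    )
    scores = {}
    for word, joint in joint_freq.items():
        if joint >= min_docs:
            scores[word] = math.log(
                (joint * total) / (doc_freq[target] * doc_freq[word])
            )
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy documents, only here to make the example runnable.
docs = [
    "The fishing rod bent as the fish took the bait.",
    "Fishing with a net needs no bait or rod.",
    "The camera captured the fish in the net.",
    "A camera needs film.",
]

for word, score in related_by_pmi("fishing", docs):
    print(f"{word}: {score:.2f}")
```

PMI is used rather than raw counts because it discounts words that turn up in most documents regardless of topic. The noun filtering suggested above would be a separate step applied to the output, using whichever part-of-speech tagger you prefer.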

StompChicken
Many thanks, your information has given me a lot to investigate.
S0rin
No problem, good luck in what you are trying to do. It's not easy :)
StompChicken