I'm implementing a small dictionary database in which I'd like to search for words based on their lexical/semantic similarity.

For example, "beer" has "sister words" such as soda, lemonade, wine, and champagne, each "different" in a "different direction" (in this example, the first two are "moderate" versions of the idea of "beer", while the latter two are "more extreme" versions).

I know WordNet has an API, but most of the words (and phrases) in my dictionary are related in more informal ways.

(Another example: "gangster" is related to [nun, orphan, rebel] and {criminal, mafia boss, murderer}, where extremity varies from left to right; the words in [] are considered "positive extremities" and the ones in {} are "negative extremities".)

In usage:

  1. User enters search input (a word)
  2. Word is matched with sister words.
  3. User has chance to "finetune word" by altering extremities in at least 2 directions, such as in examples above.

What's the best way to implement such a search -- steps 2 and 3 above?

I'm considering using PHP/MySQL since that is what I am familiar with, but what are better alternatives? Again - keep in mind that this isn't a large dictionary. It's just a selection of common words.


Here's my attempt at answering this - it's very, very basic... improvement suggestions welcome:

MySQL table words:


id (primary key, autoincrement),
word (varchar 75),
relatedword (varchar 75),
relationscore (int 11),
direction (tinyint, -1 or 1)

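Given a $word query and $direction, bound as prepared-statement parameters rather than interpolated into the SQL string (which would be open to SQL injection):

"SELECT relatedword FROM words WHERE word = ? AND direction = ? ORDER BY relationscore DESC"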

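A minimal sketch of the table and lookup described above, written in Python with an in-memory SQLite database as a stand-in for MySQL so it can be run anywhere. The sample rows, the score values, and the helper names `build_demo_db` and `related_words` are all invented for illustration. SQLite uses `?` placeholders; other drivers may use a different placeholder style.

```python
import sqlite3

# Hypothetical sample data: (word, relatedword, relationscore, direction).
# Scores are made up; direction -1 = "more moderate", 1 = "more extreme".
SAMPLE_ROWS = [
    ("beer", "soda", 80, -1),
    ("beer", "lemonade", 60, -1),
    ("beer", "wine", 90, 1),
    ("beer", "champagne", 70, 1),
]


def build_demo_db():
    """Create an in-memory database with the `words` table and sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE words (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            word TEXT NOT NULL,
            relatedword TEXT NOT NULL,
            relationscore INTEGER NOT NULL,
            direction INTEGER NOT NULL CHECK (direction IN (-1, 1))
        )
        """
    )
    conn.executemany(
        "INSERT INTO words (word, relatedword, relationscore, direction) "
        "VALUES (?, ?, ?, ?)",
        SAMPLE_ROWS,
    )
    return conn


def related_words(conn, word, direction):
    """Return related words in one direction, strongest relation first."""
    cursor = conn.execute(
        "SELECT relatedword FROM words "
        "WHERE word = ? AND direction = ? "
        "ORDER BY relationscore DESC",
        (word, direction),  # values are bound, never spliced into the SQL text
    )
    return [row[0] for row in cursor]


if __name__ == "__main__":
    db = build_demo_db()
    print(related_words(db, "beer", 1))
    print(related_words(db, "beer", -1))
```

Because the search term is passed as a bound parameter, input containing quotes is treated as an ordinary string that simply matches no rows.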
A: 

I'm unclear why you think WordNet is inappropriate. I think what you're calling "positive/negative extremities" and "sister words" are what linguists call hypernyms (more general terms) and hyponyms (more specific terms). WordNet includes a reasonably good model of these.

To use WordNet, you'd find "sister" words by "going up" a few levels using the hypernyms('beer') relation. So if you started with "beer", going up 3 levels would give you "beverage". Then you use the hyponyms('beverage') relation to "go down" several levels, to get types of beverages with the same degree of specificity as beer.

Here is an example of WordNet's interface as accessed through NodeBox Linguistics. I believe PHP has an equivalent WordNet interface, although I've never used it.

>>> import en
>>> noun = 'beer'
>>> generalization_depth = 3
>>> sister_words = en.noun.hyponym(en.noun.hypernyms(noun)[generalization_depth][0])
>>> for word in reduce(lambda a,b: a+b, sister_words, []):
...     print word
... 
milk
wish-wash
potion
alcohol
alcoholic beverage
intoxicant
inebriant
hydromel
oenomel
near beer
ginger beer
mixer
cooler
refresher
smoothie
fizz
cider
cyder
cocoa
chocolate
hot chocolate
drinking chocolate
fruit juice
fruit crush
fruit drink
ade
mate
soft drink
coffee
java
tea
tea-like drink
drinking water
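
The same "go up, then come back down" idea can be sketched without WordNet at all, which may suit a small hand-built dictionary. This is only an illustration in Python: the `PARENT` table is a tiny invented taxonomy, not WordNet data, and the function names `climb`, `descend`, and `sister_words` are hypothetical.

```python
from collections import defaultdict

# Toy hypernym table (child -> parent). Invented for illustration only.
PARENT = {
    "beer": "alcohol",
    "wine": "alcohol",
    "champagne": "wine",
    "alcohol": "beverage",
    "soft drink": "beverage",
    "soda": "soft drink",
    "lemonade": "soft drink",
}

# Reverse index (parent -> children), i.e. the hyponym relation.
CHILDREN = defaultdict(list)
for child, parent in PARENT.items():
    CHILDREN[parent].append(child)


def climb(word, levels):
    """Walk up at most `levels` hypernyms; return (ancestor, steps taken)."""
    node, steps = word, 0
    while steps < levels and node in PARENT:
        node = PARENT[node]
        steps += 1
    return node, steps


def descend(root, depth):
    """Return every term exactly `depth` levels below `root`."""
    frontier = [root]
    for _ in range(depth):
        frontier = [c for node in frontier for c in CHILDREN.get(node, [])]
    return frontier


def sister_words(word, levels):
    """Terms at the same depth as `word` under a shared ancestor."""
    ancestor, steps = climb(word, levels)
    return sorted(w for w in descend(ancestor, steps) if w != word)


if __name__ == "__main__":
    print(sister_words("beer", 1))
    print(sister_words("beer", 2))
```

Raising `levels` widens the search: one level up only finds other kinds of alcohol, while two levels up reaches the soft drinks as well.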
Chris S
Well, I guess it also depends on classification. For example, a rebel isn't necessarily "bad", but with murderer/criminal there's a sense of something clearly negative. It's not specificity per se, but an actual degree on (in this case) a "good person"/"bad person" scale. In the milk/beer case, beer would be considered more negative/extreme than the others.
ina
@ina, I see what you mean. Since that's a highly subjective criterion, I don't think you'll find any existing databases with "good/bad" classifications of words.
Chris S