tags:

views:

223

answers:

3

Does anyone know of a good way to calculate the "semantic distance" between two words?

Immediately an algorithm that counts the steps between words in a thesaurus springs to mind.

+3  A: 

OK, looks like a similar question has already been answered: http://stackoverflow.com/questions/62328/.

Ben Aston
A: 

Possible hack: send the two words to Google search, and return the # of pages found.

Die in Sente
Not sure I understand how this would work.
Ben Aston
@Ben - Essentially, what this does is count the number of documents that the words have in common. For words that are highly selective this might have some merit, but for words that aren't good document discriminators, it's possible that you may get a zero correlation for words that are very closely related.
tvanfosson
+2  A: 

The thesaurus idea has some merit. One idea would be to create a graph based on a thesaurus with the nodes being the words and an edge indicating that there they are listed as synonyms in the thesaurus. You could then use a shortest path algorithm to give you the distance between the nodes as a measure of their similarity.

One difficulty here is that some words have different meanings in different contexts. Your algorithm may need to take this into account and use directed links with the weight of the outgoing link dependent on the incoming link being followed (or ignore some outgoing links based on the incoming link).

tvanfosson
Thanks. Yeah, it's a tricky problem, but with your expansion on my thesaurus idea and by focussing on a subset of the language (e.g. just nouns) then intuitively it sounds possible. I don't have the time to implement such a system right now though.
Ben Aston
Thesaurii don't really form graphs, though. Each entry is a "synset" - a set of synonyms, where all words in the set have equivalent meanings. If a word appears in multiple synsets, it's because that word has multiple meanings - so it's not very useful to draw an edge between the two synsets.
Nick Johnson
@Nick - that's not really my area of expertise, but I can see that it would be difficult to construct an accurate graph since words within the entry itself may be closer or further from the target based on semantics. Perhaps using multiple thesauri and adding 1 for each theasurus that contains the pair in a synset.
tvanfosson
What I mean to say is that when the same set of characters ('word') appears in two different synsets, it's not really the same word - it's a different one spelled the same way, or at the very least a different sense. For example, the "mine" in ["mine", "deposit", "supply"] is not the same word as "mine" in ["mine", "dig up"] and not the same as the one in ["mine", "yours"] - so it doesn't make sense to have an edge between them. Without edges between synsets, you just have a large set of small, disjoint graphs.
Nick Johnson
@Nick, again not an expert, but aren't they typically grouped by meaning. Couldn't you use the common words between sets to identify how to choose which set from a word to use in creating the graph? You'd have to identify a word/meaning pair and link those, not merely the words.
tvanfosson
Right, they're grouped by meaning - so two instances of the same word in different synsets mean different things. You could find synsets that share more than one word in common, but you'd still be conflating different meanings if you drew a link between them.
Nick Johnson