I am searching WordNet for synonyms for a big list of words. The way I have done it, when a word has more than one synonym, the results are returned in alphabetical order. What I need is to have them ordered by their probability of occurrence, so that I can take just the top synonym.

I have used the Prolog WordNet database and Syns2Index to convert it into a Lucene index for querying synonyms. Is there a way to get the synonyms ordered by their probabilities this way, or should I use another approach?

Speed is not important; this synonym lookup will not be done online.

A: 

I think you should add another step (given that speed is not important).

From the Lucene index, you should build another dictionary in which each word is mapped to a small object holding only the synonym whose meaning has the highest probability of appearance, together with that meaning and its probability. I.e., given this code:

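import java.util.HashMap;
import java.util.Map;

class Synonym {
    String name;         // the synonym itself
    double probability;  // probability of appearance of its meaning
    String meaning;      // the meaning (sense) the synonym belongs to
}

// Maps each word to its single most probable synonym.
Map<String, Synonym> m = new HashMap<String, Synonym>();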

... you just have to fill it from the Lucene index.
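As a rough sketch of that filling step, assuming the candidates and their scores have already been read from somewhere: the helper name offer, the constructor, and the sample scores below are made up for illustration and do not come from Lucene or WordNet.

```java
import java.util.HashMap;
import java.util.Map;

public class TopSynonymMap {
    // Same fields as the Synonym class above, with a constructor added.
    static class Synonym {
        final String name;
        final double probability;
        final String meaning;

        Synonym(String name, double probability, String meaning) {
            this.name = name;
            this.probability = probability;
            this.meaning = meaning;
        }
    }

    // Hypothetical helper: stores the candidate only if it beats
    // the synonym already kept for this word.
    static void offer(Map<String, Synonym> m, String word, Synonym candidate) {
        Synonym current = m.get(word);
        if (current == null || candidate.probability > current.probability) {
            m.put(word, candidate);
        }
    }

    public static void main(String[] args) {
        Map<String, Synonym> m = new HashMap<>();
        // Made-up scores, for illustration only.
        offer(m, "car", new Synonym("machine", 0.10, "a motor vehicle"));
        offer(m, "car", new Synonym("automobile", 0.70, "a motor vehicle"));
        offer(m, "car", new Synonym("railcar", 0.20, "a wheeled vehicle adapted to rails"));
        System.out.println(m.get("car").name); // prints automobile
    }
}
```

Calling offer once per candidate found in the index leaves exactly one entry per word, so the later lookup is a plain m.get(word).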

Baltasarq
@Baltasarq, I understand the idea. As you said before, what I need seems specific: I know that querying WordNet online returns the synonyms ordered by their probability, but I do not understand how this probability information is stored inside the Prolog database (which I converted into an index with Syns2Index, as you linked before). How do I retrieve that probability information (and is it there at all?) and map it into, e.g., the class you proposed? Thanks!
Julia
Have you browsed this doc? http://wordnet.princeton.edu/wordnet/man/wnsearch.3WN.html
Baltasarq
@Baltasarq: in case you need it one day: http://lyle.smu.edu/~tspell/jaws/doc/edu/smu/tspell/wordnet/impl/file/ReferenceSynset.html#getTagCount%28java.lang.String%29
Julia
+1  A: 

In case someone stumbles upon this thread, this was the way to go (at least for what I needed):

http://lyle.smu.edu/~tspell/jaws/doc/edu/smu/tspell/wordnet/impl/file/ReferenceSynset.html#getTagCount%28java.lang.String%29

The tagCount method gives the most likely synset group for every word. The remaining problem is that the synset with the highest probability can itself contain several words, and I guess there is no way to avoid this.
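One possible tie-break, sketched here only as an idea: since getTagCount in the linked doc takes a word form, the words inside the winning synset could themselves be ranked by their own counts. The code below does not call JAWS; the class name, the pick helper, and the counts are invented for illustration, and the map stands in for whatever per-word counts were collected.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TopWordForm {
    // Returns the word form with the highest tag count, skipping the query word.
    // On a tie, the form encountered first in the map's iteration order wins.
    static String pick(String query, Map<String, Integer> tagCounts) {
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : tagCounts.entrySet()) {
            if (e.getKey().equalsIgnoreCase(query)) {
                continue; // the word itself is not a useful synonym
            }
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best; // null when the synset holds no other word
    }

    public static void main(String[] args) {
        // Made-up counts for the words of one synset.
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("car", 71);
        counts.put("auto", 3);
        counts.put("automobile", 12);
        counts.put("machine", 0);
        System.out.println(pick("car", counts)); // prints automobile
    }
}
```

If all counts in the synset are equal, this still falls back to an arbitrary choice, so it narrows the problem rather than removing it.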

Julia