Dear everyone, I hear that Google uses up to 7-grams for its semantic-similarity comparisons. I am interested in finding words that are similar in context (e.g. cat and dog), and I was wondering how to compute the similarity of two words on an n-gram model with n > 2.
So basically, given a text like "hello my name is blah blah. I love cats", I generate the 3-gram set of it:
[('hello', 'my', 'name'), ('my', 'name', 'is'), ('name', 'is', 'blah'), ('is', 'blah', 'blah'), ('blah', 'blah', 'I'), ('blah', 'I', 'love'), ('I', 'love', 'cats')]
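For reference, here is a minimal sketch of how I build that set (assuming simple whitespace tokenization, with the period stripped by hand):

    def ngrams(tokens, n):
        # Slide a window of size n over the token list.
        return list(zip(*(tokens[i:] for i in range(n))))

    text = "hello my name is blah blah. I love cats"
    tokens = text.replace(".", "").split()
    print(ngrams(tokens, 3))  # produces the 3-gram set above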
Please respond only if you have suggestions on how to do this specific n-gram problem.
What kind of calculation could I use to find the similarity between 'cats' and 'name' (which should be 0.5)? I know how to do this with bigrams, simply by computing freq(cats, name) / (freq(cats) + freq(name)). But what about n > 2?
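Here is a sketch of the bigram calculation I have now (assuming freq(x) counts the grams that contain x, and freq(x, y) counts those that contain both); what I can't figure out is the right generalization for n > 2:

    def ngrams(tokens, n):
        return list(zip(*(tokens[i:] for i in range(n))))

    def similarity(word_a, word_b, grams):
        # freq(a, b): grams containing both words.
        both = sum(1 for g in grams if word_a in g and word_b in g)
        # freq(a), freq(b): grams containing each word.
        freq_a = sum(1 for g in grams if word_a in g)
        freq_b = sum(1 for g in grams if word_b in g)
        return both / (freq_a + freq_b) if freq_a + freq_b else 0.0

    tokens = "hello my name is blah blah. I love cats".replace(".", "").split()
    print(similarity('love', 'cats', ngrams(tokens, 2)))  # bigram case: 1 / (2 + 1)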