views:

345

answers:

3

Dear everyone, I hear that Google uses up to 7-grams for its semantic-similarity comparison. I am interested in finding words that are similar in context (e.g. cat and dog), and I was wondering how to compute the similarity of two words on an n-gram model given that n > 2.

So basically given a text, like "hello my name is blah blah. I love cats", and I generate a 3-gram set of the above:

[('hello', 'my', 'name'), ('my', 'name', 'is'), ('name', 'is', 'blah'), ('is', 'blah', 'blah'), ('blah', 'blah', 'I'), ('blah', 'I', 'love'), ('I', 'love', 'cats')]
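A trigram set like the one above can be produced with a short sliding-window helper. This is a minimal Python sketch; the tokenizer (splitting on whitespace and dropping the period so windows cross the sentence boundary) is an assumption chosen to reproduce the exact list above:

```python
def ngrams(tokens, n):
    """Slide a window of size n over the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# The period is dropped, so the windows cross the sentence boundary,
# matching the trigram list above.
tokens = "hello my name is blah blah I love cats".split()
trigrams = ngrams(tokens, 3)
# trigrams[0] == ('hello', 'my', 'name'); len(trigrams) == 7
```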

PLEASE DO NOT RESPOND IF YOU ARE NOT GIVING SUGGESTIONS ON HOW TO DO THIS SPECIFIC NGRAM PROBLEM

What kind of calculation could I use to find the similarity between 'cats' and 'name'? (which should be 0.5) I know how to do this with bigrams, simply by dividing freq(cats, name) / (freq(cats) + freq(name)). But what about for n > 2?
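One possible generalization of that bigram formula (my own assumption, not a standard definition): count the n-gram windows in which both words co-occur, and divide by the total number of windows containing either word. A sketch:

```python
def ngram_similarity(ngram_list, w1, w2):
    """Fraction of n-gram windows containing both words, out of all
    windows containing either word; generalizes freq(a,b) / (freq(a) + freq(b))."""
    both = sum(1 for g in ngram_list if w1 in g and w2 in g)
    either = (sum(1 for g in ngram_list if w1 in g)
              + sum(1 for g in ngram_list if w2 in g))
    return both / either if either else 0.0

trigrams = [('hello', 'my', 'name'), ('my', 'name', 'is'), ('name', 'is', 'blah'),
            ('is', 'blah', 'blah'), ('blah', 'blah', 'I'), ('blah', 'I', 'love'),
            ('I', 'love', 'cats')]
sim = ngram_similarity(trigrams, 'cats', 'name')
```

Note that on this toy trigram list the measure comes out 0.0 for 'cats' and 'name', since they never share a window; a larger corpus (or a larger n) is needed before the ratio becomes informative.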

+2  A: 

The following is not an answer to the question, but rather a hopefully helpful commentary about NLP-related questions.
It is marked as CW, so please feel free to edit. Also, please suggest a possibly better place for this "meta" commentary; I realize such text is typically placed in comments to the question, but that wouldn't accommodate a sufficient size or a better layout.


NLP is a rather broad domain!

with many disciplines and associated techniques...
Yet it is often referenced, as in the above question (and several other SO questions), as if it were a single technology or method such as, say, RAID storage or Singular Value Decomposition.
This trend of citing NLP as a monolithic technology is perhaps tied to the recent mainstream arrival, with variable but often notable success, of various NLP-driven solutions, ranging from spell-checking to automatic translation, named entity extraction, and sentiment analysis.

So... here are a few generic recommendations to foster better answers to NLP-related questions (they reproduce general recommendations applicable to all SO posts, but are particularly needed for the reasons mentioned):

  • if possible, get a very high-level overview of the various NLP disciplines, and possibly a more detailed view of a few particular areas of application.
  • because NLP (unlike, say, MVC or SQL) doesn't point to any particular technique or even concept, try to provide a good description of what you are trying to achieve, using examples if possible.
  • unless they are (or appear to be) closely related, avoid asking multiple questions per post (standard SO recommendation).
mjv
A: 

I don't know how Google works, but one known method is calculating the co-occurrence of words in documents. Given that Google has access to nearly every document out there, it is pretty easy for them to compute that factor; combined with each word's frequency of occurrence, you can then derive an association score between words. It is not a measure of semantic similarity (like cat and dog) but rather something closer to collocation.

Take a look: http://en.wikipedia.org/wiki/Tf–idf
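To make the document-level co-occurrence idea concrete, here is a minimal sketch (the helper name and whitespace tokenization are my own assumptions): for each pair of words, count how many documents contain both, alongside each word's document frequency.

```python
from collections import Counter
from itertools import combinations

def doc_cooccurrence(documents):
    """Return (pair_counts, word_counts): how many documents contain
    each word pair, and how many contain each individual word."""
    pair_counts, word_counts = Counter(), Counter()
    for doc in documents:
        words = sorted(set(doc.split()))  # each word counted once per document
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))
    return pair_counts, word_counts

docs = ["the cat chased the dog", "a cat and a dog", "a lone fish"]
pairs, freqs = doc_cooccurrence(docs)
# pairs[('cat', 'dog')] == 2, freqs['cat'] == 2
```

From these counts a simple association score for two words is their pair count divided by their individual counts, in the spirit of the answer above.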

Another approach would be to drop internet documents and focus only on dictionary entries; there have been several attempts to parse those entries and build "common knowledge" systems. This way you could derive relationships automatically (WordNet and the like are manually crafted).

macias
This question is specifically asking how you could apply n-grams to do semantic similarity. I don't think this is what I am looking for.
sadawd
Simply don't take the whole document into account, only the n-gram window. Read "Foundations of Statistical Natural Language Processing" by Christopher D. Manning and Hinrich Schütze (the chapter on collocation detection; I believe it is relevant to your question).
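One of the standard collocation measures discussed in that book is pointwise mutual information (PMI); below is a sketch of computing it over n-gram windows rather than whole documents (the window-based framing is my reading of the comment above, not a quote from the book):

```python
import math

def pmi(windows, w1, w2):
    """PMI of two words co-occurring in the same n-gram window:
    log( P(w1, w2) / (P(w1) * P(w2)) )."""
    n = len(windows)
    p1 = sum(1 for g in windows if w1 in g) / n
    p2 = sum(1 for g in windows if w2 in g) / n
    p12 = sum(1 for g in windows if w1 in g and w2 in g) / n
    if p12 == 0:
        return float('-inf')  # the words never share a window
    return math.log(p12 / (p1 * p2))
```

Positive PMI means the pair co-occurs more often than chance would predict; that is the "bond factor" flavor of association the answer describes.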
macias
A: 

I googled "similarities between trigrams" and came up with this article, which breaks words up into 3-letter segments. I know that is not exactly what you are looking for, but maybe it will help enough to get you going.

The article also compares two words based on the 3-letter approach. It seems like the comparison would need to be between two search terms, like "hello my name is blah blah. I love cats" and "my name is something else. I love dogs". Of course, I don't know much about the domain, so if that is incorrect, my apologies; I was just hoping to spur some thought for your question.
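For what it's worth, this kind of character-level trigram comparison is commonly implemented as a Dice coefficient over letter 3-grams; a small sketch (function names are mine, not from the article):

```python
def char_ngrams(word, n=3):
    """Set of overlapping n-letter segments of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def dice_similarity(a, b, n=3):
    """Dice coefficient over character n-grams: 2*|A & B| / (|A| + |B|)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))
```

This measures surface (spelling) similarity, which is why, as the asker notes below, it does not capture the contextual similarity of words like cat and dog.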

NickLarsen
yeah thx, it doesn't really help but I guess the ideas are still there. This article mainly does comparison on character-level n-grams.
sadawd