How does something like Statistically Improbable Phrases work?

According to Amazon:

Amazon.com's Statistically Improbable Phrases, or "SIPs", are the most distinctive phrases in the text of books in the Search Inside!™ program. To identify SIPs, our computers scan the text of all books in the Search Inside! program. If they find a phrase that occurs a large number of times in a particular book relative to all Search Inside! books, that phrase is a SIP in that book.

SIPs are not necessarily improbable within a particular book, but they are improbable relative to all books in Search Inside!. For example, most SIPs for a book on taxes are tax related. But because we display SIPs in order of their improbability score, the first SIPs will be on tax topics that this book mentions more often than other tax books. For works of fiction, SIPs tend to be distinctive word combinations that often hint at important plot elements.

For instance, for Joel's first book, the SIPs are: leaky abstractions, antialiased text, own dog food, bug count, daily builds, bug database, software schedules.

One interesting complication is that these are phrases of either 2 or 3 words. This makes things a little more interesting because these phrases can overlap with or contain each other.

+1  A: 

I am fairly sure it's the combination of SIPs that identifies the book as unique. In your example it's very rare, almost impossible, that another book contains both "leaky abstractions" and "own dog food".

I am however making an assumption here as I do not know for sure.

Steven
+8  A: 

They are probably using a variation on the tf-idf weight, detecting phrases that occur a high number of times in the specific book but few times in the whole corpus minus the specific book. Repeat for each book.

Thus 'improbability' is relative to the whole corpus and could be understood as 'uniqueness', or 'what makes a book unique compared to the rest of the library'.

Of course, I'm just guessing.
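A rough sketch of that idea in Python (the per-book phrase lists are assumed to have been extracted already, and the add-one smoothing is just to avoid division by zero, not anything Amazon has documented):

    from collections import Counter

    def improbability_ranking(book_phrases, other_books):
        """Rank phrases by how much more frequent they are in this book
        than in the rest of the corpus (add-one smoothing avoids division by zero)."""
        in_book = Counter(book_phrases)
        in_rest = Counter(p for book in other_books for p in book)
        book_total = sum(in_book.values())
        rest_total = sum(in_rest.values()) + 1
        scores = {
            p: (in_book[p] / book_total) / ((in_rest[p] + 1) / rest_total)
            for p in in_book
        }
        return sorted(scores, key=scores.get, reverse=True)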

Vinko Vrsalovic
+11  A: 

It's a lot like the way Lucene ranks documents for a given search query. They use a metric called TF-IDF, where TF is term frequency and IDF is inverse document frequency. The former ranks a document higher the more often the query terms appear in that document, and the latter ranks a document higher if it has terms from the query that appear infrequently across all documents. The specific way they calculate it is log(number of documents / number of documents containing the term), i.e., the inverse of the frequency with which the term appears across documents.

So in your example, those phrases are SIPs relative to Joel's book because they are rare phrases (appearing in few books) and they appear multiple times in his book.
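Here is a minimal sketch of that scoring in Python; it assumes each book has already been reduced to a flat list of candidate phrases, which is my own simplification:

    import math
    from collections import Counter

    def tf_idf(books):
        """books: one list of phrases per book. Returns a phrase -> score dict per book."""
        n_docs = len(books)
        # document frequency: in how many books does each phrase appear at least once?
        df = Counter(phrase for book in books for phrase in set(book))
        scored = []
        for book in books:
            tf = Counter(book)
            scored.append({p: count * math.log(n_docs / df[p]) for p, count in tf.items()})
        return scored

Phrases that appear in every book get a score of zero, and phrases that are frequent in one book but rare elsewhere rise to the top, which is exactly the SIP behaviour described above.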

Edit: in response to the question about 2-grams and 3-grams, overlap doesn't matter. Consider the sentence "my two dogs are brown". Here, the list of 2-grams is ["my two", "two dogs", "dogs are", "are brown"], and the list of 3-grams is ["my two dogs", "two dogs are", "dogs are brown"]. As I mentioned in my comment, with overlap you get N-1 2-grams and N-2 3-grams for a stream of N words. Because 2-grams can only equal other 2-grams and likewise for 3-grams, you can handle each of these cases separately. When processing 2-grams, every "word" will be a 2-gram, etc.
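For completeness, here is what that n-gram extraction might look like, using the sentence from the answer:

    def ngrams(tokens, n):
        """All overlapping n-grams of a token stream; there are len(tokens) - n + 1 of them."""
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    words = "my two dogs are brown".split()
    print(ngrams(words, 2))  # ['my two', 'two dogs', 'dogs are', 'are brown']
    print(ngrams(words, 3))  # ['my two dogs', 'two dogs are', 'dogs are brown']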

danben
It's a little trickier than that though, because phrases can be 2 or 3 words in length, which could overlap or contain each other. TF-IDF is usually described with single terms only.
ʞɔıu
I'm not sure that matters so much, especially if it's restricted to phrases of length 3 or less. For a text stream of N tokens, you have N-1 bigrams and N-2 trigrams. Of course, a bigram is only going to be equal to another bigram, and likewise for a trigram, so you could compute the IDF measures of bigrams and trigrams just as quickly as you could do it for words.
danben
@ʞɔıu: It's usually described in single terms, but there's no need to apply it that way. That's why I mentioned 'a variation on' in my answer. danben's explanation covers it.
Vinko Vrsalovic
What about potential overlaps between the 2-grams and 3-grams? And will 2- and 3-grams need different IDF thresholds?
ʞɔıu
thanks for the update, I guess I still don't quite get it though...
ʞɔıu
Do you have a specific question? I'm happy to explain further, but I don't know what it is you don't get.
danben
+2  A: 

As a starting point, I'd look at Markov Chains.

One option (roughly sketched in code after the list):

  1. build a text corpus from the full index.
  2. build a text corpus from just the one book.
  3. for every m to n word phrase, find the probability that each corpus would generate it.
  4. select the N phrases with the highest ratio of probabilities.
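A very rough sketch of steps 3 and 4, using an add-one-smoothed bigram (first-order Markov) model; the function names and the smoothing choice are my own assumptions, not anything Amazon has published:

    import math
    from collections import Counter

    def train_bigram_model(tokens):
        """First-order Markov model: bigram counts, unigram counts, vocabulary size."""
        return Counter(zip(tokens, tokens[1:])), Counter(tokens), len(set(tokens))

    def log_prob(phrase_tokens, model):
        """Add-one-smoothed log-probability that the model generates the phrase."""
        bigrams, unigrams, vocab = model
        logp = 0.0
        for prev, word in zip(phrase_tokens, phrase_tokens[1:]):
            logp += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab))
        return logp

    def rank_phrases(phrases, book_tokens, corpus_tokens):
        """Step 4: highest ratio of P(phrase | book) to P(phrase | full corpus)."""
        book_model = train_bigram_model(book_tokens)
        corpus_model = train_bigram_model(corpus_tokens)
        key = lambda p: log_prob(p.split(), book_model) - log_prob(p.split(), corpus_model)
        return sorted(phrases, key=key, reverse=True)
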
BCS
+2  A: 

LingPipe has a tutorial on how to do this, and they link to references. They don't discuss the math behind it, but their source code is open, so you can look at how it's implemented.

I can't say I know what Amazon does, because they probably keep it a secret (or at least they just haven't bothered to tell anyone).

Ken Bloom
+1  A: 

You can also see this (SIPs, collocations, etc.) in Sematext's Key Phrase Extractor: http://sematext.com/products/key-phrase-extractor/index.html. There is also a demo there with real, live, constantly changing (news) data.

Otis Gospodnetic