I think what Amazon calls "Statistically Improbable Phrases" are words that are improbable with respect to its huge corpus of data. In effect, even if a word is repeated 1000 times in a given book A, if that book is the only place where it appears, then it is a SIP, because the probability of it appearing in any other given book is close to zero (it is specific to book A). You can't really replicate that wealth of comparison data unless you work with lots of data yourself.
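To make the idea concrete, here is a minimal sketch of that kind of corpus comparison. It is only my guess at the principle, not Amazon's actual method, and it works on single words rather than phrases. The names `tokenize`, `improbable_words` and `max_other_docs`, the crude regex tokenizer, and the toy texts are all assumptions made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())

def improbable_words(target, corpus, max_other_docs=0):
    """Words of `target` found in at most `max_other_docs` texts of `corpus`.

    `corpus` is assumed NOT to contain `target` itself.
    """
    doc_freq = Counter()
    for doc in corpus:
        # Count each word once per document (document frequency).
        doc_freq.update(set(tokenize(doc)))
    return sorted(w for w in set(tokenize(target)) if doc_freq[w] <= max_other_docs)

# Toy data standing in for "lots of data".
corpus = [
    "the sea was calm and the sky was grey",
    "a calm day in the grey city",
]
book_a = "bloom walked by the sea and bloom saw the grey sky"

print(improbable_words(book_a, corpus))
```

Note that "bloom" is reported even though it is repeated within the book: what matters is how many other texts contain it, not how often it occurs in this one. With only two comparison texts, ordinary words like "by" also slip through, which is exactly why the size of the corpus matters.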
What counts as lots of data? Well, if you are analyzing literary texts, then you would want to download and process a couple thousand books from Gutenberg. But if you are analyzing legal texts, then you'd have to specifically feed in the content of legal books.
If, as is probably the case, you don't have the luxury of lots of data, then you have to rely, one way or another, on frequency analysis. But instead of considering relative frequencies (fractions of the text, as is usually done), consider absolute frequencies.
For instance, hapax legomena, also known in the network analysis domain as 1-mice, could be of particular interest. They are words that only appear once in a given text. In James Joyce's Ulysses, for example, these words only appear once: postexilic, corrosive, romanys, macrocosm, diaconal, compressibility, aungier. They are not statistically improbable phrases (as "Leopold Bloom" would be), so they don't characterize the book. But they are terms rare enough that they only appear once in this writer's expression, so you can consider that they characterize, in a way, his expression. They are words that, unlike common words like "the", "color", "bad", etc., he expressly sought to use.
So these are an interesting artifact, and the thing is, they are pretty easy to extract (think a single O(N) pass, with memory that grows only with the number of distinct words), unlike other, more complex, indicators. (And if you want elements which are slightly more frequent, then you can turn to 2-mice, ..., 10-mice, which are similarly easy to extract.)
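As a rough illustration of how little is needed, here is a sketch that extracts k-mice in one pass over the tokens, keeping one counter per distinct word. The function name `k_mice`, the naive regex tokenizer, and the sample sentence are assumptions of mine; a real text would need more careful tokenization (hyphens, accents, and so on).

```python
import re
from collections import Counter

def k_mice(text, k=1):
    """Return the sorted words that occur exactly k times in `text`.

    k=1 gives the hapax legomena (1-mice), k=2 the 2-mice, and so on.
    """
    # Naive tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)  # absolute frequencies, one pass over the tokens
    return sorted(word for word, n in counts.items() if n == k)

sample = "the cat saw the dog and the dog saw a macrocosm"

print(k_mice(sample, 1))  # ['a', 'and', 'cat', 'macrocosm']
print(k_mice(sample, 2))  # ['dog', 'saw']
```

On a text this short, function words like "a" and "and" show up as 1-mice too. On a full-length book they occur far too often to qualify, which is what leaves the rarer, deliberately chosen words.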