ansaurus

Question

How do I compute the approximate entropy of a bit string?

Answer 1

+1 A:

I believe the answer is the Kolmogorov Complexity of the string. Not only is this not answerable with a chunk of pseudocode, Kolmogorov complexity is not a computable function!

One thing you can do in practice is compress the bit string with the best available data compression algorithm. The more it compresses the lower the entropy.

dreeves 2010-06-05 04:48:17

Answer 2

A:

Shannon's entropy equation is the standard method of calculation. Here is a simple implementation in Python, shamelessly copied from the Revelation codebase, and thus GPL licensed:

def entropy(string):
        "Calculates the Shannon entropy of a string"

        # get probability of chars in string
        prob = [ float(string.count(c)) / len(string) for c in dict.fromkeys(list(string)) ]

        # calculate the entropy
        entropy = - sum([ p * math.log(p) / math.log(2.0) for p in prob ])

        return entropy


def entropy_ideal(length):
        "Calculates the ideal Shannon entropy of a string with given length"

        prob = 1.0 / length

        return -1.0 * length * prob * math.log(prob) / math.log(2.0)

Note that this implementation assumes that your input bit-stream is best represented as bytes. This may or may not be the case for your problem domain. What you really want is your bitstream converted into a string of numbers. Just how you decide on what those numbers are is domain specific. If your numbers really are just one and zeros, then convert your bitstream into an array of ones and zeros. The conversion method you choose will affect the results you get, however.

fmark 2010-06-05 04:50:17

Ah, thank you! But this requires you to know the word length in your bit string? For example, I could apply this to my string of 16,320 bits if I assume that those are really 2040 bytes.

dreeves 2010-06-05 04:56:36

Edited answer to provide info about this

fmark 2010-06-05 05:08:26

If you convert to just ones and zeros then won't this algorithm treat "0101010101..." as having the maximum possible entropy?

dreeves 2010-06-05 05:25:19

No. It should return `1`, whereas the maximum entropy grows according to the string length, as per the `entropy_ideal()` function. For a 1 bit string `1` the entropy is `0` and the max entropy is `0`. For a two bit string `10` the entropy is `1` and the max entropy is `1`. For three bit `101` the entropy is remains `1` but the max entropy increases to `~1.5`. As more bits are added, the entropy should remain constant but the maximum possible entropy should increase.

fmark 2010-06-05 06:01:30

But the code you've given seems not to depend on the order of the bits so a string with 0 and 1 in strict alternation should give the same result as any other string with the same number of 0's and 1's.

dreeves 2010-06-05 07:05:59

That is true - is this a problem? If so, using well-known compression algorithms is the easiest proxy for complexity.

fmark 2010-06-05 09:38:35

As per cypherpunks answer, this assumes a model where every character is equally likely at every position.

GregS 2010-06-05 12:34:39

Answer 3

+3 A:

There is no single answer. Entropy is always relative to some model. When someone talks about a password having limited entropy, they mean "relative to the ability of an intelligent attacker to predict", and it's always an upper bound.

Your problem is, you're trying to measure entropy in order to help you find a model, and that's impossible; what an entropy measurement can tell you is how good a model is.

Having said that, there are some fairly generic models that you can try; they're called compression algorithms. If gzip can compress your data well, you have found at least one model that can predict it well. And gzip is, for example, mostly insensitive to simple substitution. It can handle "wkh" frequently in the text as easily as it can handle "the".

Cypherpunks 2010-06-05 06:49:32

I'm not sure I understand your second paragraph.

dreeves 2010-06-05 06:58:01

Answer 4

+1 A:

Entropy is not a property of the string you got, but of the strings you could have obtained instead. In other words, it qualifies the process by which the string was generated.

In the simple case, you get one string among a set of N possible strings, where each string has the same probability of being chosen than every other, i.e. 1/N. In the situation, the string is said to have an entropy of N. The entropy is often expressed in bits, which is a logarithmic scale: an entropy of "n bits" is an entropy equal to 2ⁿ.

For instance: I like to generate my passwords as two lowercase letters, then two digits, then two lowercase letters, and finally two digits (e.g. va85mw24). Letters and digits are chosen randomly, uniformly, and independently of each other. This process may produce 26*26*10*10*26*26*10*10 = 4569760000 distinct passwords, and all these passwords have equal chances to be selected. The entropy of such a password is then 4569760000, which means about 32.1 bits.

Thomas Pornin 2010-06-08 14:10:43

This is correct but I may not have asked the question correctly. See the answer I gave which perhaps indicates the question I meant to ask. But I think it may in fact be standard to refer to the "approximate entropy" of a bitstring. Regardless, this answer is useful and relevant; thanks!

dreeves 2010-06-08 18:18:26

ansaurus

tags:

views:

answers:

How do I compute the approximate entropy of a bit string?

related questions