Hello!

How to calculate the entropy of a file? (Or let's just say a bunch of bytes)
I have an idea, but I'm not sure that it's mathematically correct.

My idea is the following:

  • Create an array of 256 integers (all zeros).
  • Traverse through the file and for each of its bytes,
    increment the corresponding position in the array.
  • At the end: Calculate the "average" value for the array.
  • Initialize a counter with zero,
    and for each of the array's entries:
    add the difference between that entry and the "average" to the counter.

Well, now I'm stuck. How do I "project" the counter result so that all results lie between 0.0 and 1.0? But I'm sure the idea is inconsistent anyway...
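
Here is a rough sketch of the idea in Python (the file name is just for illustration; I take the absolute difference, since otherwise the sum is always zero):

counts = [0] * 256
for b in bytearray(open('some_file.bin', 'rb').read()):
    counts[b] += 1

average = sum(counts) / 256.0
counter = sum(abs(c - average) for c in counts)  # how to map this into 0.0 .. 1.0?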

I hope someone has a better and simpler solution?

Note: I need the whole thing to make assumptions about the file's contents:
(plaintext, markup, compressed or some binary, ...)

+2  A: 

Entropy is a measure of the uncertainty associated with a random variable. Please see the Wikipedia article.

Also Entropy and Information Theory

Mitch Wheat
+9  A: 

A simpler solution: gzip the file. Use the ratio of file sizes, (size-of-gzipped)/(size-of-original), as a measure of randomness (i.e. entropy).

This method doesn't give you the exact absolute value of entropy (because gzip is not an "ideal" compressor), but it's good enough if you need to compare the entropy of different sources.
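
For example, a minimal sketch of this idea in Python, using zlib in place of an external gzip process (the file name is just a placeholder):

import zlib

with open('some_file.bin', 'rb') as f:
    data = f.read()

# A ratio close to 1.0 means the data is hard to compress, i.e. high entropy.
ratio = len(zlib.compress(data, 1)) / float(len(data))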

Igor Krivokon
I also had that idea (as a last option), but I need to analyze a lot of files, so gzipping ALL of them is not an efficient option.
ivan_ivanovich_ivanoff
It depends on how huge your ALL is. I just tried to gzip all the files in /usr/bin; it's about 1000 files, 200 MB. It took about 7 sec. This is the command you can use to get the size: cat * | gzip --fast | wc -c. It's slower than just reading the files byte by byte, but not by much.
Igor Krivokon
gzip has had many man-years of programming effort put into optimization. Might as well take advantage of it.
Nosredna
+2  A: 

Check out Scanning data for entropy anomalies

jitter
+14  A: 
  • At the end: Calculate the "average" value for the array.
  • Initialize a counter with zero, and for each of the array's entries: add the difference between that entry and the "average" to the counter.

With some modifications you can get Shannon's entropy:

rename "average" to "entropy"

import math

entropy = 0.0
for count in Counts:                   # Counts is the 256-entry array of byte counts
    p = count / float(filesize)        # probability of this byte value
    if p > 0:
        entropy -= p * math.log(p, 2)  # log base 2 gives the entropy in bits

Edit: As Wesley mentioned, we must divide the entropy by 8 in order to bring it into the range 0..1 (or alternatively, we can use logarithm base 256).
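
In code, that is either of the following (continuing the names used above):

entropy_01 = entropy / 8.0  # scale the "bits" result into 0..1
# or, equivalently, use base 256 inside the loop:
#   entropy -= p * math.log(p, 256)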

Nick D
One correction: you need to skip the elements with Counts[i] == 0.
Igor Krivokon
You are right, Krivokon, thanks! I see that Wesley did it correctly, except that he chose a 'weird' logarithm base.
Nick D
Yes, it's definitely weird. However, since you're using the more conventional log base 2, you get a value between 0 and 8. You may want to mention this so that the asker can remember to divide the result by 8 to get a value between 0 and 1. (Congrats on the quick answer though - I had to look this stuff up on Wikipedia to remember it. :P)
Wesley
Thank you Wesley!
Nick D
This is a good method; I used it to analyse the "entropy" of an image by comparing the pixel data, and it gave good results.
Matt Warren
+3  A: 

If you use information-theoretic entropy, mind that it might make sense not to use it on bytes. Say, if your data consists of floats, you should instead fit a probability distribution to those floats and calculate the entropy of that distribution.

Or, if the contents of the file are unicode characters, you should use those, etc.
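
For example, a rough sketch of that idea for floats, using a simple histogram as the fitted distribution (numpy assumed, bin count arbitrary):

import numpy as np

def float_entropy(values, bins=64):
    counts, _ = np.histogram(values, bins=bins)
    p = counts[counts > 0] / float(counts.sum())  # empirical probabilities
    return -(p * np.log2(p)).sum()                # entropy in bits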

bayer
When I want to do data analysis on any kind of file, bytes would be my best choice (as a compromise), I think.
ivan_ivanovich_ivanoff
Of course you can do so. However, you should use any additional information you can get; otherwise your results can be extremely poor.
bayer
usuallyuseless is absolutely right. The Shannon entropy will not give you enough information about the file contents. Every compressor has 2 stages: modelling and entropy coding. The entropy coding is necessary, but most of the redundancy is detected in the modelling phase (unless you're dealing with quasi-random data).
Igor Krivokon
usuallyuseless is right here. One way to figure this out is to say in words the full thing that you're calculating: "what is the entropy of the ASCII symbols that I'm using to represent my floating-point numbers" is something you can calculate, but it may not be what you're aiming for.
tom10
Java's MIME-detection capability is limited. It trusts the file extension, and only if the extension is unknown or not present is the MIME type guessed by looking inside the file. I am referring to URLConnection.getContentType().
ivan_ivanovich_ivanoff
+6  A: 

To calculate the information entropy of a collection of bytes, you'll need to do something similar to tydok's answer. (tydok's answer works on a collection of bits.)

The following variables are assumed to already exist:

  • byte_counts is a 256-element list of the number of bytes with each value in your file. For example, byte_counts[2] is the number of bytes that have the value 2.

  • total is the total number of bytes in your file.

I'll write the following code in Python, but it should be obvious what's going on.

import math

entropy = 0

for count in byte_counts:
    # If no bytes of this value were seen in the file, this value doesn't
    # affect the entropy of the file.
    if count == 0:
        continue
    # p is the probability of seeing this byte in the file, as a floating-
    # point number
    p = 1.0 * count / total
    entropy -= p * math.log(p, 256)

There are several things that are important to note.

  • The check for count == 0 is not just an optimization. If count == 0, then p == 0, and log(p) will be undefined ("negative infinity"), causing an error.

  • The 256 in the call to math.log represents the number of discrete values that are possible. A byte composed of eight bits will have 256 possible values.

The resulting value will range from 0 (every single byte in the file is the same) up to 1 (the bytes are evenly divided among every possible value of a byte).


An explanation for the use of log base 256

It is true that this algorithm is usually applied using log base 2. This gives the resulting answer in bits. In such a case, you have a maximum of 8 bits of entropy for any given file. Try it yourself: maximize the entropy of the input by making byte_counts a list where every element is the same value (all 1s, or all 2s, or all 100s). When the bytes of a file are evenly distributed, you'll find that there is an entropy of 8 bits.

It is possible to use other logarithm bases. Using b=2 gives a result in bits, as each bit can have 2 values. Using b=10 puts the result in dits, or decimal digits, as there are 10 possible values for each dit. Using b=256 will give the result in bytes, as each byte can have one of 256 discrete values.

Interestingly, using log identities, you can work out how to convert the resulting entropy between units. Any result obtained in units of bits can be converted to units of bytes by dividing by 8. As an interesting, intentional side-effect, this gives the entropy as a value between 0 and 1.

In summary:

  • You can use various units to express entropy
  • Most people express entropy in bits (b=2)
    • For a collection of bytes, this gives a maximum entropy of 8 bits
    • Since the asker wants a result between 0 and 1, divide this result by 8 for a meaningful value
  • The algorithm above calculates entropy in bytes (b=256)
    • This is equivalent to (entropy in bits) / 8
    • This already gives a value between 0 and 1
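
For completeness, here is one way byte_counts and total could be obtained from a file on disk (the file name is just an example; bytearray is used so that iterating over the data yields integers in both Python 2 and 3):

byte_counts = [0] * 256
with open('some_file.bin', 'rb') as f:
    data = bytearray(f.read())
total = len(data)
for b in data:
    byte_counts[b] += 1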
Wesley
Thanks for the comment... oh, where'd it go? Anyway, I agree that using "byte frequency" is a little confusing. That term has been removed.
Wesley
+1 now. I agree with your comments and modifications, especially the important clarification that this approach gives the entropy in bytes, whereas the usual value is in bits, though bytes does match more what the OP asked for. (Sorry about the deletion earlier. I decided that I didn't want to get involved in this and hoped that I had deleted my comment before anyone saw it.)
tom10
+1 for your good suggestions :)
Nick D
+2  A: 

There's no such thing as the entropy of a file. In information theory, the entropy is a function of a random variable, not of a fixed data set (well, technically a fixed data set does have an entropy, but that entropy would be 0 — we can regard the data as a random distribution that has only one possible outcome with probability 1).

In order to calculate the entropy, you need a random variable with which to model your file. The entropy will then be the entropy of the distribution of that random variable. This entropy will equal the number of bits of information contained in that random variable.

Adam Rosenfield
I'm not aware of the theoretical definition of entropy. But there are always two semantics for every term: the theoretical one and the popular one. Well, it seems that the popular part was understood by everyone here ;)
ivan_ivanovich_ivanoff
There are at least two obvious interpretations in the answers of how someone might translate "the entropy of a file" into a strict mathematical definition. If you really want to understand what you're doing, you should understand the statistical manner in which entropy is modeled in these answers.
James Thompson
Or you could get into Kolmogorov complexity, which is a better mathematical definition but is uncomputable.
Jeffrey Hantin
+2  A: 

Is this something that ent could handle? (Or perhaps it's not available on your platform.)

$ dd if=/dev/urandom of=file bs=1024 count=10
$ ent file
Entropy = 7.983185 bits per byte.
...

As a counterexample, here is a file with no entropy.

$ dd if=/dev/zero of=file bs=1024 count=10
$ ent file
Entropy = 0.000000 bits per byte.
...
Peter Kovacs
Thank you! Good to know about this tool. But I need to solve this programmatically and in a platform-independent way, hence my question.
ivan_ivanovich_ivanoff
A: 

Most people express entropy in bits

The algorithm above calculates entropy in bytes

I think what you say here is quite misleading. It's one and the same entropy, only scaled.

Entropy depends on the possible outcomes of the random variable (in the case of a byte there are obviously 256 possible discrete values), and changing the base of the logarithm only scales the result.

When it comes to bytes, actually using a base different from 2^(2^i) (for i in {1,2,3}) is quite senseless.

There's a better, simple example where one would like to use another base. Let's say you have an oracle that gives output from 'A' to 'Z' and writes each value to a binary file as a byte.

It's obvious that base 26 should be taken; this will fit the entropy result between 0.0 and 1.0.
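
A minimal sketch of that case (the function name is made up):

import math
from collections import Counter

def az_entropy(data):
    # data: bytes containing only values 'A'..'Z'
    counts = Counter(data)
    total = float(len(data))
    # log base 26 puts the result between 0.0 and 1.0
    return -sum((c / total) * math.log(c / total, 26) for c in counts.values())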

GiM