views:

299

answers:

5

How to calculate the entropy of a source? Is there a uniform algorithm for calculating the entropy of a source, e.g. mouse movements or keystrokes?

A: 

Some years back my team needed to test the sufficiency of our random number generator. One team member had a tool that scanned an output file and graded its entropy. It might take some Googling, but it (or something similar) should be out there.

xanadont
A: 

In one sense this is trivial: just convert the input to a stream of numbers somehow (anyhow!) and apply the Shannon definition of entropy.

No problem.
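
For concreteness, here is a minimal sketch of that trivial version (an illustration, assuming the source has already been serialized to a byte string): it estimates the zero-order Shannon entropy from the empirical frequency of each byte value.

    import math
    import os
    from collections import Counter

    def shannon_entropy(data: bytes) -> float:
        """Zero-order Shannon entropy in bits per byte, estimated
        from the empirical frequency of each byte value."""
        counts = Counter(data)
        n = len(data)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    print(shannon_entropy(os.urandom(1 << 16)))  # random bytes: close to 8.0
    print(shannon_entropy(b"aaaa" * 1000))       # constant input: 0.0

Of course, this estimate is only as good as the conversion feeding it, which is exactly the point of the questions below.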

The much more interesting (and harder) questions are

  • Does the resulting entropy depend on the conversion you choose? (Partial answer: if it does, your choice is suboptimal and you're missing something.)
  • If so, how do I get the most out of each source?

As others have noted, this problem is equivalent to knowing that you've gotten the best possible compression for some stream of data.


Possible examples for a mouse:

  1. At some regular interval I sample the mouse's position and just queue up the X and Y locations to the stream.
  2. As above, but I queue up the differences from the last position.
  3. As (1), but instead of regular sampling I trigger on mouse movement somehow (OS hook, interrupt handler, whatever...).
  4. As (2), but instead of regular sampling I trigger on mouse movement somehow (OS hook, interrupt handler, whatever...).
  5. As (3), but I queue up position and time.
  6. As (4), but I queue up position-difference and time.

I would guess that the last two are better than the others.
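
A rough sketch of option (6), assuming the pynput library as the OS hook (an assumption; any event source that delivers a move callback would do). The queued (dx, dy, dt) tuples are what you would feed to an entropy estimator such as the one above.

    import time
    from collections import deque
    from pynput import mouse  # assumed hook; any mouse-event source works

    pool = deque()            # stream of (dx, dy, dt) samples
    last = {"x": None, "y": None, "t": None}

    def on_move(x, y):
        """Triggered per mouse-move event: queue the position
        difference and the inter-event time, not raw coordinates."""
        now = time.perf_counter()
        if last["x"] is not None:
            pool.append((x - last["x"], y - last["y"], now - last["t"]))
        last["x"], last["y"], last["t"] = x, y, now

    with mouse.Listener(on_move=on_move) as listener:
        listener.join()  # collect events until the listener is stopped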

dmckee
+3  A: 

Just try to compress your data. Good compression means little entropy.
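
A sketch of that test, assuming Python's standard zlib module: the compressed-to-original size ratio is a crude entropy gauge.

    import os
    import zlib

    def compression_ratio(data: bytes) -> float:
        """Compressed size over original size: near 1.0 suggests
        high entropy, near 0.0 suggests little entropy."""
        return len(zlib.compress(data, 9)) / len(data)

    print(compression_ratio(os.urandom(100_000)))  # ~1.0: incompressible
    print(compression_ratio(b"abcd" * 25_000))     # ~0.0: highly redundant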

compie
actually thought about doing this to compare entropy in two different data sets, but then imagined myself presenting: "and here I just gzipped and took the ratio of compressed size to non-compressed."
Austin
+1  A: 

Unlike thermodynamic entropy, entropy in information theory has no absolute measure; it depends on the prediction model you use.

For example, if I gave you a million digits of pi starting from position 12000000000, you would find them seemingly random; no compressor you can find will compress that data, and you would conclude that it has very high entropy.

Once I tell you that it's just a piece of pi, the algorithm to generate that sequence is only a few hundred bytes, and you conclude that the sequence has low entropy. You may want to look at http://en.wikipedia.org/wiki/Kolmogorov_complexity

One caveat about human-generated input data: when people hit keys "at random", certain n-grams like "asdf" or "qwe" are more common than others, and when people move the mouse "randomly" they tend to make curlicues and corkscrews.
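
A minimal sketch of that model dependence, assuming the keystrokes are already logged as a string: a first-order (bigram) model assigns "random" typing less entropy than a zero-order model, because it exploits patterns like "asdf".

    import math
    from collections import Counter

    def h0(s: str) -> float:
        """Zero-order model: bits/symbol from single-character frequencies."""
        n = len(s)
        return -sum(c / n * math.log2(c / n) for c in Counter(s).values())

    def h1(s: str) -> float:
        """First-order model: H(next char | previous char) in bits/symbol."""
        pairs = Counter(zip(s, s[1:]))
        firsts = Counter(s[:-1])
        n = len(s) - 1
        return -sum(c / n * math.log2(c / firsts[x])
                    for (x, y), c in pairs.items())

    typed = "asdf" * 50 + "qwer" * 30  # hypothetical "random" key mashing
    print(h0(typed))  # moderately high: 8 keys, near-uniform frequencies
    print(h1(typed))  # much lower: each key nearly determines the next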

rep_movsd
A: 

Thanks a lot for your help and your comments. Unfortunately, I don't know how I can apply e.g. the Shannon entropy formula or the min-entropy formula to an entropy source such as mouse movements or keystrokes.

leo777