How to calculate the entropy of a file?

ivan_ivanovich_ivanoff · Jun 13, 2009 · Viewed 59.2k times

How to calculate the entropy of a file? (Or let's just say a bunch of bytes)
I have an idea, but I'm not sure that it's mathematically correct.

My idea is the following:

  • Create an array of 256 integers (all zeros).
  • Traverse through the file and for each of its bytes,
    increment the corresponding position in the array.
  • At the end: Calculate the "average" value for the array.
  • Initialize a counter with zero,
    and for each of the array's entries:
    add the entry's difference to "average" to the counter.

Well, now I'm stuck. How do I "project" the counter result so that all results lie between 0.0 and 1.0? I suspect the idea is flawed anyway...

I hope someone has a better and simpler solution.

Note: I need this in order to guess at the file's contents
(plaintext, markup, compressed or some other binary, ...)

Answer

Nick Dandoulakis · Jun 13, 2009
  • At the end: Calculate the "average" value for the array.
  • Initialize a counter with zero, and for each of the array's entries: add the entry's difference to "average" to the counter.

With some modifications you can get Shannon's entropy:

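Rename "average" to "entropy" and replace the second step with the loop below, where Counts is the 256-entry array and filesize is the total number of bytes:

entropy = 0.0
for i in 0 .. 255 do
  p = (float) Counts[i] / filesize        // fraction of bytes with value i
  if (p > 0) entropy = entropy - p*lg(p)  // lg is the logarithm with base 2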
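
The loop above translates fairly directly into Python. This is only a minimal sketch, not code from the answer: the function names shannon_entropy and file_entropy are made up for illustration, the whole file is read into memory at once, and returning 0.0 for empty input is my own assumption. The result is in bits per byte, before any normalization.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of `data` in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0  # assumption: treat empty input as zero entropy
    counts = Counter(data)  # maps each byte value to its number of occurrences
    total = len(data)
    entropy = 0.0
    for count in counts.values():
        p = count / total  # true division, so p is a float in (0, 1]
        entropy -= p * math.log2(p)
    return entropy


def file_entropy(path: str) -> float:
    """Read the whole file into memory and return its entropy in bits per byte."""
    with open(path, "rb") as f:
        return shannon_entropy(f.read())
```

For example, shannon_entropy(b"aaaa") is 0.0 because only one byte value occurs, while shannon_entropy(bytes(range(256))) is 8.0 because all 256 values occur equally often.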

Edit: As Wesley mentioned, the result is in bits per byte and lies in the range 0..8, so we must divide entropy by 8 to bring it into the range 0..1 (or, alternatively, use the logarithm with base 256).
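
To see why the two normalizations agree, it may help to write the sum out. The notation here is my own, not the answer's: p_i stands for Counts[i] / filesize, H for the entropy in bits per byte, and terms with p_i = 0 are taken to contribute 0.

```latex
\begin{align*}
H &= -\sum_{i=0}^{255} p_i \log_2 p_i,
  \qquad 0 \le H \le \log_2 256 = 8, \\
\frac{H}{8} &= -\sum_{i=0}^{255} p_i \,\frac{\log_2 p_i}{\log_2 256}
  = -\sum_{i=0}^{255} p_i \log_{256} p_i,
  \qquad 0 \le \frac{H}{8} \le 1.
\end{align*}
```

The upper bound is reached when all 256 byte values are equally likely (every p_i = 1/256), and the lower bound when a single byte value makes up the whole file.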