Huffman encoding - header & EOF

LDM91 picture LDM91 · Nov 20, 2011 · Viewed 10.3k times · Source

I am currently working on implementing a program based on the huffman algorithm in Java, and I am at the stage where I need to output the encoded content to a file. I am a bit confused about how to implement the header and eof needed for decoding. For my header at the moment I have all the unique values that occur from the input file and their frequency, but on some articles I have seen people do it with 0 or 1 represents the nodes and then the frequency (which I am a bit puzzled by as it doesn't say what the symbol is).

Also, for the EOF as I understand it I encode it like the symbols so it gets read and decoded, however I am not sure what value I can use for it that will definitely not come up? I know it needs to a weight of 1 but was unsure how to make sure it won't actually be in the file.

Answer

mange picture mange · Nov 20, 2011

I've had to do this once for an assignment and this is the approach we used.

Encoding the header was done by using 0 and 1 to encode the structure of the tree (rather than the frequencies). A "0" denoted moving along the tree, a "1" denoted we were at a leaf node. This resulted in a sort of pre-order traversal of the tree encoding it uniquely.

For example, a tree like (((a b) c) (d e)) would be encoded as "0001a1b1c01d1e", where a,b,c,d,e are their ASCII forms.

Here's the tree in a graphical form:

     / \
   /\   /\
 /\  c d  e
a  b 

For the EOF we used the last 3 bits in the file to specify how many of the last two bytes needed to be read. Once we read the last byte (so we were working on the second last byte) we checked the last 3 bits: They encoded how many more bits to read, minus 6. So 110101xxxxxxx000 meant "read 110101 (6 bits) and discard everything else". 1101011xxxxxx001 meant "read 1101011 (7 bits) and discard the rest", etc.

Doing it this way meant we didn't have to have a special value denoting the EOF and we could still read the file a byte at a time (although we actually needed to read one byte ahead of where we were working).

(For the EOF I haven't read your articles, so I don't know if our idea works with your approach.)