Into the Huffman Code

With the infrastructure code in place, all we need to do to create a program that demonstrates Huffman coding is to write two routines, CompressFile() and ExpandFile(), and a couple of strings that describe the name of the compression method and program usage. The code for this is found in HUFF.C.
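The driver programs described in the infrastructure chapter expect every compression module to export those two routines plus the two strings. A minimal sketch of what that interface might look like at the top of HUFF.C is shown below; the header name bitio.h and the exact string contents are assumptions based on that framework, not a verbatim listing.

#include <stdio.h>
#include "bitio.h"      /* BIT_FILE and the bit-oriented I/O routines */

/* Strings the generic driver uses to describe this module. */
char *CompressionName = "static order 0 model with Huffman coding";
char *Usage           = "infile outfile [-d]";

/* The two entry points the driver calls for compression and expansion. */
void CompressFile( FILE *input, BIT_FILE *output, int argc, char *argv[] );
void ExpandFile( BIT_FILE *input, FILE *output, int argc, char *argv[] );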

To build the Huffman decoding tree, we need to create a data structure that models the tree on the computer. In our previous examples, each node on the tree had several pieces of information: first, the weight associated with it; second, pointers to two child nodes, one associated with the 0 bit and one associated with the 1 bit. Finally, leaf nodes had the value of the symbol associated with the leaf.

The data structure used in this program to model the Huffman tree was built around the node structure:

typedef struct tree_node {
     unsigned int count;          /* weight of this node                     */
     unsigned int saved_count;    /* copy of count, preserved when count is  */
                                  /* cleared during tree construction        */
     int child_0;                 /* array index of the child for a 0 bit    */
     int child_1;                 /* array index of the child for a 1 bit    */
} NODE;

The first thing to notice about this structure is that there is no information about the value of a leaf node. This is because the node structures are allocated as an array of 514 nodes. The lower nodes in the array are all leaf nodes, and the nodes above them become internal nodes. The value of a leaf is implied by its position in the array: node i is the leaf for symbol i.

Instead of having 256 symbols in our alphabet for this program, we actually have 257. Values 0 through 255 are reserved for the normal range of bytes that fit into a character. The remaining symbol value of 256 is reserved for the end-of-stream indicator. This is the last code written out to the stream, and it indicates that no more data will be arriving. Because of the bit-oriented nature of compressed data, it is not ordinarily a simple matter to determine when you have reached an end-of-file state. Handling it with a special code for end-of-stream is one method for getting around this. Another would be to encode the length of the file as a prefix to the compressed data.

With 257 symbols to deal with, we know in advance the largest possible size of the Huffman tree. If all 257 symbols are in use, we will have 256 internal nodes, meaning that we have to allocate an array of 513 node structures. In the program, I actually allocate 514 and use the last one as a dummy value for comparisons when building the tree.
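In code, that storage might be set up along the following lines. The name END_OF_STREAM for symbol 256 and the statically sized array are illustrative; the point is simply that indices 0 through 256 are the leaves, and the remaining slots hold the internal nodes plus the dummy.

#define END_OF_STREAM  256       /* symbol 256 marks the end of the data    */

/* 257 leaves (symbols 0..256), up to 256 internal nodes, and one extra
 * node used as a high-valued sentinel when searching for the two
 * smallest counts while building the tree.
 */
NODE nodes[ 514 ];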

Counting the Symbols

To build the tree, I first calculate the relative frequencies of the symbols. In HUFF.C, I set up an array of 256 unsigned longs and count the occurrences of every character in the file, from start to end. The position of the file input pointer is saved when the count starts and restored when it is done. All of this takes place in the function count_bytes().
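A minimal sketch of that counting step, following the same pattern as count_bytes() in HUFF.C (error handling omitted):

/* Count the occurrences of every byte in the input file, then put the
 * file pointer back where it was when the count started.
 */
void count_bytes( FILE *input, unsigned long *counts )
{
    long input_marker;
    int c;

    input_marker = ftell( input );           /* remember the start position */
    while ( ( c = getc( input ) ) != EOF )
        counts[ c ]++;
    fseek( input, input_marker, SEEK_SET );  /* restore the input pointer   */
}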

Though I start with 32-bit unsigned long counts, I scale them back significantly in scale_counts(). This routine finds the maximum count for any symbol in the file, then develops a scaling factor that makes that count, and all the others, fit in a single unsigned character. The scaled counts are then copied into the count fields of the first 257 nodes in the array.
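A sketch of that scaling step is shown below. It assumes the NODE array and END_OF_STREAM symbol introduced earlier, and the real scale_counts() in HUFF.C follows the same general approach:

/* Scale the 32-bit counts down so that every count fits in an unsigned
 * character, then copy the scaled counts into the node array.
 */
void scale_counts( unsigned long *counts, NODE *nodes )
{
    unsigned long max_count;
    int i;

    max_count = 0;
    for ( i = 0 ; i < 256 ; i++ )            /* find the largest raw count  */
        if ( counts[ i ] > max_count )
            max_count = counts[ i ];
    max_count = max_count / 255 + 1;         /* scaling divisor             */
    for ( i = 0 ; i < 256 ; i++ ) {
        nodes[ i ].count = (unsigned int) ( counts[ i ] / max_count );
        if ( nodes[ i ].count == 0 && counts[ i ] != 0 )
            nodes[ i ].count = 1;            /* never lose a symbol entirely */
    }
    nodes[ END_OF_STREAM ].count = 1;        /* end-of-stream always appears once */
}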

There are several good reasons for scaling back the counts. First, by limiting any symbol’s weight to an 8-bit unsigned character, I can confine all of the math performed while building the tree to 16-bit unsigned integers. This helps the program run a little faster, and it cuts back on the amount of storage required for the node array. It also limits the maximum length of a Huffman code, ensuring that it will fit in a 16-bit unsigned integer.

Saving the Counts

For the expansion program to correctly expand the Huffman encoded bit stream it will be receiving, it needs a copy of the Huffman tree identical to the one used by the encoder. This means that the tree, or its equivalent, must be passed as a header to the file so the expander can read it in before it starts to read Huffman codes.

The easiest way for the expansion program to get this data would probably be to store the entire node array as a preamble to the compressed data. This would work well and would not be too hard for the compressor to do. An alternative method that occupies far less space in the compressed file, however, is to transmit the symbol counts to the expander. Since the Huffman tree is built up in an unambiguous manner from the symbol counts, the expansion program needs nothing more than the counts to do its job. And since the scaled count array will be only 256 bytes, compared to roughly 4K bytes for the node array, there is good reason to choose this approach.

I elected to try to cut down on the amount of data to be passed even further. Under many circumstances, the number of counts that stay at zero is considerable. With ASCII text files, such as program listings, there will generally be only around 100 symbols in use out of the possible 256. It seems a waste to transmit all those zero counts when they aren’t necessary. To make this happen, I use a slightly more complicated format for the header.

The header used in HUFF.C that contains the symbol counts consists of a series of “count run” definitions, followed by a 0 terminator. A count-run definition consists of the value of the first symbol in the run, followed by the value of the last symbol in the run, followed by the counts for all of the symbols in the run from first to last. This is repeated until each run has been stored in the output file. When there is no more data to store, a first value of zero is written out to the file. Note that a value of zero for the very first run is not treated as an end of data.

For a typical ASCII file, the start of the compressed file might look something like Figure 3.4.


Figure 3.4  The start of a typical compressed ASCII file.

This symbol count format takes a fair amount of work to generate; that work is performed in output_counts() in HUFF.C. Reading the symbol counts back in is much simpler, since the hard work has been done in advance. The counts are read from the compressed file during expansion by the input_counts() routine.
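A sketch of the reading side is shown below. It assumes the BIT_FILE structure from the infrastructure chapter exposes its underlying FILE pointer as a member named file, and it omits the end-of-file error checks the real input_counts() performs on every read:

/* Rebuild the leaf counts from the count-run header at the front of the
 * compressed file.
 */
void input_counts( BIT_FILE *input, NODE *nodes )
{
    int first;
    int last;
    int i;

    for ( i = 0 ; i < 256 ; i++ )            /* symbols not in any run stay at 0 */
        nodes[ i ].count = 0;
    first = getc( input->file );             /* the first run is always read     */
    last  = getc( input->file );
    for ( ; ; ) {
        for ( i = first ; i <= last ; i++ )  /* counts for this run              */
            nodes[ i ].count = (unsigned int) getc( input->file );
        first = getc( input->file );
        if ( first == 0 )                    /* zero here terminates the header  */
            break;
        last = getc( input->file );
    }
    nodes[ END_OF_STREAM ].count = 1;        /* end-of-stream is never stored    */
}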

