Highest-Order Modeling

The previous sample program used order-1 statistics to compress data. It seems logical that if we move to higher orders, we should achieve better compression. The importance of the escape code becomes even clearer here. When using an order-3 model, we potentially have 16 million context tables to work with (actually 256*256*256, or 16,777,216). We would have to read in an incredible amount of text before those 16 million tables accumulated enough statistics to start compressing data, and many of them will never be used at all, which means they take up space in our computer’s memory for no good reason. When compressing English text, for example, it does no good to allocate space for the table QQW: it will never appear.

The solution to this is, again, to set the initial probabilities of all of the symbols to 0 for a given context and to fall back to a different context when a previously unseen symbol occurs. So the obvious question is: what do we use as the fallback context after emitting an escape code? In the last example, we fell back to a default context called the escape context. The escape context was never updated, which meant that using it generally would not provide any compression.

In the higher-order models, there is a better way to compress than automatically falling back to a default context: if an existing context can’t encode a symbol, fall back to the next smaller-order context. If our current context is REQ, for example, and U needs to be encoded for the first time, an escape code is generated. Following that, we drop back to an order-2 model and try to encode the character U using the context EQ. This continues down through the order-0 context. If the escape code is still generated at order 0, we fall back to a special order -1 context. This context is never updated, and it is set up at initialization to have a count of 1 for every possible symbol, so it is guaranteed to encode any symbol.
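The order -1 table can be sketched as a fixed range of one count per symbol. The names below (SYMBOL, order_minus_one, SYMBOL_COUNT) are illustrative, not taken from ARITH-N.C:

```c
#define SYMBOL_COUNT 256

/* Illustrative sketch of the order -1 context (hypothetical names,
 * not from ARITH-N.C).  Every symbol owns a fixed slot of count 1,
 * so the table can encode any symbol, and it is never updated. */
typedef struct {
    unsigned short low_count;   /* cumulative count below the symbol */
    unsigned short high_count;  /* cumulative count including it     */
    unsigned short scale;       /* total count in the table          */
} SYMBOL;

void order_minus_one( int c, SYMBOL *s )
{
    s->low_count  = (unsigned short) c;
    s->high_count = (unsigned short) ( c + 1 );
    s->scale      = SYMBOL_COUNT;
}
```

Because every symbol’s range is exactly one count wide, no symbol ever has a probability of zero here, which is what makes this table a safe floor for the fallback chain.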

Using this escape-code technique means only a slight modification to the driver program. The program (see the code found in ARITH-N.C) now sits in a loop trying to encode its characters, as shown here:

do  {
    escaped = convert_int_to_symbol( c, &s );
    encode_symbol( compressed_file, &s );
} while ( escaped );

The modeling code keeps track of what the current order is, decrementing it whenever an escape is emitted. Even more complicated is the modeling module’s job of keeping track of which context table needs to be used for the current order.
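That bookkeeping can be simulated with a toy function (purely illustrative, not the ARITH-N.C internals): walk down from the top order, one escape per failed context, until a context that has seen the symbol is reached, with order -1 as the guaranteed floor:

```c
/* Illustrative simulation of the fallback loop (not ARITH-N.C code):
 * starting at max_order, each context that has never seen the symbol
 * costs one escape and drops the current order by one; the order -1
 * context always succeeds. */
int orders_tried( int max_order, int first_order_with_symbol )
{
    int order = max_order;
    int tries = 1;                      /* the first attempt         */

    while ( order > first_order_with_symbol && order > -1 ) {
        order--;                        /* escape emitted, fall back */
        tries++;
    }
    return tries;
}
```

For instance, under an order-3 model where the symbol has only been seen down at order 1, orders_tried( 3, 1 ) returns 3: two escapes followed by the successful encoding.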

Updating the Model

ARITH1E.C does a good job of compressing data. But quite a few improvements can still be made to this simple statistical method without changing the fundamental nature of its algorithm. The rest of this chapter is devoted to discussing those improvements, along with a look at a sample compression module, ARITH-N.C, that implements most of them.

Using the highest-order modeling algorithm requires that instead of keeping just one set of context tables for the highest order, we keep a full set of context tables for every order up to the highest order. If we are doing order-2 modeling, for example, there will be a single order-0 table, 256 order-1 tables, and 65,536 order-2 tables. When a new character is encoded or decoded, the modeling code updates one of these tables for each order. In the earlier order-3 example of U following REQ, the modeling code would update the U counters in the order-3 REQ table, the order-2 EQ table, the order-1 Q table, and the order-0 table. The code to update all of these tables is shown next:

for ( order = 0 ; order <= max_order ; order++ )
   update_model( order, symbol );

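Those table counts are easy to check: an order-n model keeps 256^n tables at level n, so order-2 modeling needs 1 + 256 + 65,536 tables in all. A small helper (illustrative only, not part of ARITH-N.C) makes the arithmetic concrete:

```c
/* Illustrative helper: the number of context tables at a given order
 * is 256 raised to that order, one table per possible context string
 * of that length. */
long tables_at_order( int order )
{
    long n = 1L;

    while ( order-- > 0 )
        n *= 256L;
    return n;
}
```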
A slight modification to this algorithm results in both faster updates and better compression. Instead of updating all the different order models for the current context, we can update only those models actually used to encode the symbol. This is called “update exclusion,” since it excludes the unused lower-order models from being updated. It generally gives a small but noticeable improvement in the compression ratio. Update exclusion works because a symbol that shows up frequently in the higher-order models will rarely be encoded through the lower-order models; incrementing the lower-order counters for it only distorts the statistics those models use when they actually are consulted, after an escape. The modified code for update exclusion looks like this:

for ( order = encoding_order ; order <= max_order ; order++ )
   update_model( order, symbol );
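The difference between the two update policies can be seen in a toy simulation (hypothetical code, not from ARITH-N.C): without exclusion every order from 0 up is incremented, while with exclusion only the tables at the encoding order and above are touched:

```c
#define MAX_ORDER 3

/* Toy comparison of the two update policies (hypothetical code, not
 * from ARITH-N.C).  counts[ i ] stands in for the symbol's counter
 * in the order-i table. */
void update_counts( int counts[ MAX_ORDER + 1 ], int encoding_order,
                    int use_exclusion )
{
    int order;

    for ( order = use_exclusion ? encoding_order : 0 ;
          order <= MAX_ORDER ; order++ )
        counts[ order ]++;
}
```

If a symbol was encoded at order 2, the plain policy bumps the counters at orders 0 through 3, while update exclusion leaves orders 0 and 1 untouched.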

