The Data Compression Book-:Statistical Modeling

Table of Contents

This code not only creates the 256 context arrays, it also initializes each symbol’s count to 1. At this point, we can begin encoding symbols as they come in. The loop for encoding the symbols looks similar to the one used for other adaptive programs. Here is an order 1 arithmetic compression loop:

for ( ; ; ) {
  c = getc( input );
  if (c == EOF )
    c = END_OF_STREAM;
  convert_int_to_symbol( c, context, &s );
  encode_symbol( output, &s );
  if ( c == END_OF_STREAM )
    break;
  update_model( c, context );
  context = c;
}

This works fairly simply. Instead of just having a single context table, like the code in chapter 5, we now have a set of 256 context tables. Every symbol is encoded using the context table from the previously seen symbol, and only the statistics for the selected context get updated after the symbol is seen. This means we can now more accurately predict the probability of a character’s appearance.

The decoding process for this order 1 code is also very simple, and it looks similar to the decoding example from chapter 5. Here is the order 1 expansion loop:

for ( ; ; ) {
   get_symbol_scale( context, &s );
   count = get_current_count( &s );
   c = convert_symbol_to_int( count, context, &s );
   remove_symbol_from_stream( input, &s );
   if (c == END_OF_STREAM )
     break;
   putc( (char) c, output );
   update_model( c, context );
   context = c;
}

The only difference between this and conventional order-0 code is the addition of the context variable, both within the loop and as a parameter to other functions. The remaining routines that differ from the code in Chapter 5 are are shown next. The C source for this module is included on the program disk.

void update_model( int symbol, int context )
   int i;

   for ( i = symbol + 1 ; i <= ( END_OF_STREAM + 1 ) ; i++ )
      totals[ context ][ i ]++;
   if ( totals[ context ][ END_OF_STREAM + 1 ] < MAXIMUM_SCALE )
       return;
   for ( i = 1 ; i <= ( END_OF_STREAM + 1 ) ; i++ ) {
      totals[ context ][ i ] /= 2;
      if ( totals[ context ][ i ] <= totals[ context ][ i - 1 ] )
       totals[ context ][ i ] = totals[ context ][ i - 1 ] + 1;
   }
}

void convert_int_to_symbol( int c, int context, SYMBOL *s )
{
  s->scale = totals[ context ][ END_OF_STREAM + ];
  s->low_count = totals[ context ][ c ];
  s->high_count = totals[ context ][ c + 1 ];
}

void get_symbol_scale( int context, SYMBOL *s )
{
  s->scale = totals[ context][ END_OF_STREAM + 1 ];
}
int convert_symbol_to_int( int count, int context, SYMBOL *s)
{

  int c;

   for ( c = 0; count >= totals[ context ][ c + 1 ] ; c++ )
     ;
   s->high_count = totals[ context ][ c + 1 ];
   s->low_count = totals[ context ][ c ];
   return( c );
}

Using the Escape Code as a Fallback

The simple order-1 program does in fact do a creditable job of compression, but it has a couple of problems to address. First, the model for this program makes it a slow starter. Every context starts off with 257 symbols initialized to a single count, meaning every symbol starts off being encoded in roughly eight bits. As new symbols are added to the table, they will gradually begin to be encoded in fewer bits. This process, however, will not happen very quickly.

For the context table for the letter q, for example, we will probably see a a very high number of u symbols. The very first u will have a probability of 1/257, and will accordingly be encoded in eight bits. The second u will have a probability of 2/258, but will still require over seven bits to encode. In fact, it will take sixteen consecutive u symbols with no other appearances before the entropy of the symbol is reduced to even four bits.

The reason for this slow reduction in bit count is obvious. The probability of the u symbol is being weighted down by the other 256 symbols in the table. Though they may never appear in the message, they need a nonzero count. If their count were reduced to zero, we would not be able to encode them if and when they appeared in the message.

There is a solution to this problem, however, and it is relatively painless. Instead of having every symbol appear automatically in every table, start off with a nearly empty table and add symbols to the table only as they appear. The q table would have zero counts for all the other symbols, giving the first u that appears a low bit count.

But there is a catch here. If a symbol doesn’t appear in a context table, how will it be encoded when it appears in a message? The easiest way to accomplish this is to use an escape code. The escape code is a special symbol (much like the end-of-stream symbol) that indicates we need to “escape” from the current context.

Table of Contents