

Escape Probabilities

When the program first starts encoding a text stream, it will emit quite a few escape codes. The number of bits used to encode escape characters will probably have a large effect on the compression ratio, particularly in small files. In our first attempts to use escape codes, we set the escape count to 1 and left it there, regardless of the state of the rest of the context table. Bell, Cleary, and Witten call this “Method A.” Method B sets the count for the escape character at the number of symbols presently defined for the context table. If eleven different characters have been seen so far, for example, the escape symbol count will be set at eleven, regardless of what the counts are.

Both methods seem to work fairly well. The code in our previous program can easily be modified to support either one. Probably the best thing about Methods A and B is that they are not computationally intensive. Adding the escape symbol to a Method A table, for example, can be arranged so that updating the table costs almost nothing more than it would without the escape symbol.
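
To make the distinction concrete, here is a minimal sketch of the two methods. The structure and field names are hypothetical, not the actual data structures used in this chapter's programs; the only point is how the escape count is chosen.

#define METHOD_A  0
#define METHOD_B  1

/* Hypothetical context table: all this sketch needs is the number of
 * distinct symbols that have been seen in the context so far. */
struct context_table {
    int symbols_seen;
};

/* Method A always gives the escape symbol a count of 1.  Method B gives
 * it a count equal to the number of symbols currently defined. */
int escape_count( struct context_table *table, int method )
{
    if ( method == METHOD_A )
        return 1;
    return table->symbols_seen;
}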

The next sample program, ARITH-N.C, implements a slightly more complicated algorithm for calculating the escape count. It takes three different factors into account when calculating the escape probability. First, as the number of symbols defined in the context table increases, the escape probability naturally decreases. It reaches its minimum when the table has all 256 symbols defined, at which point an escape can never occur and the escape probability drops to 0.

Second, it takes into account a measure of randomness in the table. It calculates this by dividing the maximum count in the table by the average count. The higher the ratio, the less random the table. The REQ table, for example, may have only three symbols defined: U, with a count of 50; u, with a count of 10; and e, with a count of 3. The ratio of U’s count, 50, to the average, 21, is fairly high. The U is thus predicted with a relatively high amount of accuracy, and the escape probability ought to be lower. In a table where the high count was 10 and the average was 8, things would seem a little more random, and the escape probability should be higher.

The third factor taken into account is simply the raw count of how many symbols have been seen for the particular table. As the number of symbols increases, the predictability of the table should go up, making the escape probability go down.

The formula I use for calculating the count for the escape symbol is shown below.

count = (256 - number of symbols seen) * number of symbols seen
count = count / (256 * the highest symbol count)
if count is less than 1
  count = 1

The missing variable in this equation is the raw symbol count. This is implicit in the calculation, because the escape probability is the escape count divided by the raw count. The raw count will automatically scale the escape count to a probability.
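
Expressed in C, the calculation might look like the sketch below. The structure and names are hypothetical rather than the actual ARITH-N.C code, but the arithmetic follows the formula just given.

/* Hypothetical summary of a context table: the number of distinct
 * symbols defined in it and the highest individual symbol count. */
struct table_stats {
    int symbols_seen;
    int max_count;
};

/* The escape count shrinks as more symbols are defined and as the most
 * popular symbol comes to dominate the table, but never drops below 1. */
int compute_escape_count( struct table_stats *stats )
{
    int count;

    count = ( 256 - stats->symbols_seen ) * stats->symbols_seen;
    count = count / ( 256 * stats->max_count );
    if ( count < 1 )
        count = 1;
    return count;
}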

Scoreboarding

When using highest-order modeling techniques, an additional enhancement, scoreboarding, can improve compression efficiency. When we first try to compress a symbol, we either generate the code for the symbol or emit an escape code. If we emit an escape code, the symbol has not previously occurred in that context, so its count there is 0. But we do gain some information just by emitting the escape: every symbol defined in that context is now known not to be the one we are encoding. Those symbols can temporarily have their counts set to 0 when we calculate the probabilities for the lower-order models. The counts are reset to their permanent values after the encoding of the current character is complete. This process is called scoreboarding.
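
One simple way to implement this is with an array of 256 flags, one per possible symbol. The names below are hypothetical rather than the actual ARITH-N.C identifiers: each context that emits an escape marks the symbols it contains, and the lower-order contexts then treat marked symbols as if their counts were 0.

static char scoreboard[ 256 ];    /* one flag for each possible symbol */

/* Reset the scoreboard before encoding each new character. */
void clear_scoreboard( void )
{
    int i;

    for ( i = 0 ; i < 256 ; i++ )
        scoreboard[ i ] = 0;
}

/* Called for every symbol defined in a context that just emitted an
 * escape; those symbols are known not to be the one being encoded. */
void mark_symbol( int symbol )
{
    scoreboard[ symbol ] = 1;
}

/* When totaling counts in a lower-order context, a marked symbol
 * temporarily contributes nothing. */
int effective_count( int symbol, int raw_count )
{
    return scoreboard[ symbol ] ? 0 : raw_count;
}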

An example of this is shown below. If the present context is HAC and the next symbol is K, we will use the tables shown next to encode the K. Without scoreboarding, the HAC context generates an escape with a probability of 1/6. The AC context generates an escape with a probability of 1/8. The C context generates an escape with a probability of 1/40, and the “” context finally generates a K with a probability of 1/73.


“”             “C”            “AC”           “HAC”
ESC    1       ESC    1       ESC    1       ESC    1
‘K’    1       ‘H’   20       ‘C’    5       ‘E’    3
‘E’   40       ‘T’   11       ‘H’    2       ‘L’    1
‘I’   22       ‘L’    5                      ‘C’    1
‘A’    9       ‘A’    3

If we use scoreboarding to exclude counts of previously seen characters, we can make a significant improvement in these probabilities. The first escape, from the HAC context, isn't affected, since no characters had been seen before it. But the AC escape code eliminates the C from its calculations, raising the escape probability to 1/3. The C escape code excludes the H and the A counts, increasing the probability from 1/40 to 1/17. And finally, the “” context excludes the E and A counts, increasing the probability of the K from 1/73 to 1/24. This reduces the number of bits required to encode the symbol from about 17.1 to 12.8, a significant savings.
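
The bit counts quoted above are just the sums of -log2 of the probabilities used at each step, and the short test program below reproduces them from the figures in the example.

#include <math.h>
#include <stdio.h>

/* Denominators of the probabilities used to code 'K' after "HAC":
 * the HAC escape, the AC escape, the C escape, then 'K' in "". */
static double plain[ 4 ]        = { 6.0, 8.0, 40.0, 73.0 };
static double scoreboarded[ 4 ] = { 6.0, 3.0, 17.0, 24.0 };

int main( void )
{
    double plain_bits = 0.0;
    double scoreboarded_bits = 0.0;
    int i;

    for ( i = 0 ; i < 4 ; i++ ) {
        plain_bits += log( plain[ i ] ) / log( 2.0 );
        scoreboarded_bits += log( scoreboarded[ i ] ) / log( 2.0 );
    }
    printf( "Without scoreboarding: %.1f bits\n", plain_bits );
    printf( "With scoreboarding:    %.1f bits\n", scoreboarded_bits );
    return 0;
}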

Keeping a symbol scoreboard will almost always result in some improvement in compression, and it will never make things worse. The major problem with scoreboarding is that the probability tables for all of the lower-order contexts have to be recalculated every time a table is accessed. This results in a big increase in the CPU time required to encode text. Scoreboarding is left in ARITH-N.C to demonstrate the gains possible when compressing text with it.

