The Data Compression Book-:Dictionary-Based Compression

Table of Contents

The output of a QIC-122 encoder consists of a stream of data, which, in turn, consists of tokens and symbols freely intermixed. Each token or symbol is prefixed by a single bit flag that indicates whether the following data is a dictionary reference or a plain symbol. The definitions for these two sequences are: (1) plaintext: <1><eight-bit-symbol>; (2) dictionary reference: <0><window-offset><phrase-length>.

The QIC-122 encoder complicates things by further encoding the window-offset and phrase-length codes. Window offsets of less than 128 bytes are encoded in seven bits. Offsets between 128 bytes and 2,047 bytes are encoded in eleven bits. The phrase length uses a variable-bit coding scheme which favors short phrases over long. This explanation will gloss over these as “implementation details.” The glossed-over version of the C code for this algorithm is shown here.

while ( !out_of_symbols ) {
  length = find_longest_match(&offset);
  if ( length > 1 ) {
    output_bit( 0 );
    length = find_longest_match( &offset );
    output_bits( offset );
    output_bits( length );
    shift_input_buffer( length );
  } else {
    output_bit( 1 );
    output_byte( buffer[ 0 ] );
    shift_input_buffer( 1 );
  }
}

Following is an example of what this sliding window looks like when used to encode some C code, in this case the phrase “output_byte.” The previously encoded text, which ends with the phrase “output_bit( 1 );\r,” is at the end of the window. The find_longest_match routine will return a value of 8, since the first eight characters of “output_byte” match the first eight characters of “output_bit.” The encoder will then output a 0 bit to indicate that a dictionary reference is following. Next it will output a 15 to indicate that the start of the phrase is fifteen characters back into the window (‘\r’ is a single symbol). Finally, it will output an 8 to indicate that there are eight matching symbols from the phrase.

Figure 7.1 A sliding window used to encode some C code.

Using QIC-122 encoding, this will take exactly sixteen bits to encode, which means it encodes 8 bytes of data with only 2 bytes. This is clearly a respectable compression ratio, typical of how QIC-122 works under the best circumstances as shown here:

Figure 7.2 Encoding 8 bytes of data using only 2 bytes.

After the dictionary reference is output, the input stream over eight characters, with the last symbol encoded becoming the last symbol in the window. The next three symbols will not match anything in the window, so they will have to be individually encoded.

This example of QIC-122 gives a brief look at how a dictionary-based compression scheme might work. Chapter 8 will take a more extensive look at LZ77 and its derivatives.

Israeli Roots

Dig beneath the surface of virtually any dictionary-based compression program, and you will find the work of Jacob Ziv and Abraham Lempel. For all practical purposes, these two Israeli researchers gave birth to this branch of information theory in the late 1970s.

Research in data compression up to 1977 included work on entropy, character and word frequencies, and various other facets of statistical modeling. There were minor forays into other esoteric areas of interest, such as finite state machines and linguistic models, but research went mainly into new and improved methodologies for driving Huffman coders.

All this changed in 1977 with the publication of Jacob Ziv’s and Abraham Lempel’s “A Universal Algorithm for Sequential Data Compression” in IEEE Transactions on Information Theory. This paper, with its 1978 sequel “Compression of Individual Sequences via Variable-Rate Coding,” triggered a flood of dictionary-based compression research, algorithms, and programs.

The two compression techniques developed in these papers are called LZ77 and LZ78 (the transposition of the author’s initials is apparently an innocent historical accident, but one that is here to stay). LZ77 is a “sliding window” technique in which the dictionary consists of a set of fixed-length phrases found in a “window” into the previously processed text. The size of the window is generally somewhere between 2K and 16K bytes, with the maximum phrase length ranging from perhaps 16 to 64 bytes. LZ78 takes a completely different approach to building a dictionary. Instead of using fixed-length phrases from a window into the text, LZ78 builds phrases up one symbol at a time, adding a new symbol to an existing phrase when a match occurs.

It is easy to think that these two compression methods are closely related, particularly since people will casually speak of “Lempel Ziv Compression” as if it were just one thing. These are, however, two very different techniques. They have different implementation problems and solutions, different strengths and weaknesses. Since they have had such a large impact on the world of data compression, the next two chapters of this book will take a detailed look at an LZ77 and an LZ78 implementation.

Table of Contents