The Data Compression Book-:LZ78 Compression

Table of Contents

The next_code variable is then set to the first available code. This program uses code 256 as an end-of-file marker, so the first code defined will have a value of 257. As new strings are read in and defined, this number grows until it reaches the maximum code value of 4,095.

Finally, the first character is read in and assigned to the loop variable string_code. We can arbitrarily assign the first character to a code value, since it is a special single-character string.

After initialization, the main encoding loop begins. The working variable, string_code, keeps track of which code matches the characters read in so far. When the program first starts, that is just as single-character string, but as the dictionary grows, string_code can represent very long strings.

A single character is read in from the input file, then find_child_node() is called to see if the current string_code has a child node that corresponds to that character. If it does, the child’s code is assigned to string_code, and we move back to the top of the loop.

If there is no child node, we have reached the end of our string match. When this occurs, we output the current code, then start over with a new string_code. Finally, we add the new string created by the combination of string_code and character to the dictionary so the next time it occurs we will get a match.

The main loop repeats until an end-of-file is read in. When this occurs, we output the string_code built up so far. Finally, the END_OF_STREAM code is output, which tells the decoder when we are at the end of the data stream.

Decompression

As mentioned previously, maintaining the dictionary is simpler during decompression. We don’t ever have to navigate down through the tree. Instead, we read in codes straight from the encoded stream, then work our way up through the tree. As long as the parent nodes are properly defined in the data structure, everything works fine.

The only problem with working up through the tree is that the decoded characters are gathered in reverse order, so they have to be pushed into a stack, popped off in reverse order, and written to the output file. This is done with the decode_string routine, shown next.

decode_string() follows the parent pointers up though the dictionary until it finds a code less than 256, which we have defined as the first character in the string. A count of characters in the decode stack is then returned to the calling program.

    unsigned int decode_string( count, code )
    unsigned int count;
    unsigned int code;
    {
       while ( code > 255 ) {
        decode_stack[ count++ } = dict[ code ].character;
        code = dict[ code].parent_code;
    }
    decode_stack[ count++ ] = (char) code;
    return( count );
   }

Once the decode routine is in place, the decompression routine falls into order fairly easily. The routine has a few lines of initialization code, followed by a main decoding loop.

The initialization section of the decompression routine in LZW12.C sets up the next_code variable. This let it track the code value of each string as it is added to the table. Next, is reads in the first code and copies it to the output file. Once that is done, it can enter the main decoding loop.

   next_code = FIRST_CODE;
   old_code = InputBits( input, BITS );
   if ( old_code == END_OF STREAM )
    return;
   character = old_code;
   putc( old_code, output );
   while ( ( new_code = InputBits( input, BITS ) ) != END_OF_STREAM ) {
    if ( new_code >= next_code ) {
     decode_stack[ 0 ] = (char) character;
     count = decode_string( 1, old_code );
    }
    else
     count = decode_string( 0, new_code );
    character = decode_stack[ count - 1 ];
    while ( count > 0 )
     putc( decode_stack[ --count ], output );
    if ( next_code <= MAX_CODE ) {
      dict[ next_code ].parent_code = old_code;
      dict[ next_code ].character = (char) character;
    next_code++;
   }
   old_code = new_code;
 }

Normally, decoding is a simple matter. The loop reads in a code value, looks up the string, and outputs it. Then it create a new string by adding the old_code and the first character of the current string to the string table. It then goes back to the top of the loop and starts over.

But an additional complication is created when the CHARACTER+ STRING+CHARACTER+STRING+CHARACTER sequence shows up. This creates a code larger than the largest currently defined code. Fortunately, we know what to do in this case. Our new string will be the same as our last string, defined by old_code, with a copy of its first character appended to its end. This is handled by preinitializing the decoding_stack with a single character, then decoding the old_code string into the stack with an offset of one instead of zero.

Table of Contents