The Data Compression Book-:LZ78 Compression

Chapter 9
LZ78 Compression

The LZ77 algorithms have a deficiency that must have troubled their creators, Jacob Ziv and Abraham Lempel. These algorithms use only a small window into previously seen text, which means they continuously throw away valuable phrases because they slide out of the dictionary. The phrase “Jacob Ziv,” for example, may have appeared in the text in Chapter 7. If this book were being compressed using an LZ77 algorithm, it almost certainly would not appear in the sliding window by the time we arrived at this chapter. But the book would compress better if “Jacob Ziv” did appear in the dictionary, since it shows up four or five times in scattered locations throughout the text.

The sliding window makes LZ77 algorithms biased toward exploiting recency in the text. Many, but not all, streams of data tend to “look” more like what has been seen recently than what was seen long ago. Encoding a telephone book is a good example of this. After encoding the first “Jones” entry, it would seem that the word “Jones” was showing up everywhere. But the word “Adams,” seen long ago, would probably not show up at all. As we moved through the phone book, recency effects would be patently obvious, and LZ77 would take advantage of them.

While compressing the phone book, if we looked carefully, we would also notice that some data was more or less immune to recency effects. The street address associated with each listing, for example, would only show the effect faintly, when two listings with the same last name lived at the same address. This would result in fewer matches to text in the window, leading to less effective compression.

A second deficiency in LZ77 compression is the limited size of a phrase that can be matched. The longest match is approximately the size of the look-ahead buffer, which in the previous chapter was just 17 bytes. Many of the seventeen-byte matches found when compressing using that algorithm may actually be much longer.

Can LZ77 Improve?

One obvious way to tackle these problems is simply to start tinkering with the size of the window and the size of the look-ahead buffer. Instead of using a 4K window and a seventeen-byte buffer, for example, why not use a 64K text window and a 1K look-ahead buffer? Wouldn’t that address both problems effectively?

While raising the size of both parameters does seem to address these problems, the scheme has two major drawbacks. First, when we increase the buffer size from 4K to 64K, we now need sixteen bits to encode an index location instead of twelve. And instead of needing four bits to encode a phrase length, we now need ten. So the cost of encoding a phrase rises from seventeen bits to twenty-seven.

This 50 percent increase in the bit size of an index/length token can have a severely negative impact on the compression algorithm. For one thing, it will change the BREAK_EVEN point in the program from just under two characters to three characters. This means that matches of three or fewer characters will no longer be effectively coded in index/length tokens and will instead have to be encoded as single characters. Encoding data as single characters is actually less efficient than plain text, since it needs an additional bit to indicate that a normal character is coming.

An even more distressing effect is that changing these parameters will drastically increase the amount of CPU time needed to perform compression. Under LZ77, just changing the text window size from 4K to 64K will result in the average search taking sixteen times longer, since every string in the window is compared to the look-ahead buffer. The situation is somewhat better under LZSS, since the strings are kept in a binary tree. In this case, the runtime cost of the window size is proportional to the logarithm of the window size. But this still means over a 30 percent increase in runtime.

The real penalty comes when the size of the look-ahead buffer is increased. Since our string comparisons between the text window phrases and the look-ahead buffer proceed sequentially, the runtime here will increase in direct proportion to the length of the look-ahead buffer. Going from sixteen to 1,024 characters means this portion of the program is going to run sixty-four times more slowly—a costly penalty indeed.

These effects combine to effectively cancel out any gains from increasing either of these parameters in an LZ77 algorithm. And even with a 64K text window, we are still effectively tied to an algorithm that depends on recency to perform adequate compression.

Enter LZ78

To effectively sidestep these problems, Ziv and Lempel developed a different form of dictionary-based compression. This algorithm, popularly referred to as LZ78, was published in “Compression of Individual Sequences via Variable-Rate Coding” in IEEE Transactions on Information Theory (September 1978).

LZ78 abandons the concept of a text window. Under LZ77 the dictionary of phrases was defined by a fixed window of previously seen text. Under LZ78, the dictionary is a potentially unlimited list of previously seen phrases.

LZ78 is similar to LZ77 in some ways. LZ77 outputs a series of tokens. Each token has three components: a phrase location, the phrase length, and a character that follows the phrase. LZ78 also outputs a series of tokens with essentially the same meanings. Each LZ78 token consists of a code that selects a given phrase and a single character that follows the phrase. Unlike LZ77, the phrase length is not passed since the decoder knows it.

Unlike LZ77, LZ78 does not have a ready-made window full of text to use as a dictionary. It creates a new phrase each time a token is output, and it adds that phrase to the dictionary. After the phrase is added, it will be available to the encoder at any time in the future, not just for the next few thousand characters.

Table of Contents

Chapter 9LZ78 Compression

Can LZ77 Improve?

Enter LZ78

Chapter 9
LZ78 Compression