The Data Compression Book: Introduction to Data Compression

Chapter 1
Introduction to Data Compression

The primary purpose of this book is to explain various data-compression techniques using the C programming language. Data compression seeks to reduce the number of bits used to store or transmit information. It encompasses a wide variety of software and hardware compression techniques, which can be so unlike one another that they have little in common except that they compress data. The LZW algorithm used in the CompuServe GIF specification, for example, has virtually nothing in common with the CCITT G.721 specification used to compress digitized voice over phone lines.

This book will not take a comprehensive look at every variety of data compression. The field has grown in the last 25 years to a point where this is simply not possible. What this book will cover are the various types of data compression commonly used on personal and midsized computers, including compression of binary programs, data, sound, and graphics.

Furthermore, this book will either ignore or only lightly cover data-compression techniques that rely on hardware for practical use or that require hardware applications. Many of today’s voice-compression schemes were designed for the worldwide fixed-bandwidth digital telecommunications networks. These compression schemes are intellectually interesting, but they require a specific type of hardware tuned to the fixed bandwidth of the communications channel. Different algorithms that don’t have to meet this requirement are used to compress digitized voice on a PC, and these algorithms generally offer better performance.

Some of the most interesting areas in data compression today, however, do concern compression techniques just becoming possible with new and more powerful hardware. Lossy image compression, like that used in multimedia systems, for example, can now be implemented on standard desktop platforms. This book will cover practical ways to both experiment with and implement some of the algorithms used in these techniques.

The Audience

You will need basic programming skills to follow a discussion of data-compression code. The ability to follow block-structured code, such as C or Pascal, is a requirement. In addition, understanding computer architecture well enough to follow bit-oriented operations, such as shifting, logical ORing and ANDing, and so on, will be essential.
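As a rough gauge of the bit-oriented operations mentioned above, here is a minimal sketch of packing variable-length codes into a byte buffer. The helper names (set_bit, get_bit, put_bits) and the most-significant-bit-first ordering are assumptions made for this illustration; they are not taken from the book's own code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: set or clear bit number `index` in `buf`,
   counting from the most significant bit of buf[0]. */
void set_bit(unsigned char *buf, size_t index, int value)
{
    /* The shift moves a single 1 bit to the wanted position in the byte. */
    unsigned char mask = (unsigned char)(0x80 >> (index & 7));

    if (value)
        buf[index >> 3] |= mask;                  /* OR turns the bit on */
    else
        buf[index >> 3] &= (unsigned char)~mask;  /* AND with the inverted mask turns it off */
}

/* Hypothetical helper: return bit number `index` of `buf` as 0 or 1. */
int get_bit(const unsigned char *buf, size_t index)
{
    return (buf[index >> 3] >> (7 - (index & 7))) & 1;
}

/* Hypothetical helper: append the low `count` bits of `code` to `buf`,
   starting at bit position `pos`, most significant bit first.
   Returns the bit position that follows the last bit written. */
size_t put_bits(unsigned char *buf, size_t pos, unsigned int code, int count)
{
    assert(count >= 0 && count <= 16);
    while (count-- > 0)
        set_bit(buf, pos++, (int)((code >> count) & 1u));
    return pos;
}
```

Starting from a zeroed buffer, writing the 3-bit code 5 (binary 101) and then the 5-bit code 31 (binary 11111) with put_bits leaves buf[0] equal to 0xBF (binary 10111111). If each step of that example is easy to follow, the bit manipulation in the rest of the book should not present a problem.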

This does not mean that you need to be a C guru for this book to be worthwhile. You don’t even have to be a programmer. But the ability to follow code will be essential, because the concepts discussed here will be illustrated with portable C programs. The C code in this book has been written with an eye toward simplicity in the hope that C novices will still be able to follow the programs. We will avoid the more esoteric constructs of C, but the code will be working, tested C, not pseudocode or English.

Why C?

The use of C to illustrate data-compression algorithms may raise some hackles, although less so these days than when the first edition of this book came out. A more traditional way to write this book would have been to use pseudocode to sketch out the algorithms. But the lack of rigor in a pseudocode “program” often leads to hazy or incomplete definitions full of lines like “PROCESS FILE UNTIL OUT OF DATA.” The result is that pseudocode is easy to read, but not so easy to translate into a working program.

If pseudocode is unsatisfactory, the next best choice is to use a conventional programming language. Though hundreds of choices are available, C seems the best choice for this type of book for several good reasons. First, in many respects C has become the lingua franca of programmers. That C compilers support computers ranging from a lowly 8051 microcontroller to supercomputers capable of 100 million instructions per second (100 MIPS) has had much to do with this. It doesn’t mean that C is the language of choice for all programmers. What it does mean is that most programmers should have a C compiler available for their machines, and most are probably regularly exposed to C code. Because of this, many programmers who use other languages can still manage to code in C, and even more can at least read C.

A second reason for using C is that it is a language without too many surprises. The few constructs it uses as basic language elements are easily translated to other languages. So a data-compression program that is illustrated using C can be converted to a working Pascal program through a relatively straightforward translation procedure. Even assembly-language programmers should find the process relatively painless.

Perhaps the most important reason for using C is simply one of efficiency. C is often thought of as a high-level assembly language, since it allows programmers to get close to the hardware. Despite the increasing optimization found in recent C compilers, it is not likely that C will ever exceed the speed or size possible in hand-coded assembly language. That flaw is offset, however, by the ability to easily port C code to other machines. So for a book of this type, C is probably the most efficient choice.

Which C?

Despite being advertised as a “portable” language, a C program that compiles and executes on a given machine is not guaranteed to run on any other. It may not even compile using a different compiler on the same machine. The important thing to remember is not that C is portable, but that it can be portable. The code for this book has been written to be portable, and it compiles and runs cleanly using several compilers and environments. The compilers/environments used here include:

• Microsoft Visual C++ 1.5, MS-DOS 5.0/6.22

• Borland C++ 4.0-4.5, MS-DOS 5.0/6.22

• Symantec C++ 6.0-7.0, MS-DOS 5.0/6.22

• Interactive Unix System 3.2 with the portable C compiler

• Solaris 2.4 with SunSoft compiler

• Linux 1.1 with the GNU C compiler
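To make the difference between "is portable" and "can be portable" concrete, consider storing an integer in a compressed file. Writing an int directly from memory ties the file format to the size and byte order of int on the machine that wrote it. The sketch below shows one common alternative, in which the bytes are written in a fixed order with shifts and masks. The function names store_u32_be and load_u32_be are hypothetical, and this is only one possible approach, not necessarily the one taken by the code in this book.

```c
#include <assert.h>

/* Hypothetical helper: store the low 32 bits of `value` in out[0..3],
   most significant byte first, whatever the host's byte order is. */
void store_u32_be(unsigned char *out, unsigned long value)
{
    /* unsigned long is at least 32 bits wide, but it may be wider. */
    assert(value <= 0xFFFFFFFFUL);
    out[0] = (unsigned char)((value >> 24) & 0xFF);
    out[1] = (unsigned char)((value >> 16) & 0xFF);
    out[2] = (unsigned char)((value >> 8) & 0xFF);
    out[3] = (unsigned char)(value & 0xFF);
}

/* Hypothetical helper: rebuild the value from four bytes that were
   written by store_u32_be. */
unsigned long load_u32_be(const unsigned char *in)
{
    return ((unsigned long)in[0] << 24) |
           ((unsigned long)in[1] << 16) |
           ((unsigned long)in[2] << 8) |
            (unsigned long)in[3];
}
```

Because the byte order is fixed by the code rather than by the hardware, a value stored this way on one of the machines listed above should read back correctly on any of the others.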
