SQUID Frequently Asked Questions: Cache Digests

Previous Next Table of Contents

Cache Digests allow cooperating caches to exchange digests of their contents in a compact form. If Cache A has a digest of cache B, then A knows what documents are likely to be (or not to be) in cache B. Consequently, cache A can decide whether to fetch an object from cache B without contacting cache B first. Cache Digests use a ``lossy compression'' technique which causes them to have some inaccuracies.

One may think about Cache Digests as zero-delay ICP with somewhat less precise information.

Cache Digest eliminates the need for per-client ``Do you have it?'' queries and network delays associated with them. The client response time improves. Network bandwidth requirements may improve.

16.2 How does this work?

Cache Digest is implemented as two virtually independent modules or algorithms described below.

Local Digest

A cache maintains its local digest as a variation of a Bloom filter. Bloom filter is a bit array with bits turned on in the positions corresponding to hash values of the data (URL MD5s in our case). Bloom filters have been around for ages and are commonly used in Databases and Linguistic applications.

In short, whenever an object gets cached, we turn on a few bits based on object's URL. Later, if somebody wants to know if a certain URL is cached, she needs to perform the same hashing sequence and check that all the corresponding bits are ``on''.

Note that due to collisions, a digest may return a ``hit'' when object is not in the cache. This phenomenon is called false-hit. In its pure implementation, Bloom filter can never return a ``miss'' for a cached object (no false-misses).

Local digests are stored as ordinary objects on disk and could be requested by other caches. There is also a memory-resident copy that keeps track on current updates. Disk copy of a local digest is synchronized periodically with the up-to-date in-memory copy.

Collection of Peer Digests

A cache requests local digests from its peers via standard HTTP protocol. A collection of peer digests is kept up-to-date based on Expire HTTP headers supplied with each digest. Peer digests are stored the same way local digests are: memory-resident copy for fast lookups, and disk resident copy for recovery and sharing.

When client requests an object, the cache does a lookup in all the digests it has and goes to the closest peer which digest gave a hit. If all digests gave misses, we currently go direct.

Note that a given cache may maintain local digest but have no collection of peer digests and vv. These two modules are independent of each other.

16.3 Why Cache Digest? What about ICP?

ICP works the best when the number of peers is small, network connectivity is good, and round trip delays between peers are short. To route an HTTP request, Squid sends an ICP message to every applicable peer and then waits for replies from those peers. This waiting often creates excessive client delays. Clearly, the scheme does not scale well with the number of peers and is sensitive to poor network conditions.

Cache Digests do not send any messages on a per-request basis. It is keen at guessing where a missing object can be. In many cache meshes Cache Digests are a much better approach compared to ICP.

Nothing comes for free. Cache Digests have three potential problems.

Big caches may have big digests. Currently we allocate 4 bits per digest entry and the number of entries is about 1.5 times the maximum number of objects in the cache.
False hits are as bad as with ICP. With Cache Digest they may happen more often.
False misses might force your cache to fetch an object from an origin server while it is available from a peer right next to you.

No cooperative caching scheme is perfect. There always will be configurations where one or another algorithm performs better. We believe that with proper tuning and experience, the advantages of Cache Digest approach will far outweigh its problems.

16.4 So do you have any real measurements?

Our initial research and results have been written up in a paper to be presented at the Third International Web Caching Workshop.

16.5 Are there any similar developments?

There are quite a few, actually. The most relevant work is the Summary Cache approach proposed by Pei Cao. Summary Cache also utilizes Bloom filters but uses different (often opposite) design decisions regarding information flow, summary updates, etc.

16.6 How Can I Try it?

The Cache Digest feature is available in Squid-2. To compile code for Cache Digests, you need to add a configure option:

        ./configure --enable-cache-digests ...

16.7 How do I know its working?

Check your access.log and look at the 9th field. When a request is forwarded to a neighbor due to a Cache Digest, the hierarchy log code will be CACHE_DIGEST_HIT.

16.8 Is the Cache Digest spec documented anywhere?

It is now, thanks to Martin Hamilton and Alex Rousskov.

Cache Digests, as implemented in Squid 2.1.PATCH2, is described in cache-digest-v5.txt. You'll notice the format is similar to an Internet Draft. We decided not to submit this document as a draft because Cache Digests will likely undergo some important changes before we want to try to make it a standard.