Bonnie, there's a stampede... in your tent!

Mitch Robbins - City Slickers(1991)

I've recently had a problem with a rather large operation (that could probably be optimised considerably, but nevermind), where by if the cached result of the operation expired, several web server threads would attempt the operation, causing some major headaches for our database and web servers. This is something I've come across before, and is commonly(?) known as a Cache Stampede. This is bad, this post describes the basics of what I've done to deal with it.

Basic Caching

Given a Big Fecking Operation, we wrap some simple code around it using a Zend_Cache instance to cache the result for an hour - ignore the use of globals and lack of otherwise good coding practices

<?php

function bfo9000() {
    sleep(5);
    return time();
}

function bfo9000_cached() {
    global $cache;
    $key = 'bfo9000_cache';
    $res = $cache->load($key);
    if (false === $res) {
        $res = bfo9000();
        $cache->save($res, $key, array(), 3600);
    }

    return $res;
}

Semaphores

Once we start to get stampedes, the first thing we want to do is try and limit the amount of threads/processes that are going to try and regenerate the result and store it in the cache again. We can do this using a Binary Semaphore.

Note: Zend cache isn't exactly ideal for this, as the combination of test and save are not atomic, using something like Memcached::add would be better if you really want to avoid two processes generating the cached result.

Our function will now be programmed, when getting a cache miss, to check to see if the semaphore exists, if so, will sleep for a little while before checking again. If it doesn't exists, it will create the semaphore itself, before populating the cache and releasing the semaphore.

<?php

function bfo9000_cached() {
    global $cache;
    $key = 'bfo9000_cache';
    $lockKey = 'bfo9000_cache_lock';
    $res = $cache->load($key);
    if (false === $res) {

        while (false === $res && $cache->test($lockKey)) {
            sleep(1);
            $res = $cache->load($key);
        }

        if (false === $res) {
            $cache->save(true, $lockKey, array(), 10);
            $res = bfo9000();
            $cache->save($res, $key, array(), 3600);
            $cache->remove($lockKey);
        }
    }

    return $res;
}

All being well, we should have prevented the stampede, as several of the threads that require the result of the operation will wait while one (or maybe a few) do the actual ground work.

Shortening the window

We've now prevented the stampede from bringing the servers to their knees, but we're not necessarily providing the best experience for those consumers that simply wait while the other threads do their stuff. To improve this, we can manage the expiry of the cache ourselves and attempt to prime the cache before it fully expires. I'm not sure if there's a name for this technique, but it's mentioned on the Memcached FAQ.

UPDATE: Adam's pointed out that Varnish would call this a Grace Period

We need to store the actual expiry with the data in the cache, rather than relying on the cache software's expiry, luckily for my example, Zend_Cache does that for us. When we get a cache hit, we check to see if the cached result is due to expire soon. If it is, we immediately update the cached result with an expiry date much further in the future, and then set about updating the cache. This way, other processes will see the cached result as valid while our process updates the cache.

<?php

function bfo9000_cached() {
    global $cache;
    $key = 'bfo9000_cache';
    $lockKey = 'bfo9000_cache_lock';
    $res = $cache->load($key);
    if (false === $res) {

        while (false === $res && $cache->test($lockKey)) {
            sleep(1);
            $res = $cache->load($key);
        }

        if (false === $res) {
            $cache->save(true, $lockKey, array(), 30);
            $res = bfo9000();
            $cache->save($res, $key, array(), 3600);
            $cache->remove($lockKey);
        }
    } else {
        $meta = $cache->getMetaDatas($key);
        if ($meta['expire'] < time() + 300 && !$cache->test($lockKey)) {
            $cache->touch($key, 3600);
            $cache->save(true, $lockKey, array(), 30);
            $res = bfo9000();
            $cache->save($res, $key, array(), 3600);
            $cache->remove($lockKey);
        }
    }

    return $res;
}

Note in the main else block, if the semaphore already exists, we leave generating a new cached result to the process that's doing so and simply return the result we already have.

Disclaimer

I wrote all of the code above right here in the post, please don't copy and paste it in to your projects and complain to me if it doesn't work :) I think I've got it just about right and it seems to work for me, let me know nicely of all the mistakes I've made in the comments below.