RedisConf17 - Internet Archive - Preventing Cache Stampede with Redis and XFetch

Preventing cache stampede with Redis & XFetch
Jim Nelson <jnelson@archive.org>
Internet Archive
RedisConf 2017

Internet Archive
Universal Access to All Knowledge
Founded 1996, based in San Francisco
Archive of digital and physical media
Includes Web, books, music, film, software & more
Digital holdings: over 30 petabytes & counting
Key collections & services:
Wayback Machine
Grateful Dead live concert collection

Internet Archive ♡ Redis
Caching & other services backed by 10-node sharded Redis cluster
Sharding performed client-side via consistent hashing (PHP, Predis)
Each node supported by two replicated mirrors (fail-over)
Specialized Redis instances also used throughout IA’s services, including
Wayback, search, and more

Caching: Quick terminology
I assume we all know what caching is. This is the terminology I’ll use today:
Recompute: Expensive operation whose result is cached
(database query, file system read, HTTP request to remote service)
Expiration: When a cache value is considered stale or out-of-date
(time-to-live)
Evict: Removing a value from the cache
(to forcibly invalidate a value prior to expiry)

Cache stampede
“A cache stampede is a type of cascading failure that can
occur when massively parallel computing systems with
caching mechanisms come under very high load. This
behaviour is sometimes also called dog-piling.”
–Wikipedia
https://en.wikipedia.org/wiki/Cache_stampede

Cache stampede: A scenario
Multiple servers, each with multiple workers serving requests, accessing a
common cached value
When the cached value expires or is evicted, all workers experience a
simultaneous cache miss
Workers recompute the missing value, causing overload of primary data
sources (e.g. database) and/or hung requests

Congestion collapse
Hung workers due to network congestion or expensive recomputes—that’s bad
Discarded user requests—that’s bad
Overloaded primary data stores (“Sources of Truth”)—that’s bad
Harmonics (peaks & valleys): brief periods of intense activity (mini-outages)
followed by lulls—that’s bad
Imagine a cached value with a TTL of 1hr enjoying 10,000 hits/sec. That’s good.
Now imagine 10,000 cache misses at 1hr+1sec. That’s bad.

Typical cache code
function fetch(name)
    var data = redis.get(name)
    if (!data)
        // cache miss: every worker that reaches this point recomputes
        data = recompute(name)
        redis.set(name, expires, data)
    return data
This “looks” fine, but consider tens of thousands of workers calling this code at once:
no mutual exclusion, no upper bound on simultaneous recomputes or writes … that’s a cache stampede
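
The effect described above can be reproduced on one machine. The following is a minimal sketch, not code from the talk: a plain dict stands in for Redis, `recompute` sleeps for 50 ms to imitate an expensive query, and `run_workers` is a hypothetical helper that releases all threads at the same instant against an empty cache.

```python
import threading
import time

cache = {}                       # stand-in for Redis (assumption for this sketch)
recompute_calls = 0
_count_lock = threading.Lock()   # protects the counter only, not the cache


def recompute(name):
    """Pretend to be an expensive query that takes about 50 ms."""
    global recompute_calls
    with _count_lock:
        recompute_calls += 1
    time.sleep(0.05)
    return f"value-for-{name}"


def fetch(name):
    """Same shape as the slide's fetch(): read, recompute on miss, write."""
    data = cache.get(name)
    if data is None:
        data = recompute(name)
        cache[name] = data
    return data


def run_workers(n, name="popular-key"):
    """Release n workers at the same instant and return the recompute count."""
    barrier = threading.Barrier(n)

    def worker():
        barrier.wait()           # all workers hit the empty cache together
        fetch(name)

    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return recompute_calls
```

Calling `run_workers(20)` normally reports many recomputes for a single key, because every worker sees the miss before the first one has written its result back.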

Typical stampede solutions
(a) Locking
One worker acquires lock, recomputes, and writes value to cache
Other workers wait for lock to be released, then retry cache read
Primary data source is not overloaded by requests
Redis is often used as a cluster-wide distributed lock:
https://redis.io/topics/distlock

Problems with locking
Introduces extra reads and writes into code path
Starvation: expiration / eviction can lead to blocked workers waiting for a
single worker to finish recompute
Distributed locks may be abandoned

Typical stampede solutions
(b) External recompute
Use a separate process / independent worker to recompute value
Workers never recompute
(Alternately, workers recompute as fall-back when external process fails)

Problems with external recompute
One more “moving part”—a daemon, a cron job, work stealing
Requires fall-back scheme if external recompute fails to run
External recomputation is often not easily deterministic:
caching based on a wide variety of user input
periodic external recomputation of 1,000,000 user records
External recomputation may be inefficient if cached values are never read by

XFetch
(Probabilistic early recomputation)

Probabilistic early recomputation (PER)
Recompute cache values before they expire
Before expiration, one worker “volunteers” to recompute the value
Without evicting old value, volunteer performs expensive recompute—
other workers continue reading cache
Before expiration, volunteer writes new cache value and extends its
time-to-live
Under ideal conditions, there are no cache misses

XFetch
Full paper title: “Optimal Probabilistic Cache Stampede Prevention”
Authors:
Andrea Vattani (Goodreads)
Flavio Chierichetti (Sapienza University)
Keegan Lowenstein (Bugsnag)
Archived at IA:
https://archive.org/details/xfetch

The algorithm
XFetch (“exponential fetch”) is elegant:
delta * beta * ln(rand())
where
delta – Time to recompute value
beta – Control (default: 1.0; > 1.0 favors earlier recomputation, < 1.0 favors later)
rand – Random number [ 0.0 … 1.0 ]
ln – Natural logarithm (log base e)
Remember: the logarithm of any number between 0 and 1 is negative, so XFetch produces a negative value
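
The formula can be tried directly. The sketch below is an illustration, and `xfetch_gap` is a hypothetical name. It draws the random number from (0.0, 1.0] rather than [0.0, 1.0] so that the logarithm is always defined; that is an assumption of this sketch.

```python
import math
import random


def xfetch_gap(delta, beta=1.0, rng=random):
    """Return delta * beta * ln(u) for a uniform u in (0.0, 1.0].

    The result is never positive. Its magnitude is how far ahead of
    the real expiry this particular request is willing to look.
    """
    u = 1.0 - rng.random()   # random() is in [0.0, 1.0), so u is in (0.0, 1.0]
    return delta * beta * math.log(u)
```

For a uniform random number, the magnitude of the result is exponentially distributed with mean `delta * beta`, so most requests look ahead by roughly one recompute time and only a few look much further.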

Updated code
function fetch(name)
    // read the value together with its last recompute time (delta) and remaining TTL
    var data, delta, ttl = redis.get(name, delta, ttl)
    if (!data or xfetch(delta, time() + ttl))
        data, recompute_time = recompute(name)
        redis.set(name, expires, data), redis.set(delta, expires, recompute_time)
    return data

function xfetch(delta, expiry)
    /* XFetch is negative, so subtracting it moves time() forward */
    return time() - (delta * BETA * log(rand(0,1))) >= expiry
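
A runnable version of the same flow might look like the sketch below. It is not the speaker's code: a dict named `_store` stands in for Redis, entries are `(data, delta, expiry)` tuples, and the `now` and `rng` parameters exist only so the behaviour can be tested. All of these names are assumptions.

```python
import math
import random
import time

BETA = 1.0
_store = {}   # name -> (data, delta, expiry); stand-in for Redis (assumption)


def should_recompute_early(delta, expiry, now, beta=BETA, rng=random):
    """XFetch test: True when this request volunteers to recompute."""
    u = 1.0 - rng.random()                 # uniform in (0.0, 1.0], keeps log defined
    return now - delta * beta * math.log(u) >= expiry


def fetch(name, recompute, ttl, now=None, rng=random):
    """Return the cached value, recomputing on a miss or an early volunteer."""
    now = time.time() if now is None else now
    entry = _store.get(name)
    if entry is not None and entry[2] <= now:
        entry = None                       # expired: treat as a cache miss
    if entry is None or should_recompute_early(entry[1], entry[2], now, rng=rng):
        start = time.perf_counter()
        data = recompute(name)
        delta = time.perf_counter() - start    # duration of this recompute
        _store[name] = (data, delta, now + ttl)
        return data
    return entry[0]
```

While the old entry is still valid, workers that do not volunteer keep reading it, so the recompute happens in the background of normal traffic rather than after a miss.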

Can more than one volunteer recompute?
Yes. You should know this before using XFetch.
It’s possible for more than one worker to “roll” the magic number and start a
recompute. The odds of this occurring increase as the expiration deadline
approaches.
If your data source absolutely cannot be accessed by multiple workers, use a
lock or another sentinel—XFetch will minimize lock contention
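
How likely is a second volunteer? Assuming the random number is uniform, rearranging the XFetch test gives a per-request volunteer probability of exp(-gap / (delta * beta)), where gap is the time left before expiry. The sketch below works that out; it is derived from the formula on the slides rather than taken from the talk, and both function names are hypothetical.

```python
import math


def volunteer_probability(gap, delta, beta=1.0):
    """Chance that one request volunteers when `gap` seconds remain.

    From time() - delta * beta * ln(u) >= expiry with u uniform:
    the test passes exactly when u <= exp(-gap / (delta * beta)).
    """
    if gap <= 0:
        return 1.0               # at or past expiry every request recomputes
    return math.exp(-gap / (delta * beta))


def expected_volunteers(requests, gap, delta, beta=1.0):
    """Expected number of volunteers among `requests` independent requests."""
    return requests * volunteer_probability(gap, delta, beta)
```

With a 1-second recompute and 10,000 requests arriving 10 seconds before expiry, the expected number of volunteers is below one. The same traffic 2 seconds before expiry yields over a thousand, which is why a lock remains useful for data sources that cannot tolerate concurrent recomputes.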

How to determine delta?
XFetch must be supplied with the time required to recompute.
The easiest approach is to store the duration of the last recompute and read it
with the cached value.
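
One way to keep the duration next to the value is to time the recompute and serialize both into a single payload, so one read returns everything XFetch needs. This is a sketch under that assumption; the JSON layout and the names `recompute_timed`, `pack`, and `unpack` are illustrative, not from the talk.

```python
import json
import time


def recompute_timed(recompute, name):
    """Run the recompute and return (data, seconds it took)."""
    start = time.perf_counter()
    data = recompute(name)
    delta = time.perf_counter() - start
    return data, delta


def pack(data, delta):
    """Serialize the value and its recompute time into one cache payload."""
    return json.dumps({"data": data, "delta": delta})


def unpack(payload):
    """Inverse of pack(): return (data, delta) from a cached payload."""
    entry = json.loads(payload)
    return entry["data"], entry["delta"]
```

The packed string can be written with a single cache write and an expiry, and the next reader gets `delta` back without a second round trip.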

What’s the deal with the beta value?
beta is the one knob you have to tweak XFetch.
beta > 1.0 favors earlier recomputation, < 1.0 favors later recomputation.
My suggestion: Start with the default (1.0), instrument your code, and change
only if necessary.

XFetch & Redis
Let’s look at some sample code

Redis & XFetch
Jim Nelson <jnelson@archive.org>
Internet Archive
RedisConf 2017
