MuCache: A General Framework for Caching in Microservice Graphs

Haoran Zhang*, Konstantinos Kallas*, Spyros Pavlatos, Rajeev Alur, Sebastian Angel, and Vincent Liu
University of Pennsylvania

https://www.usenix.org/conference/nsdi24/presentation/zhang-haoran
…what can developers do today? Broadly speaking, developers today add caching by either (a) creating manual or application-specific coherence protocols, which are error-prone and fail to generalize; (b) focusing on the backend-storage layer [24, 30], which ignores the significant advantages of terminating request call graphs early; or (c) giving up on consistency and …

FIGURE 2—(Left) Movie Review application fragment. (Right) An example execution of this application. Each line corresponds to a different component, and arrows denote communication. (C) components are caches and (CM) cache managers.
FIGURE 4—An application exhibiting a "diamond" pattern. (Edge labels: (1), (2) call("write", k, v); (3), (4) call("read", k).)

…communication between shards on the request processing critical path—cache managers of different shards only communicate invalidation in the background. At the same time, the invalidation delays in MuCache are very small (ms)—orders of magnitude smaller than standard values of TTL used in practice to evict cache items (seconds to hours) (§7).

Correctness. The correctness condition for MuCache is based on classical refinement modulo reordering, i.e., that all behaviors exposed by a cache-enabled application are equivalent to a behavior of a cache-free version after potentially reordering independent observable events. The execution in Figure 3 is correct because it could have been observed from the original application if the write (1) had happened right after (2) and (3) since they are independent requests from different clients.

Guaranteeing correctness is challenging for call graphs with more than one path between the same two services, i.e., when a request accesses the same backend service twice in its lifetime. Figure 4 shows such an example of a 'diamond' pattern. In this example, a service S1 first calls S2, which in turn calls S4 that writes to its store. Then S1 calls S3, which calls S4 trying to read from the same value that was written by S2. It is possible that S1 could find the result of a previous call to S3 in its cache, reading a stale value, leading to an execution that would not be observable without caches. Since microservice call graphs are dynamic, this pattern cannot be identified and prevented statically (before execution). MuCache addresses this at runtime by keeping track of visited services during request processing, not checking a cached entry if it depends on a service that has already been visited.

Summary. We conclude this section by describing how MuCache satisfies the previously stated requirements:

• Correctness: We prove that MuCache does not introduce behaviors that are not part of the original application (§5).
• Non-blocking and low overhead: Cache managers do all processing in the background and the wrappers only send messages to them, never blocking for a response.
• Dynamic graphs: MuCache tracks dependencies to guarantee correctness in the presence of dynamic call graphs.
• Sharding: MuCache supports sharding without any additional communication on the critical path.
• Application and datastore agnostic: MuCache does not require any modification to the application or datastore code because wrappers intercept all communication.
• Incremental deployment: Developers can gradually declare read-only endpoints to get incremental benefits.

4 MuCache Protocol

Figure 5 shows the complete MuCache protocol for the wrappers of a single service shard and its cache manager in Python-like pseudocode. The wrapper of each service communicates with its associated cache manager through an ordered message queue (using SendToCM). Downstream cache managers also issue Save/Inv events to upstream ones through the same queue. Cache managers in different shards of the same service also communicate with each other when broadcasting invalidations using SendToShardCM.

The code on the left depicts wrapper logic run before a request starts processing (preReqStart), when a request has finished processing (preReturn), when a request reads from a key (preRead), before a request performs a call to another service (preCall), and after a request writes to a key (postWrite). The code on the right depicts cache manager logic, which processes events in the message queue sent by the wrappers and cache managers of other services.

Wrapper. The wrappers keep two types of state. The first is a global (per service shard) readsets map from request identifiers to sets of keys and call arguments, which keeps the dependencies of each pending read-only (RO) request. The second is the per-request context ctx, which is carried around while a request is processed. ctx contains (1) the id of the request (ctx.call_id); (2) the hash value of the request's arguments (ctx.ca); (3) the caller of the request (ctx.caller); (4) the visited services of the request and its subrequests (ctx.visited); and (5) whether the current request is read-only and, therefore, cacheable by its caller (ctx.isRO). Wrappers send a Start(ca) message to their associated cache manager before a request starts processing and then maintain the request readset when a read or a subrequest is performed. Once the request is complete, the entire readset, along with the call arguments, the caller, the return value, and the visited services are sent to the cache manager as an End(ca, rs, caller, ret, vs) message. Wrappers also send Inv(k) messages to cache managers after a datastore key k is modified. preCall checks the cache before invocation and returns directly upon cache hits.

Cache manager. The cache manager controls the contents of the cache. The cache manager contains two global (per service shard) state components: saved and history. The saved map acts as an inverted index of wrappers' readsets by mapping keys (or call arguments) to the corresponding service that has read (or called) them. When a key or a set of calls is invalidated, the cache manager looks up saved to locate all the affected upstream services and asks them to invalidate the set of relevant calls that they have cached by sending them Inv messages. The second state component, history, is a sequence of calls and invalidations used to determine whether a call can be safely cached upstream. When a request with readset rs is complete, the cache manager scans the history in reverse chronological order for invalidations that intersect with rs since the call started. If there is no such invalidation, it asks the upstream cache manager to save the result. The cost of this scan is proportional to the product of request rate and average request duration, which is typically a small number. For example, a service handling 10,000 requests per second, each lasting 100 milliseconds, requires scanning several thousand items.
Figure 5 (Left): wrapper code.

1  global readsets : map(Key, set(Key | CallArgs))
2
3  def preReqStart(ctx):
4    if ctx.isRO:
5      cid, ca = ctx.call_id, ctx.ca
6      readsets[cid] = set()
7      SendToCM(Start(ca))
8
9  def preReturn(ctx, ret):
10   if ctx.isRO:
11     cid, ca = ctx.call_id, ctx.ca
12     rs = readsets.pop(cid)
13     caller = ctx.caller
14     vs = ctx.visited
15     SendToCM(End(ca, rs, caller, ret, vs))
16
17 def postWrite(ctx, k, _v):
18   SendToCM(Inv(k))
19
20 def preRead(ctx, k):
21   if ctx.isRO:
22     cid = ctx.call_id
23     readsets[cid].insert(k)
24
25 def preCall(ctx, ca):
26   if ctx.isRO:
27     cid = ctx.call_id
28     readsets[cid].insert(ca)
29   # Check if ca refers to a read-only endpoint and if
30   # the visited services are disjoint with the cache
31   # item subtree
32   if ca.isRO and visited_disjoint(ctx, ca):
33     return cache.get(ca)
34   return None

Figure 5 (Right): cache manager code.

1  # Tracks which keys and calls will invalidate
2  # which cache entries upstream
3  global saved : map(Key | CallArgs, map(Service, CallArgs))
4  # Sequence of calls and invalidations
5  global history : list(Call(CallArgs) | Inv(Key | CallArgs))
6
7  def startHandler(ca):
8    history.append(Call(ca))
9
10 def endHandler(ca, rs, caller, ret, vs):
11   # Checks if there are any invalidations
12   # to the readset since the call start
13   if empty([for Inv(k) in history.invs_after(Call(ca))
14             if k in rs]):
15     SendToCM(caller, Save(ca, ret, vs))
16   saved.store(rs, ca, caller)
17
18 def invHandler(k):
19   match type(k):
20     case Key:
21       history.append(Inv(k))
22     case CallArgs:
23       history.extend([Inv(ca) for ca in k])
24   # Inform CMs of same-service shards
25   SendToShardCMs(Inv(k))  # (see Sec. 4.1)
26   # Ask all affected callers to invalidate
27   affected = saved.pop(k)
28   for caller, cas in affected:
29     SendToCM(caller, Inv(cas))
30
31 def saveHandler(ca, ret, vs):
32   save_visited(ca, vs)
33   cache.set(ca, ret)

FIGURE 5—(Left) The wrapper code of the protocol that intercepts the start of request processing, returns, writes, reads, and calls. (Right) The cache manager code that processes work queue items sent by the wrappers and other cache managers.
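To make the backward history scan in endHandler concrete, the following is a small runnable Python sketch of that check. It is only an illustration: the actual implementation is in Go (§6), and the Call/Inv classes, the safe_to_save helper, and the example call strings are assumptions made for this example, not names from the system.

# Minimal sketch of the cache manager's end-of-request check (cf. endHandler in
# Figure 5): a completed read-only call ca may be cached upstream only if none
# of the keys in its readset was invalidated after the call started.
from dataclasses import dataclass

@dataclass(frozen=True)
class Call:
    ca: str          # hashed call arguments

@dataclass(frozen=True)
class Inv:
    key: str         # invalidated datastore key (or call arguments)

def safe_to_save(history: list, ca: str, readset: set) -> bool:
    """Scan the history backwards and stop at the Call(ca) that started the request."""
    for event in reversed(history):
        if isinstance(event, Call) and event.ca == ca:
            return True                      # reached the call start: no conflict found
        if isinstance(event, Inv) and event.key in readset:
            return False                     # a dependency was invalidated mid-call
    return False                             # call start not found: be conservative

if __name__ == "__main__":
    h = [Call("getTimeline(u1)"), Inv("post:42"), Call("getProfile(u1)")]
    print(safe_to_save(h, "getTimeline(u1)", {"post:42"}))   # False: invalidated mid-call
    print(safe_to_save(h, "getProfile(u1)", {"post:42"}))    # True: the write preceded this call

The scan stops as soon as it reaches the Call(ca) that started the request, which is why its cost is bounded by the number of in-flight calls and recent invalidations, as discussed above.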
FIGURE 6—A bug that would occur if invalidate messages were allowed to overtake saves. (Events: (1) Call(ca), (2) Return v, (3) End(ca, …), (4) Inv(ca), (5) Cache.delete(ca), (6) Cache.save(ca) -> v.)

FIGURE 7—Possible imprecisions in invalidation. The three lines represent two service threads processing requests and the cache manager. (Legend: (r) Read(k), (w) Write(k, v), (s) Start(ca), (e) End(ca, …), (i) Inv(k).)
Saving a new cache entry. A naive method of saving a new entry involves the caller immediately saving it to the cache upon the result's arrival, rather than awaiting an explicit Save message from the callee's cache manager. This is not correct, as it allows the bug shown in Figure 6 where the invalidation message by the S2 cache manager "overtakes" the save done by S1, leading to the cache entry never being invalidated. Thus, it is necessary for Invs and Saves to not be reordered. MuCache achieves that by issuing them sequentially through the cache manager.
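The race and its fix can be sketched in a few lines of Python; the event tuples and the FIFO queue below are illustrative stand-ins for MuCache's actual messages and channels, not its implementation.

# Sketch of the reordering bug in Figure 6 and why MuCache routes Save and Inv
# through the same ordered channel.
from collections import deque

def apply(cache: dict, event):
    kind, ca, *rest = event
    if kind == "Save":
        cache[ca] = rest[0]
    elif kind == "Inv":
        cache.pop(ca, None)

# Buggy: the caller saves immediately on return, so the invalidation (sent
# concurrently by the downstream cache manager) can be applied first.
buggy = {}
for e in [("Inv", "read(k)"), ("Save", "read(k)", "v_stale")]:
    apply(buggy, e)
print(buggy)    # {'read(k)': 'v_stale'} -> a stale entry that is never invalidated

# MuCache: both messages are issued sequentially by the downstream cache
# manager over one FIFO queue, so the Save is always applied before the Inv
# that logically follows it.
ordered, fifo = {}, deque([("Save", "read(k)", "v_stale"), ("Inv", "read(k)")])
while fifo:
    apply(ordered, fifo.popleft())
print(ordered)  # {} -> the stale entry is removed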
Invalidating an entry. Invalidations are triggered when a key used in a cached result is modified. Naively, the cache manager could track the exact order of all reads and writes to precisely track invalidations. Since requests are being processed concurrently, this would require coordination across different service threads, which would significantly slow down request processing along the critical path. MuCache relaxes the tracking of reads and writes in two ways that do not jeopardize correctness, but reduce the synchronization overhead. First, all reads of a request are gathered by the wrappers (preRead) and are only sent to the cache manager at the end of the request (the rs argument in the End message). To ensure correctness, the cache manager then assumes that all reads happened at the start of the call, considering the call invalid if a write happened in its duration even if it happened before the reads (Fig. 7, Left). Second, writes are intercepted in a non-atomic fashion after they have been completed (postWrite). This could allow for a call to start and complete in between the actual write and postWrite, leading to its cached response being unnecessarily invalidated (Fig. 7, Right).
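As a companion to invHandler in Figure 5, the following runnable Python sketch shows how an inverted index like saved drives the invalidation fan-out; the map contents, key names, and service names are invented for this example.

# Sketch of the invalidation fan-out driven by the saved inverted index:
# a modified key maps to the upstream callers whose cached results depend on it.
saved = {
    "price:h1": {"Frontend": {"search(Paris)"}, "Recommender": {"suggest(u1)"}},
}

def on_write(key, send_inv):
    affected = saved.pop(key, {})
    for caller, call_args in affected.items():
        send_inv(caller, call_args)          # ask each upstream cache to drop its entries

on_write("price:h1", lambda svc, cas: print(f"Inv -> {svc}: {sorted(cas)}"))
# Inv -> Frontend: ['search(Paris)']
# Inv -> Recommender: ['suggest(u1)']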
Evicting an entry. There are two types of evictions in MuCache. First, a cache could fill up with entries and needs to evict an entry to make space for new ones; in this case, the eviction is safe without any additional work since the protocol is robust to re-invalidations (i.e., it is safe to invalidate a cache entry that was previously evicted). Second, the cache manager might need to reclaim space if it is keeping track of the dependencies of many calls. It reclaims space by evicting a key or call from its saved dependencies and consequently sends invalidation messages to all affected calls upstream as if the key were invalidated (see inv(k) in Figure 5).

Garbage collection. The cache manager has two state components that grow during execution: (1) the history and (2) the dependencies. It keeps the history bounded by removing completed calls when processing an End request, adding minimal overhead. The protocol preserves correctness in the presence of multiple pending calls with the same arguments by removing the latest occurrence of a call start (potentially overapproximating the duration of the other calls). When the cache manager reaches a memory limit, it deletes some of its saved dependencies as long as it informs the upstream caches to evict relevant entries (similarly to a normal invalidation). The current implementation evicts dependencies following an LRU policy, though other choices could be used.
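A minimal Python sketch of the history garbage collection described above; the data layout and helper name are illustrative and not taken from the actual Go implementation.

# When End(ca) is processed, the most recent Call(ca) entry is dropped from the
# history so the history does not grow without bound. With several pending calls
# that share the same arguments, removing the latest occurrence overapproximates
# the older calls' duration, which may cause extra (but never missing) invalidations.
def gc_history(history, ca):
    for idx in range(len(history) - 1, -1, -1):          # scan backwards
        kind, payload = history[idx]
        if kind == "Call" and payload == ca:
            del history[idx]                              # drop the latest Call(ca)
            break
    return history

h = [("Call", "getHotel(h1)"), ("Inv", "price:h1"), ("Call", "getHotel(h1)")]
print(gc_history(h, "getHotel(h1)"))
# [('Call', 'getHotel(h1)'), ('Inv', 'price:h1')]  -- the earlier pending call keeps its entry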
Sharding. MuCache supports sharded service deployments by attaching a cache manager to each shard; the only requirement being that read-only calls with the same arguments are always processed by the same shard (e.g., by load balancing these calls based on a hash of the call arguments). This guarantees that a single cache manager is the sole authority for the invalidations of each read-only call, ensuring that they will be the only ones to send cache-save and cache-invalidate messages for that call. The only change in the protocol is that a cache manager processing an invalidate due to a write needs to broadcast it to all cache managers of the other shards of the same service, so that they can invalidate their relevant calls (see L.25 in Figure 5). It is important to note that broadcasts only happen upon users' writes; transitive invalidations propagated upstream do not trigger broadcasts. Broadcasting out of the critical path is safe because, similarly to the single-shard protocol, overapproximating the write duration might lead to additional invalidations but not fewer. MuCache, therefore, does not add any latency overhead on the request processing critical path to support sharded services.
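A sketch of the affinity requirement, assuming a simple hash-based router; the paper does not prescribe a specific hash function, so CRC32 below is just a placeholder.

# Read-only calls are routed by a hash of their call arguments so that exactly
# one shard (and its cache manager) is the authority for each read-only call;
# writes may be dispatched to any shard.
import zlib

NUM_SHARDS = 4

def shard_for(ca: str) -> int:
    return zlib.crc32(ca.encode()) % NUM_SHARDS   # stand-in for the real hash

print(shard_for("search(Paris, 2024-05-01)"))     # the same ca always maps to the same shard
print(shard_for("search(Paris, 2024-05-01)"))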
Handling Dynamic Call-graphs. Microservice applications can exhibit a diamond pattern (Figure 4) where a request performs multiple subrequests to the same service through its lifetime. In such applications naive caching could lead to executions that cannot be observed without caches. MuCache addresses this by keeping track of the visited services in two locations. First, each request keeps visited services in its context (ctx.visited); whenever a subrequest ca returns, the parent request adds all the visited services of the subrequest (ca.visited) to its own visited services (ctx.visited). Second, when saving a cache entry for call ca, the cache manager also stores the services, S', that were visited during the processing of ca. Before checking the cache, the wrapper checks whether the downstream service has ever visited a service in S' that has also been visited by the current request (visited_disjoint(ctx, ca)); if so, it does not retrieve the return value from the cache to preserve correctness. MuCache tracks visited services using a binary encoding that keeps its size small—less than 1 KB for 1000 services.
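One way to picture the visited-service check is with a bitmask per request. The paper only states that a compact binary encoding is used (under 1 KB for 1000 services), so the encoding below is an assumption made for illustration.

# Sketch of visited-service tracking with a bitmask: each service gets a bit index.
SERVICE_BITS = {"S1": 0, "S2": 1, "S3": 2, "S4": 3}

def encode(services):
    mask = 0
    for s in services:
        mask |= 1 << SERVICE_BITS[s]
    return mask

def visited_disjoint(ctx_visited: int, entry_visited: int) -> bool:
    """True iff the cached entry's subtree shares no service with the
    services already visited by the current request."""
    return (ctx_visited & entry_visited) == 0

# Diamond pattern from Figure 4: after S1 -> S2 -> S4 (write), the request has
# visited {S2, S4}; a cached result of S3 that depends on S4 must not be used.
ctx = encode({"S2", "S4"})
cached_s3_entry = encode({"S3", "S4"})
print(visited_disjoint(ctx, cached_s3_entry))   # False -> skip the cache and call S3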
5 Protocol Correctness

To demonstrate the correctness of MuCache, we show that clients cannot differentiate a MuCache-enabled application from the original without caches. We give semantics to microservice applications (with and without caches) using observable execution events and traces. Events are indivisible actions (steps) that can be performed by a microservice application; examples of events include reading from a key in the datastore and receiving a response from a completed subrequest. An application can be uniquely described by the set of traces (event sequences) that can be observed in it. Two traces are said to be equivalent modulo reordering when all events in one trace exist in the other trace but potentially in a different order. Reorderings are necessary for our correctness theorem to allow reads and writes to proceed concurrently (as in Figure 3). In this section, we informally describe three assumptions that are central to our formal development; the first two hold for all microservice applications, and the last one is a requirement of MuCache. We then state our main theorem and give the high-level intuition for the proof. The complete formal development and proof can be found in Appendix A, which is available in the supplementary material.

(A1) Always enabled requests. Requests in a microservice application only block when waiting for subrequests that they have invoked to finish executing, and there is no blocking communication across independent requests. In other words, if a trace can be observed in an application, then we can pick and execute any pending request, or any of its subrequests, until it produces an execution event, and the new trace will also be part of the application's set of traces.

(A2) Reordering independent events. Two events are dependent when the first event affects the execution of the second: some examples include two events that are part of the same request, or a write and a read event to the same key in a service datastore. The complete definition of dependent events is given in Appendix A. We assume that due to multithreading, independent events commute; that is, reordering any two consecutive independent events in an application trace results in a trace that can also be observed by the application.

(A3) Linearizable datastores. We assume that the datastore of each service is linearizable [26]: operations on an object take place atomically, in an order consistent with the operations' real-time order. For instance, if a write completes before a read begins, then the read must observe the effects of the write and complete after it. This is necessary due to the requirement that MuCache does not modify the underlying datastore and can only observe writes to the datastore before or after they are completed. If we were to use a non-linearizable datastore, a write could take effect after it returns, making it impossible to track which calls it invalidates.

Theorem 1 (Protocol Correctness). For all traces in a cache-enabled application, there exists a trace in the original application without caches, such that all the client events in the two traces are equivalent modulo reordering.

Proof intuition. To show the theorem, we prove a stronger lemma, namely that for all cache-enabled traces, we can construct an original trace where (1) all request subtraces are the same (modulo the missing requests due to cache hits), and (2) the application state is the same at the end of both traces. The proof proceeds by induction on the length of traces and has three phases: (1) given a trace in the cache-enabled application that ends with a cache-hit, it uses assumption (A2) to move writes that happened before the cache-hit but would later invalidate its entry to the end of the trace (together with their dependencies); (2) it then uses the inductive hypothesis to construct a trace in the original application for the prefix up to the cache-hit; and (3) it uses assumption (A1) to fill in all subrequest events that are missing due to the cache-hit, and then it fills in the writes and all their dependencies (A3), ending up with a trace that satisfies the requirement.
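As a toy illustration of the correctness condition (not part of the paper's formalism), the following Python sketch checks whether two ad-hoc traces contain the same client events, ignoring their order; the event encoding is invented for this example.

# Two traces are "equivalent modulo reordering" on client events if they contain
# exactly the same client events, possibly in a different order.
from collections import Counter

def client_events(trace):
    return [e for e in trace if e[0] in ("Req", "Ret") and e[1].startswith("client")]

def equivalent_modulo_reordering(t1, t2) -> bool:
    return Counter(client_events(t1)) == Counter(client_events(t2))

cache_enabled = [("Req", "client1", "compose"), ("Req", "client2", "view"),
                 ("Ret", "client2", "timeline_v1"), ("Ret", "client1", "ok")]
original      = [("Req", "client2", "view"), ("Ret", "client2", "timeline_v1"),
                 ("Req", "client1", "compose"), ("Ret", "client1", "ok")]
print(equivalent_modulo_reordering(cache_enabled, original))  # True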
6 Implementation

The MuCache implementation comprises roughly 2k LoC of Go [12], including the wrappers that intercept invocations and state accesses, and the cache manager that makes invalidation and saving decisions. Communication between wrappers and the cache manager happens with ZeroMQ [16] and between cache managers with HTTP. Our current implementation uses Redis [9] as the cache, but any in-memory store could be used in its place. We use the 32-bit FNV-1a [11] algorithm to compute the hash values of call arguments.

Batching. Cache managers instruct their upstream counterparts to save or invalidate cache entries by sending HTTP requests that might become a bottleneck when the load is high. To increase throughput at high loads without affecting correctness, MuCache allows batching requests that are sent upstream. At low loads, batching increases the time it takes for an invalidation to propagate through the system based on the batching timeout, which is currently set to 1 ms. Batching also enables the simplification of upstream requests by canceling out operations at the sender, i.e., invalidates and saves override previous invalidates and saves on the same key. This reduces the size of requests and the number of operations upstream cache managers have to process, while incurring minimal cost since it requires a single pass over the batch.
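The cancellation step can be sketched as a single pass over the outgoing batch; the tuple encoding of Save/Inv operations below is illustrative, not the wire format.

# Later Save/Inv operations on the same call arguments override earlier ones,
# so only the last operation per key is sent upstream.
def coalesce(batch):
    last = {}
    for op, ca, *payload in batch:        # single pass over the batch
        last[ca] = (op, ca, *payload)     # a later Save/Inv overrides an earlier one
    return list(last.values())

pending = [("Save", "search(Paris)", "v1"),
           ("Inv",  "search(Paris)"),
           ("Save", "search(Tokyo)", "v2")]
print(coalesce(pending))
# [('Inv', 'search(Paris)'), ('Save', 'search(Tokyo)', 'v2')]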
General support. MuCache is designed to not be limited to a single communication protocol, cache, or datastore. Our wrappers are built on top of Dapr [3], a service mesh extended to also support state accesses through its API. Dapr supports custom middlewares that can be used to intercept invocations and state accesses. It also provides a common abstraction for many service communication protocols and different storage backends, allowing us to implement our wrappers once and inherit support for all the alternatives.

Dependencies between client requests. MuCache's caching protocol treats client requests as independent and allows them to be reordered, processing reads and writes from different clients without synchronization. However, this might not always be desirable, e.g., when a client request expects to see the effects of a previous request. To support this, we extend MuCache's dependencies (Sec. 4) to client requests. Specifically, when a client request is complete, visited services are included in the result and passed to the subsequent request of the same client (if one is performed), allowing MuCache to avoid violating dependencies across client requests.

Supporting third-party services. Microservice applications often perform requests to third-party services that might not be extensible with MuCache, e.g., if they are owned by a different organization. To support such applications, MuCache allows declaring requests to third-party services as read-only using a TTL, saving their values to the cache on return, but invalidating them when the TTL has passed instead of waiting for a downstream cache manager. This setup provides caching benefits with at least as strong guarantees as if all the caches in the application were configured with a TTL; however, for the complete subtrees of the microservice graph that are MuCache-enabled the guarantees are stronger.
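A minimal sketch of this TTL fallback for a third-party call, assuming an illustrative 30-second TTL and a hypothetical do_request callback; neither is taken from the paper.

# The wrapper stores a third-party result with an expiry and treats expiry as the
# only form of invalidation, since no downstream cache manager exists.
import time

class TTLEntry:
    def __init__(self, value, ttl_seconds):
        self.value = value
        self.expires_at = time.monotonic() + ttl_seconds

    def valid(self):
        return time.monotonic() < self.expires_at

cache = {}
def call_third_party(ca, do_request, ttl=30.0):
    entry = cache.get(ca)
    if entry and entry.valid():
        return entry.value                    # cache hit within the TTL window
    value = do_request(ca)                    # real call to the external service
    cache[ca] = TTLEntry(value, ttl)          # invalidated only by expiry
    return value

print(call_third_party("geocode(Philadelphia)", lambda ca: {"lat": 39.95, "lon": -75.17}))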
7 Evaluation

Our evaluation aims to answer these high-level questions:

• (Q1) Throughput and latency benefits: Does MuCache provide throughput and latency benefits compared to other caching alternatives? Does it scale with sharding? How do cache sizes affect its performance? How are its benefits affected by the application call-graph? (§7.3)
• (Q2) Costs: What are the costs of deploying MuCache? What is its CPU and memory usage, total network costs, and its latency overhead on the critical path? Does the cache manager throughput become a bottleneck? (§7.4)
• (Q3) Invalidation: How fast can MuCache invalidate cache entries? (§7.5)

Before we answer these questions, we describe the experimental setup (§7.1) and our methodology and baselines (§7.2).
7.1 Experimental Setup

We deploy a Kubernetes [7] cluster on CloudLab [2] m510 machines that have 8-core 2.0 GHz CPUs, 64 GB RAM, 256 GB NVMe SSDs, and 10 Gb NICs. Machines run Ubuntu 20.04. The average round-trip time between servers is 0.15 ms. Except for sharding experiments, we utilize a single Kubernetes cluster where the number of worker nodes is equal to the number of services, plus one node acting as a control plane. Each service is deployed via Dapr [3] and is affinitized to a single node. We use Redis [9] configured with an LRU eviction policy as the cache. Unless otherwise noted, MuCache is configured with a sending batch size of 20 and a 1 ms timeout. Cache manager dependencies are stored in an LRU cache, with the maximum number of entries being proportional to the user cache size. In our experiments, the cache manager stores 100 dependencies per 1 MB of user cache (e.g., a user cache of 20 MB allows the storage of 2,000 dependencies).

FIGURE 8—Real-world applications used in our evaluation.
  Benchmark          Services  LoC    RO/NonRO  Sources
  1 SocialMedia      6         532    90/10     [10, 24, 32]
  2 MovieReview      12        913    90/10     [13, 24]
  3 HotelRes         6         608    80/20     [24]
  4 OnlineBoutique   9         1,088  75/25     [8]

FIGURE 9—Shapes of synthetic benchmark call-graphs.

7.2 Applications, Method, and Baselines

Throughout our evaluation we perform experiments on four open-source microservice applications, as detailed in Figure 8, along with four synthetic ones. Workloads are adapted from the original testbeds, including the dataset and request distribution. Cache sizes are set relative to the application working data set; small enough that they do not fit the entire working data set but big enough so that there is a non-negligible amount of cache hits.

SocialMedia. A social network application (Cf. Twitter or Facebook) that provides three main endpoints: viewing a user's homepage timeline (RO), viewing a user's personal timeline (RO), and composing a post. The workload ratio is 60% homepage, 30% user timeline, and 10% compose post. The cache size for each service is set to 20 MB. When there are no new posts and each timeline contains 10 posts, the total cacheable posts are around 20 MB.

MovieReview. A movie review application (Cf. IMDB or Rotten Tomatoes) that offers two main endpoints: viewing the page of a movie (RO) and creating a review. The workload ratio is 90% viewing a page and 10% creating reviews. The cache size for each service is set to 70 MB.

HotelRes. A hotel reservation application (Cf. Booking or Airbnb) that offers two main endpoints: searching for hotels in a specific area (RO) and making a reservation. The workload ratio is 80% searching for hotels and 20% making a reservation. The cache size for each service is set to 20 MB.

OnlineBoutique. An online store application (Cf. Amazon or Walmart) that offers multiple endpoints: retrieving the store homepage (RO), updating the currency rate, viewing a product (RO), adding a product to the cart, and checking out. The workload ratio is 75% read-only (homepage, viewing products, and carts) and 25% non-read-only (updating the currency, updating the cart, checking out). The cache size for each service is set to 80 MB.

Synthetic Benchmarks. Figure 9 shows four synthetic applications: ProxyApp, a two-service app where a stateless frontend forwards requests to the backend, which in turn reads/writes to a key-value store; and three applications that extend ProxyApp with archetype call-graph patterns—chain, fan-out, and fan-in. ChainApp has four stateless services and a stateful backend. FanoutApp has a single frontend forwarding requests to four backends. FaninApp has four separate frontends, each forwarding requests to one backend.

Method. We measure throughput and latency (median and 95th percentile) using the wrk2 [15] HTTP benchmarking tool. Experiments include a 30-second cache pre-warming period, followed by a 60-second testing period. Each experiment is run three times, and the average is reported. We run MuCache and the baselines with the same CPU resources; that is, MuCache's cache managers are not given extra cores but share resources with the application.

Baselines. We compare MuCache to the following baselines.

BC (Backend Cache): A baseline that lacks inter-service caching and only caches data from the backend datastore.

TTL: A baseline that reflects the current best practices for automated inter-service caching [1, 27, 33]. Caching occurs at both the backend and intermediate services. Upon invocation, the caller saves the result in the cache asynchronously without communicating with any cache manager. The caches can then evict an entry when they become full or, in the case of an inter-service cache, when a configured time-to-live (TTL) timer has expired. Cached data can be inconsistent and arbitrarily stale (depending on the TTL and access pattern).

TTL-∞: A special case of TTL that serves as an upper bound on the performance achievable by TTL implementations; cache entries never expire and are only evicted when the cache reaches maximum capacity.

7.3 (Q1) Throughput and Latency Benefits

We first measure the throughput and latency of a set of real-world applications with and without MuCache (§7.3.1). We then compare it against different TTL baselines (§7.3.2), we evaluate whether it limits throughput scalability in the presence of sharding (§7.3.3), and we evaluate whether configuring caches with different sizes and whether different application call-graphs affect MuCache's benefits (§7.3.4–7.3.5).
FIGURE 10—Latency (50th/95th percentile, ms) vs. request rate (rps) for HotelRes, MovieReview, SocialMedia, and OnlineBoutique under BC, MuCache, and TTL-∞.

We evaluate MuCache's benefits on throughput and latency on the four open-source microservice applications. We compare MuCache against (1) BC to evaluate performance benefits over not having inter-service caches, and (2) TTL-∞ to evaluate how close MuCache is to an implementation that caches results but provides no consistency guarantees.

Results. Figure 10 shows the results, where the X-axis is request rate, and the Y-axis shows latency in ms. MuCache reduces median latency by up to 1.8× in HotelRes, 2.5× in MovieReview, 1.5× in SocialMedia, and 2.1× in OnlineBoutique. The tail latency between MuCache and BC is similar, except for OnlineBoutique, where MuCache reduces tail latency by up to 1.8× by avoiding many invocations from the Checkout service, such as retrieving product information, getting shipping quotes, etc. Furthermore, MuCache improves throughput by 1.6× in HotelRes, 1.5× in MovieReview, and 1.4× in SocialMedia, while achieving similar throughput in OnlineBoutique. Compared to TTL-∞, MuCache's median latencies are up to 1.2× higher before saturation, and MuCache's throughput is around 0.95×.

Take away. MuCache outperforms BC in terms of median and tail latency, and throughput across all workloads. MuCache also performs close to the upper bound TTL-∞. Improvements in median latency can be attributed to cache hits, while improvements in throughput are due to lower utilization of backend services.

7.3.2 Comparison with TTL baselines

Tuning TTL values for caches in real systems is complex and depends on the application requirements; suggested values could range from seconds to hours [18, 23]. To simulate that in a shorter experiment, we vary TTL from 100 ms to 10 s—values under 100 ms lead to negligible cache hits, and a TTL of 10 s is already a large fraction of the total experiment (60 s).

FIGURE 11—HotelRes: Latency and throughput of MuCache compared with various TTLs (TTL-0.1s, TTL-1s, TTL-10s; 50th/95th percentiles).

Results. Figure 11 shows the results. As the TTL increases from 0.1 to 10 s, median latency drops from 18.2 ms to 10.9 ms, tail latency drops from 29.3 ms to 10.9 ms, and throughput increases from 2,489 to 3,470 rps. MuCache outperforms TTL-1s (1.3× lower median latency), but is outperformed by TTL-10s (which performs similarly to TTL-∞).

Take away. Getting comparable performance to MuCache with a TTL-based caching approach requires setting the TTL to a high value (>1 s)—orders of magnitude higher than the MuCache invalidation times (on the order of ms per call-graph depth as shown in Section 7.4.3). Furthermore, finding an appropriate TTL value is challenging for developers, as this value has implications for the correctness of the application. In contrast, MuCache requires no tuning of expiration times, and invalidations happen automatically and correctly.

7.3.3 Sharding Scalability

We evaluate the scalability of MuCache by deploying SocialMedia to multiple shards. We provision a fixed pool of machines and restrict each shard to a fixed CPU usage of 2 cores (1 running the service, 1 running the Dapr sidecar) to have multiple shards on a single machine. Each shard is deployed with its own cache manager. We compare against BC to determine whether MuCache limits scalability.

Results. Figure 12 shows the maximum throughput of SocialMedia when deployed using 1, 2, and 4 shards, with and without MuCache. MuCache scales as well as BC (achieving 1.44×, 1.38×, and 1.37× the throughput of BC).

Take away. MuCache does not limit scalability for sharded applications, as the only cost occurs in the background, when the cache manager of a shard broadcasts received writes to all cache managers that belong to the same service's shards.
FIGURE 12—Throughput of MuCache and the BC baseline when sharding the services in SocialMedia (1, 2, and 4 shards).

FIGURE 13—HotelRes: Impact of different cache sizes (16 MB to 1024 MB) on latency (left Y-axis) and combined cache hit rate (right Y-axis) for MuCache and TTL-∞.

FIGURE 14—Latency and throughput of the graph shape microbenchmarks (Fig. 9): Chain, Fanout, and Fanin (BC and MuCache, 50th/95th percentiles).

FIGURE 15—Cache manager state and cache size for each service.
  Benchmark          Average (MB)  Max (MB)  Cache Size (MB)
  1 HotelRes         0.08          0.27      20
  2 MovieReview      0.07          0.31      70
  3 SocialMedia      0.02          0.09      20
  4 OnlineBoutique   0.1           0.45      80

7.3.4 Cache size effect

To evaluate how MuCache responds to the cache size of each service, we measure latency and cache hits on HotelRes with a fixed load of 1K req/s while varying the cache size from 16 MB to 1024 MB. TTL-∞ acts as an upper-bound baseline.

Results. Figure 13 shows the results. Increasing the cache size lowers the median latency of MuCache from 9.9 ms to 8.2 ms and tail latency from 22 ms to 13.6 ms; it also increases the cache hit rate from 5% to 91%. Similarly, in TTL-∞, the median latency decreases from 9.9 ms to 7.3 ms, tail latency from 21.6 ms to 10.6 ms, and cache hit rate from 5% to 100%.

Take away. Caching with MuCache reduces mean and tail latency. Furthermore, the reductions achieved by MuCache are close to those achieved by TTL-∞ across all cache sizes.

7.3.5 Application call-graph effect on performance

To evaluate how the application call-graph pattern affects the benefits of MuCache, we use the three synthetic applications in Figure 9. We use a synthetic workload with 50% cache hit rates and compare against BC.

Results. Figure 14 shows the results. For ChainApp, MuCache's median latency is 2.6–3.1× lower than that of BC, while its tail is comparable before reaching saturation. Its maximum throughput is 1.5× higher. For FanoutApp, the median latency and maximum throughput of MuCache are similar to those of BC, but its tail latency is up to 1.6× lower. In FaninApp, MuCache improves median latency by 1.1–1.3× and 95th percentile latency by up to 1.9×; maximum throughput is 1.75× higher than BC.

Take away. MuCache provides different benefits depending on the call-graph shape. For long call-chains MuCache reduces latency by avoiding network hops; for fan-out it slightly improves tail latency but not median latency since the frontend has to wait for the slowest path to respond; and when the backend is the bottleneck it improves throughput by reducing the number of requests that reach the backend.

7.4 (Q2) MuCache costs and overheads

In order to evaluate the costs of MuCache, we measure its CPU, memory, and network usage (§7.4.1), its latency overhead on the critical path (§7.4.2), and the cache manager's throughput and whether it can be a bottleneck (§7.4.3).

7.4.1 Memory / CPU / Network costs

We evaluate MuCache's memory cost on all four applications and its CPU and network usage on HotelRes. We evaluate MuCache's network usage by measuring data transfer between nodes using iftop. We measure the memory cost of each cache manager instance as the average size of its state (history and dependencies) and CPU cost as the average CPU usage of each service during the experiment. We use standard cache sizes and load (2K req/s for HotelRes, 2.5K req/s for MovieReview, 1K req/s for SocialMedia, and 3.5K req/s for OnlineBoutique) for 300 seconds.

Results. Figure 15 shows the cache manager state size and the cache size across services. The average size of the CM state across services ranges from 0.1–0.4% of the cache size per service. Garbage collection plays an important role in keeping the memory usage low: without GC, the CM state in HotelRes goes up to 5 MB in 1 minute. Figure 16 shows the average CPU usage of each service during the experiment. Usage is broken down between the service logic, the Dapr sidecar, and the cache manager. The average CPU usage across services with and without MuCache is 4.2 and 5.1 cores respectively. The average CM CPU usage across services is 0.5 cores.
The average network usage per service without MuCache is 9.0 MB/s, while the average with MuCache is 6.6 MB/s, of which cache managers use 2.9 MB/s.

FIGURE 16—Per-service CPU usage (cores) in HotelRes with the Baseline and with MuCache.

Take away. Memory costs are low compared to the cache size (<0.4% on average). The CPU usage of MuCache is 13% of the total service CPU on average while at the same time reducing the total CPU usage of the whole application due to some backend services being less utilized because of cache hits in the frontend. Though cache managers use some bandwidth to save/invalidate caches, MuCache reduces the total network usage by 27% due to local cache hits.

7.4.2 MuCache latency overhead

We evaluate MuCache's latency overhead by focusing on ProxyApp, which performs minimal work, to measure the worst-case overhead. We create a synthetic workload with 0% and 60% cache hit rates and compare against (1) BC to evaluate the overhead over no caches when there are no hits, and (2) TTL-∞ to evaluate the wrapper overhead.

FIGURE 17—Latency distribution w.r.t. hit rate for ProxyApp (BC, MuCache, TTL-∞; percentiles 0.2–0.99). Solid and dashed lines show the latencies when the hit rate is 0% and 60%, respectively. Split at the 70th percentile for clarity.

Results. Figure 17 shows the complete request latency distribution. We report overheads as absolute values because they are constant and independent of the work that the services do. For a hit rate of 0%, MuCache's median latency (4 ms) is 0.5 ms higher than BC and 0.3 ms higher than TTL-∞, while the 95th percentile (5.7 ms) is 0.9 ms and 0.5 ms higher, respectively. When the hit rate is 60%, MuCache's median and 95th percentile latencies are 0.15 ms and 0.5 ms higher than TTL-∞. When the hit rate is 60%, MuCache's median latency is 1.4 ms better than BC (3.5 ms to 2.1 ms).

Take away. Even in a worst-case scenario (an application that …

7.4.3 MuCache's throughput

To determine whether MuCache's cache manager can be a bottleneck, we measure its maximum throughput on ProxyApp and load the backend's cache manager directly because the backend service becomes the bottleneck otherwise. The load is 80% read-only requests, and we vary the batch size of the HTTP sending buffer between cache managers.

FIGURE 18—Cache manager throughput for batch sizes of 1, 2, 5, 10, 20, and 50.

Results. Figure 18 shows the throughput in terms of the number of events the cache manager processes per second. Without batching, the cache manager has a throughput of ∼19K events per second, while gradually increasing the batch size up to 20 improves it to ∼75K events per second.

Take away. The cache manager has a reasonably high throughput and is not the bottleneck even for an application with minimal computation. To further increase throughput, developers may deploy multiple shards for each service.

7.5 (Q3) Invalidation time

We evaluate the time needed for invalidations to reach the root of the call-graph, namely the frontend service, by measuring the observed inconsistency window [19], the elapsed time between the write happening in the backend and the invalidation becoming visible in the frontend. The invalidation time in our experiment is determined solely by the depth of the call graph. To measure the increase in invalidation time per hop, we conducted experiments on a microservice chain consisting of 2 to 5 services, which represents the typical depths of call-graphs in the applications that we studied.

Results. Figure 19 shows the results. For a two-service application, the invalidation time is ∼4 ms; for a five-service application it is ∼10 ms. Each additional service in the chain increases invalidation time by ∼2.2 ms.

Take away. MuCache's invalidation time is ∼2.2 ms per call-graph hop—orders of magnitude smaller than the typical invalidation times observed in TTL-based approaches (which range from seconds to hours [18, 23]).

8 Related Work

Caching in microservice applications. Several works study cache usage in real-world microservices, including work from Alibaba [28], Twitter [38], and Facebook [37]. These papers
confirm that caches are heavily used in microservice applications and provide significant performance benefits, but only mention manual, ad-hoc, or inconsistent coherence schemes and do not propose an automatic way to manage these caches.

Caching frameworks for web services. There is a lot of work on caching frameworks for web services for both static and dynamic data. These frameworks focus on three key aspects: (1) content admission, (2) cache size management, and (3) invalidation and data freshness (for a more detailed classification see a recent survey [29]). The first two aspects are orthogonal to our work since we do not focus on optimizing the performance of a cache given a specific workload, but rather propose a general system for keeping caches coherent in a microservice setting. To the best of our knowledge, all frameworks that focus on invalidation (e.g., [21, 22, 31]) are designed as a single cache layer on top of a database without taking into account inter-service caching.

Cache coherence protocols. There is extensive literature on cache coherence protocols (see survey [34]), none of which considers inter-service caching. Lazy caching [17] exploits the fact that writes do not always require exclusivity (M or E in MOESI [35]), allowing cores to perform concurrent buffered writes, albeit blocking reads to ensure that dependencies are not violated. Our work extends this insight by avoiding all blocking communication on the request's critical path—allowing writes downstream without immediately informing the upstream caches and without blocking on reads.

Incremental computation. Caches are also used to enable incremental and reactive computation: some examples include Reactive Caching [20], Noria [25], and Diamond [39]. Reactive Caching proposes caches for graphs of single-threaded services to support reactive computation, i.e., writes downstream are propagated upstream to refresh the results. Noria is an incremental stream processing engine that uses caches for fast propagation of updates in a dataflow. Both differ from our work in two ways: (1) they only provide eventual consistency that violates dependencies when there are multiple paths between two services (see Fig. 4); and (2) they do not support true multi-threading, as Noria limits writes to a single thread and Reactive Caching only supports single-threaded services. Diamond is a system that automates data management for distributed reactive applications by providing reactive transactions to clients. Similarly to MuCache, Diamond reactively informs clients about data invalidations in the backend store, but in contrast to MuCache it does not support service graphs.

…would be more challenging since caches should not violate transactional guarantees, which would require additional synchronization in the protocol. Supporting non-KV stores, such as relational databases, would require monitoring the dependencies of read-only calls and determining when to invalidate cache entries, which could be done by leveraging the expressive semantics of SQL (as in the case of Noria [25]).

Supporting weaker consistency datastores. The correctness of MuCache depends on the datastores being linearizable; MuCache needs to be sure that after a write has completed, it has taken effect in the database. Being able to determine the order of reads and writes by intercepting the datastore accesses is necessary so that MuCache is database-agnostic (see requirements in Section 3). Supporting weaker-consistency datastores would likely require a more intrusive design with modifications to a datastore—tightly integrating wrappers in the store to provide additional metadata to the cache managers about the precise order of reads and writes—forfeiting the generality of being database-agnostic.

Application debuggability. Extending an application with MuCache provides performance benefits and does not affect the application behavior, but adds complexity to the end-to-end deployment and therefore increases the effort required to maintain and debug it. This is an inherent software engineering challenge—the bigger a codebase is, the harder it is to maintain. A direction for future work that could help address this is to integrate MuCache with existing distributed tracing and debugging tools for microservices, so that engineers have visibility into MuCache's state and actions.

Write-intensive workloads. Even though a service might offer a read-only endpoint, its workload might be write-intensive, leading to overheads without the accompanying benefits if extended with MuCache. Developers can currently manually detect such cases and avoid declaring those endpoints as read-only, but it would be interesting to explore whether MuCache can be extended with an adaptive monitoring mechanism that only enables caching if the read-write ratio of a service is above some threshold.

Sharding. MuCache requires hard affinity sharding of read requests to ensure correctness, i.e., all read-only calls with the same arguments need to be processed by the same shard. Write requests have no such limitation and can be dispatched to any shard. An interesting avenue for future research would be to lift the requirement for hard affinity, allowing for more flexible load balancing and autoscaling.
References

[1] Caching Guidance - Azure Architecture Center. https://learn.microsoft.com/en-us/azure/architecture/best-practices/caching.
[2] CloudLab - A testbed for cloud computing research. https://www.cloudlab.us/.
[3] Dapr - Distributed Application Runtime. https://dapr.io/.
[4] Envoy Proxy. https://www.envoyproxy.io/.
[5] From Monolith to Microservices: How to Scale Your Architecture. https://www.youtube.com/watch?v=N1BWMW9NEQc.
[6] Istio Service Mesh. https://istio.io/latest/about/service-mesh/.
[7] Kubernetes - An open-source container orchestration system. https://kubernetes.io/.
[8] Online Boutique – Microservices Demo. https://github.com/GoogleCloudPlatform/microservices-demo.
[9] Redis - An open-source in-memory data store. https://redis.io/.
[10] Rutgers Social Network Graph. https://networkrepository.com/socfb-Rutgers89.php.
[11] The FNV Non-Cryptographic Hash Algorithm. https://datatracker.ietf.org/doc/html/draft-eastlake-fnv-17.html.
[12] The Go programming language. https://go.dev/.
[13] The Movie Database. https://www.themoviedb.org/.
[14] Twitter's recommendation algorithm. https://github.com/twitter/the-algorithm.
[15] wrk2: A constant throughput, correct latency recording variant of wrk. https://github.com/giltene/wrk2.
[16] ZeroMQ - An open-source universal messaging library. https://zeromq.org/.
[17] Yehuda Afek, Geoffrey Brown, and Michael Merritt. Lazy caching. In ACM Transactions on Programming Languages and Systems (TOPLAS), 1993.
[18] AWS. Caching Best Practices. https://aws.amazon.com/caching/best-practices/, 2023.
[19] David Bermbach and Stefan Tai. Eventual consistency: How soon is eventual? An evaluation of Amazon S3's consistency behavior. In Workshop on Middleware for Service Oriented Computing (MW4SOC), 2011.
[20] Sebastian Burckhardt and Tim Coppieters. Reactive caching for composed services: polling at the speed of push. In Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages and Applications (OOPSLA), 2018.
[21] K. Selçuk Candan, Wen-Syan Li, Qiong Luo, Wang-Pin Hsiung, and Divyakant Agrawal. Enabling dynamic content caching for database-driven web sites. In Proceedings of the ACM SIGMOD Conference (SIGMOD), 2001.
[22] Jim Challenger, Arun Iyengar, and Paul Dantzig. A scalable system for consistently caching dynamic web data. In Proceedings of the IEEE International Conference on Computer Communications (INFOCOM), 1999.
[23] Cloudflare. Edge and Browser Cache TTL. https://developers.cloudflare.com/cache/how-to/edge-browser-cache-ttl/, 2023.
[24] Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, Kelvin Hu, Meghna Pancholi, Yuan He, Brett Clancy, Chris Colen, Fukang Wen, Catherine Leung, Siyuan Wang, Leon Zaruvinsky, Mateo Espinosa, Rick Lin, Zhongling Liu, Jake Padilla, and Christina Delimitrou. An open-source benchmark suite for microservices and their hardware-software implications for cloud & edge systems. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2019.
[25] Jon Gjengset, Malte Schwarzkopf, Jonathan Behrens, Lara Timbó Araújo, Martin Ek, Eddie Kohler, M. Frans Kaashoek, and Robert Tappan Morris. Noria: Dynamic, partially-stateful data-flow for high-performance web applications. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2018.
[26] Maurice P. Herlihy and Jeannette M. Wing. Linearizability: A correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems (TOPLAS), 12(3), July 1990.
[27] Joydip Kanjilal. Scaling microservices architecture using caching. https://www.developer.com/design/scaling-microservices-using-cache/, 2021.
[28] Shutian Luo, Huanle Xu, Chengzhi Lu, Kejiang Ye, Guoyao Xu, Liping Zhang, Yu Ding, Jian He, and Chengzhong Xu. Characterizing microservice dependency and performance: Alibaba trace analysis. In Proceedings of the ACM Symposium on Cloud Computing (SOCC), 2021.
[29] Jhonny Mertz and Ingrid Nunes. Understanding application-level caching in web applications: A comprehensive introduction and survey of state-of-the-art approaches. In ACM Computing Surveys (CSUR), 2017.
[30] Rajesh Nishtala, Hans Fugal, Steven Grimm, Marc Kwiatkowski, Herman Lee, Harry C. Li, Ryan McElroy, Mike Paleczny, Daniel Peek, Paul Saab, et al. Scaling Memcache at Facebook. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2013.
[31] Dan R. K. Ports, Austin T. Clements, Irene Zhang, Samuel Madden, and Barbara Liskov. Transactional consistency and automatic management in an application data cache. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2010.
[32] Ryan A. Rossi and Nesreen K. Ahmed. The network data repository with interactive graph analytics and visualization. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2015.
[33] Irfan Saleem, Pallavi Nargund, and Peter Buonora. Data caching across microservices in a serverless architecture. https://aws.amazon.com/blogs/architecture/data-caching-across-microservices-in-a-serverless-architecture/, 2008.
[34] Per Stenstrom. A survey of cache coherence schemes for multiprocessors. In IEEE Computer, 1990.
[35] Paul Sweazey and Alan Jay Smith. A class of compatible cache consistency protocols and their support by the IEEE Futurebus. In ACM SIGARCH Computer Architecture News (SIGARCH), 1986.
[36] Alex Xu. Twitter architecture 2022 vs. 2012. What's changed over the past 10 years?, Nov 2022.
[37] Yuehai Xu, Eitan Frachtenberg, Song Jiang, and Mike Paleczny. Characterizing Facebook's Memcached Workload. In IEEE Internet Computing, 2013.
[38] Juncheng Yang, Yao Yue, and K. V. Rashmi. A large scale analysis of hundreds of in-memory cache clusters at Twitter. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2020.
[39] Irene Zhang, Niel Lebeck, Pedro Fonseca, Brandon Holt, Raymond Cheng, Ariadna Norberg, Arvind Krishnamurthy, and Henry M. Levy. Diamond: Automating data management and storage for wide-area, reactive applications. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016.
[40] Zhizhou Zhang, Murali Krishna Ramanathan, Prithvi Raj, Abhishek Parwal, Timothy Sherwood, and Milind Chabbi. CRISP: Critical path analysis of large-scale microservice architectures. In Proceedings of the USENIX Annual Technical Conference (ATC), 2022.
A Detailed Protocol Correctness

Preliminaries. We start with some basic notation:

• nS denotes a service name.

• ca denotes the arguments of a service call, including the name of the service (which can be extracted using name(ca)) and the endpoint.

• i ∈ R denotes request identifiers (each request has a unique i). The service name and the arguments of the request can be extracted using name(i) and ca(i). We will define a binary relation sr ⊆ R × R that determines when a request is spawned by another request. We can also define sr∗ as the reflexive transitive closure of sr. There is also a client(i) predicate, which returns true for requests that are initiated by a client.

• v denotes a return value.

• k denotes a key that indexes values in the state of a service.

• rs(i, t) is a function that returns all of the keys that a particular request (and all of its subrequests) have read in trace t. We will often omit t when it is obvious which trace we refer to.

Events and traces. We describe microservice applications and their executions using traces, i.e., sequences of events that describe application actions. We are only interested in events that describe interactions between services and actions on their states. We call the set of all events Σ, and we now define all events in it.

• Reqi(ca) denotes the start of processing of a single request with id i and arguments ca.

• Reti(v) denotes that the request with id i has finished processing and is returning value v.

• Readi(k, v) denotes that the request with id i performed a read of key k from its state and the read returned v.

• Writei(k, v) denotes that the request with id i performed a write of value v to key k of its state.

• Calli(ca, i′) denotes that the request with id i performed a call to another service with arguments ca, and that the request id of that internal request is i′.

• Respi(v, i′) denotes that the request with id i received a response with value v from a finished call with id i′.

We represent the set of all events for a request with identifier i as Σi, and the sets of all read and write events as ΣR and ΣW. We also define the set of output events ΣO = {Reti(v) : ∀i, v} ∪ {Readi(k, v) : ∀i, k, v} ∪ {Writei(k, v) : ∀i, k, v} ∪ {Calli(ca, i′) : ∀i, ca, i′}, which are the events determined by the program when processing a single request, and the set of input events ΣI = {Reqi(ca) : ∀i, ca} ∪ {Respi(v, i′) : ∀i, v, i′}, which are the events given as inputs to the processing of a single request. Finally, we define the set of client events ΣC = {Reqi(v) : ∀i, client(i)} ∪ {Reti(v) : ∀i, client(i)}.

We can now describe complete executions of microservice applications using traces t, i.e., sequences of the above events. We can project all events of a trace t that belong to a particular set Σ′ using t[Σ′]; e.g., t[ΣW] is the sequence of all write events in a trace. Note that this projection produces an ordered sequence of events by maintaining the trace order.
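For concreteness, the following is a minimal sketch of how this event vocabulary and the projection t[Σ′] could be encoded. It is ours, not part of the MuCache artifact, and all class and function names (Event, Req, project, and so on) are illustrative assumptions.

# Illustrative encoding of the event vocabulary and of trace projection t[Σ'].
# These names are ours, not from the MuCache implementation.
from dataclasses import dataclass
from typing import Any, List, Tuple, Type

@dataclass(frozen=True)
class Event:
    i: str                     # request identifier

@dataclass(frozen=True)
class Req(Event):              # Req_i(ca): request i starts with call arguments ca
    ca: Any

@dataclass(frozen=True)
class Ret(Event):              # Ret_i(v): request i finishes and returns v
    v: Any

@dataclass(frozen=True)
class Read(Event):             # Read_i(k, v): request i reads key k and observes v
    k: str
    v: Any

@dataclass(frozen=True)
class Write(Event):            # Write_i(k, v): request i writes v to key k
    k: str
    v: Any

@dataclass(frozen=True)
class Call(Event):             # Call_i(ca, i'): request i calls another service
    ca: Any
    callee: str                # i', the identifier of the spawned subrequest

@dataclass(frozen=True)
class Resp(Event):             # Resp_i(v, i'): request i receives v from call i'
    v: Any
    callee: str

Trace = List[Event]

def project(t: Trace, kinds: Tuple[Type[Event], ...]) -> Trace:
    """t[Σ']: keep only events of the given kinds, preserving the trace order."""
    return [e for e in t if isinstance(e, kinds)]

def project_request(t: Trace, i: str) -> Trace:
    """t[Σ_i]: all events that belong to request i, in trace order."""
    return [e for e in t if e.i == i]

With this encoding, t[ΣW] from the text is simply project(t, (Write,)).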
Applications and Assumptions. We can now define the behavior of a microservice application P ∈ 𝒫 using its execution traces, ⟦P⟧ ⊆ Σ∗, and state some assumptions on these traces. First of all, an application determines the processing of each request using the step : 𝒫 × R × Σ∗ × (ΣO ∪ {⊥}) relation, which determines the next step of the processing of a request, or ⊥ if the request is waiting for a response or has not started yet. We now define what it means for a trace to be well-formed.

Property 2 (Well-formed traces). All traces t ∈ ⟦P⟧ are well-formed, i.e., for each trace t the following properties hold: (1) Reqi(ca) is the first event of any request i and Reti(v) is the last; (2) for each i ∈ t there exists a unique Reqi(ca) and at most one Reti(v); (3) a Reqi(ca) always comes after a Calli(ca, i′), except in the case of client requests; (4) a Respi(v, i′) always comes after a Calli(ca, i′) and Reti′(v); (5) for all Calli(ca, i′), sr(i, i′); and for all prefixes t′ = t0.e with e ∈ ΣI, either step(P, i, t0, e) or step(P, i, t0, ⊥); (6) for all e ∈ ΣC for any i ∈ t s.t. client(i) holds, ∄Calli′(ca′, i) ∈ ΣC.

The last requirement relates the step relation with the traces, i.e., each event in the trace is the result of stepping a request, or is a request start or response. We also know that the events in a trace are equivalent up to an injective renaming of request identifiers.

Property 3. For any microservice application P, for all traces t ∈ ⟦P⟧, for all i ∈ t, and for any i′ ∉ t, we can construct a new trace t′ = t[i ↦ i′] s.t. t′ ∈ ⟦P⟧.

In addition to the above, we also know that requests are always enabled in microservice applications, i.e., a pending request can always take a step.

Definition 1 (Pending Requests). We say that a request Reqi(ca) is pending in a trace t iff Reti(v) does not exist in t.

Property 4 (Request Step Always Enabled). For any microservice application P, for all traces t ∈ ⟦P⟧, and for all pending requests Reqi(ca) for some i, there exists a trace t′ ∈ ⟦P⟧ such that t′ = t.ei, where ei ∈ Σi′ and sr∗(i, i′).
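Several of these well-formedness clauses can be checked mechanically on a finite trace. The sketch below, again ours and built on the illustrative encoding above, checks only clauses (1), (2), and (4) of Property 2.

# Partial well-formedness check (clauses (1), (2), and (4) of Property 2).
# Assumes the Event/Req/Ret/Call/Resp encoding and project_request from the earlier sketch.
def well_formed(t: Trace) -> bool:
    for i in {e.i for e in t}:
        evs = project_request(t, i)
        reqs = [e for e in evs if isinstance(e, Req)]
        rets = [e for e in evs if isinstance(e, Ret)]
        # (1) Req_i(ca) is the first event of request i; Ret_i(v), if present, is the last.
        if not isinstance(evs[0], Req) or (rets and not isinstance(evs[-1], Ret)):
            return False
        # (2) There is a unique Req_i(ca) and at most one Ret_i(v).
        if len(reqs) != 1 or len(rets) > 1:
            return False
    # (4) A Resp_i(v, i') only appears after the matching Call_i(ca, i') and after Ret_{i'}(v).
    seen_calls, seen_rets = set(), set()
    for e in t:
        if isinstance(e, Call):
            seen_calls.add((e.i, e.callee))
        elif isinstance(e, Ret):
            seen_rets.add(e.i)
        elif isinstance(e, Resp):
            if (e.i, e.callee) not in seen_calls or e.callee not in seen_rets:
                return False
    return True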
Property 4 means that requests are always enabled to take a step, sometimes through their subrequests. This is a valid assumption for microservice applications since they are multithreaded, and therefore a single request cannot block other requests from proceeding, and a request can only block while waiting for a response from its subrequests. Note that this assumption requires that the network does not drop requests, i.e., that calls eventually lead to request starts and that returns eventually lead to response events.

We also know that the values of read events depend on the latest write to the same key or on the original value.

Property 5 (Read return value). For all applications P, request identifiers i, and traces t s.t. step(P, i, t, Readi(k, v)) holds, either ∃i′, Writei′(k, v) = last(t[ΣW(k)]) or v = ⊥.

Intuitively, this means that writes are immediately visible to reads, i.e., that the underlying stores are linearizable, which is a valid assumption for most key-value stores.

We can now define read-only calls, that is, calls that never perform writes (even in their subrequests).

Definition 2 (Read-only requests). Given an application P, a request with request identifier i and call arguments ca, i.e., ca(i) = ca, is read-only for this application iff for all traces t ∈ ⟦P⟧ and for all i′ such that sr∗(i, i′), it holds that t[ΣW ∩ Σi′] = ∅. We define a predicate RO(i) that holds for read-only requests.

State. We represent the state of an application as σ ∈ D. Concretely, a state σ is a tuple of maps from keys to values, one for each service. We define the function S : Σ∗ → D that returns the state of an application after the trace t. Due to Property 5, the state at each point in the execution depends on the prefix of write events and the starting state. We assume that all executions start from the same starting state σ0.
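Because of Property 5, S(t) can be computed by replaying only the write events of t starting from σ0. Below is a small sketch under the same illustrative encoding; the helper name, which maps a request identifier to the service that owns its datastore, mirrors name(i) in the text and is an assumption of the sketch.

# S(t): the state after a trace is determined by σ0 and the prefix of write events.
# Uses the Trace/Write types from the earlier sketch; `name` is an assumed helper
# mapping a request identifier to its service, mirroring name(i) in the text.
from typing import Any, Callable, Dict

State = Dict[str, Dict[str, Any]]     # service name -> (key -> value)

def state_after(t: Trace, sigma0: State, name: Callable[[str], str]) -> State:
    sigma = {svc: dict(kv) for svc, kv in sigma0.items()}
    for e in t:
        if isinstance(e, Write):
            sigma.setdefault(name(e.i), {})[e.k] = e.v
    return sigma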
Caching. Up to this point we have established all important properties of microservice applications without mentioning caches. A cache-enabled application P̃ can be similarly defined by its execution traces, ⟦P̃⟧ ⊆ Σ̃∗, where Σ̃ is a superset of the set of events of applications without caches, i.e., Σ ⊆ Σ̃. The additional cache-related events are defined as follows:

• CacheHiti(ca, v) denotes a cache-hit that replaces a Respi′(v, i) for some i′ (also conforming to its well-formedness conditions, Property 2).

• Save(nS, i, v) denotes that the cache of service nS has saved the value v for request i with call arguments ca(i).

• Inv(nS, i, i′) denotes an invalidation of the cache of service nS with ca(i′) from a write with identifier i.

Essentially, a cache-enabled application P̃ is a transformation of a regular microservice application P. We know that our protocol does not affect the stepping of requests other than allowing some calls to return immediately with call hits. We can also lift the step relation to account for cache-enabled applications. The lifted step relation describes the logic of our cache coherence protocol.

Property 6 (Cache Stepping). For any application P, the transformed P̃ can step, i.e., step(P̃, i, t, e) holds, if

• step(P, i, t, e) when e ∈ Σ, or

• e = Save(nS, i′, v) and ∃Reti′(v) ∈ t, or

• e = Inv(nS, i′, i′′) and ∃Writei′(k, v) ∈ t with k ∈ rs(i′′), or

• e = Inv(nS, i′, i′′) and ∃Inv(nS, i′, i′′′) ∈ t with ca(i′′′) ∈ rs(i′′), or

• e = CacheHiti(ca, v) and ∃Save(name(i), i′, v) ∈ t and ∄Inv(name(i), i′′, i′′′) ∈ t with ca = ca(i′) = ca(i′′′) between the save and the cache-hit.

Intuitively, Property 6 means that the cache-enabled application does not affect the next steps of any specific request other than sometimes finding a result in the cache.

Definition 3 (Dependency). We say that event e′ ∈ Σi′ is a dependency of e ∈ Σi in a trace t if e′ is after e and either:

• i = i′, i.e., the two events are part of the same request;

• e = Calli(ca, i′) and e′ = Reqi′(ca) with i ≠ i′ and sr∗(i, i′), i.e., the second event is part of a subrequest of the first event;

• e = Reti(v) and e′ = Respi′(v, i), i.e., the events are a pair of return and handled response;

• e = Writei(k, v) and e′ = Readi′(k, v′), or e = Writei(k, v) and e′ = Writei′(k, v′), or e = Readi(k, v) and e′ = Writei′(k, v′), i.e., read and write events to a key k are dependencies of a prior write to k, and write events are dependencies of a prior read;

• e = Reti(v) and e′ = Save(nS, i, v) for some nS;

• e = Writei(k, v) and e′ = Inv(nS, i, i′) for some nS;

• e = Inv(nS, i′, i′′′) and e′ = Inv(nS, i′, i′′) with ca(i′′′) ∈ rs(i′′);

• e = Save(name(i), i′, v) and e′ = CacheHiti(ca, v) where ca(i′) = ca;

• e = Save(nS, i′, v) and e′ = Inv(nS, i, i′′) for some i and ca(i′) = ca(i′′).

We will use deps(e) to refer to all the transitive dependencies of an event e.
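As an aside, the clauses of Definition 3 that involve only plain application events can be written as a predicate over the illustrative encoding used above; sr_star is an assumed oracle for the sr∗ relation, and deps(e) is the forward transitive closure of this predicate (the cache-event clauses would be added analogously).

# Sketch of part of the dependency relation (Definition 3) and of deps(e).
# Only the clauses over plain application events are covered; sr_star(i, j) is an
# assumed oracle for sr*(i, j). Builds on the Event encoding from the earlier sketch.
from typing import Callable, Set

def is_dependency(e: Event, e2: Event, sr_star: Callable[[str, str], bool]) -> bool:
    if e.i == e2.i:                                            # same request
        return True
    if isinstance(e, Call) and isinstance(e2, Req) \
            and e.callee == e2.i and sr_star(e.i, e2.i):       # call -> start of subrequest
        return True
    if isinstance(e, Ret) and isinstance(e2, Resp) and e2.callee == e.i:
        return True                                            # return -> matching response
    if isinstance(e, Write) and isinstance(e2, (Read, Write)) and e.k == e2.k:
        return True                                            # W->R and W->W conflicts
    if isinstance(e, Read) and isinstance(e2, Write) and e.k == e2.k:
        return True                                            # R->W conflicts
    return False

def deps(t: Trace, e: Event, sr_star: Callable[[str, str], bool]) -> Set[Event]:
    """Transitive dependencies of e among the events that follow it in t."""
    out: Set[Event] = set()
    for e2 in t[t.index(e) + 1:]:
        if is_dependency(e, e2, sr_star) or any(is_dependency(d, e2, sr_star) for d in out):
            out.add(e2)
    return out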
We now state a final assumption on application traces, namely that two independent events can be commuted.

Property 7 (Commute independent events). For any trace t ∈ ⟦P⟧ with t = t0.e.e′.t1 and e′ ∉ deps(e), the trace t′ = t0.e′.e.t1 can also be observed by the application, i.e., t′ ∈ ⟦P⟧.

This holds because in microservice applications independent requests do not affect each other except through reads and writes to the same key in the same service datastore.
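The correctness proof below repeatedly uses Property 7 to commute an event past a later event that is not among its dependencies, in particular to move writes together with their dependencies towards the end of a trace. This reordering can be sketched as a trace transformation; the sketch is ours and reuses is_dependency from the previous sketch.

# Property 7 as a trace transformation: adjacent independent events may be swapped,
# and a set of events (e.g., writes plus their dependencies) may be pushed to the end.
# Reuses Trace, Event, and is_dependency from the earlier sketches.
from typing import Set

def commute_adjacent(t: Trace, idx: int, sr_star) -> Trace:
    e, e2 = t[idx], t[idx + 1]
    assert not is_dependency(e, e2, sr_star), "cannot commute dependent events"
    return t[:idx] + [e2, e] + t[idx + 2:]

def push_to_end(t: Trace, targets: Set[Event], sr_star) -> Trace:
    """Bubble every event in `targets` past later independent non-target events."""
    t = list(t)
    for e in [x for x in t if x in targets]:
        idx = t.index(e)
        while idx + 1 < len(t) and t[idx + 1] not in targets \
                and not is_dependency(e, t[idx + 1], sr_star):
            t = commute_adjacent(t, idx, sr_star)
            idx += 1
    return t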
We are now ready to state the main theorem that describes the correctness of our protocol.

Theorem 8 (Protocol Correctness (corresponds to Theorem 1)). For all traces t in a cache-enabled application ⟦P̃⟧, there exists a trace t′ in the original application without caches ⟦P⟧, such that their respective client events are equivalent (but potentially reordered), i.e., ∀i, t[ΣC(i)] = t′[ΣC(i)].

This makes sense because correctness is only relevant from the perspective of the clients and not all of the internal events that an application performs. In fact, the cache implementation does not contain the same traces, because some calls return immediately on cache-hits without triggering all the internal events. In order to prove this theorem, we show that something stronger holds, a lemma that is stated below. Before stating it, we need to define what it means for an event in the cache-enabled event set to be equivalent to an original event.

Definition 4 (Equivalent events). Equivalence between a cache-enabled event ec and an original event e (denoted with ec ≃ e) is defined as follows:

• e ∈ Σ and ec ∈ Σ are the same event, or

• ec = CacheHiti(ca(i′), v) and e = Respi(v, i′).

We lift the equivalence relation on events to sequences of events in a straightforward way.

Lemma 1. Given an arbitrary trace t ∈ ⟦P̃⟧, we can construct a trace t′ ∈ ⟦P⟧ such that (i) the states at the end of the two traces are the same, i.e., S(t) = S(t′), and (ii) for all i, t[Σi] ≃ t′[Σi] modulo the events that are missing due to cache-hits.

At a high level, the proof proceeds by constructing t′ from t by filling in the missing events and by moving some write events later in the trace. Theorem 8 follows directly from Lemma 1 since client events will be the same in both traces.

Proof sketch. We proceed by induction on the size of traces, and for the inductive case we focus on the only interesting scenario, in which the trace t ends with a cache-hit event CacheHiti(ca, v), because these are the only events for which the effects of our cache subsystem are observed by the rest of the application. For illustrative purposes we extend traces with the state of all services σn between each event.

t = t0|σn.CacheHiti(ca, v)

For this cache-hit to have happened, the step relation implies that there must exist some Save(nS, i′, v) before it, such that name(i′) = nS. Similarly, for the cache save to have happened, there must have been a completed request with call arguments ca = ca(i′).

t = ···.Reqi′(ca)|σ1.···.Reti′(v)|σ2.···.Save(nS, i′, v)|σ3.···|σn.CacheHiti(ca, v)

Given Property 2 (extended in a straightforward way to support cache events), we know that t[Σi′] can be produced by the step relation. The inductive hypothesis and the fact that t is finite ensure the equivalence of the traces even in the presence of cache-hits for subrequests of the original request. We now do a case analysis on the existence of a Write(k, v1) with k ∈ rs(i′) between Reqi′(ca) and CacheHiti(ca, v).

No such write exists. If no such write exists, then σ1|rs(i′) = σ2|rs(i′) = ... = σn|rs(i′). Then, we can construct a trace t1 ∈ ⟦P⟧ using the inductive hypothesis and by replacing CacheHiti(ca, v) with Calli(ca, i′′) for some fresh i′′ (due to Property 6).

t1 = ...|σn.Calli(ca, i′′)

Then, given that σ1|rs(i′) = σn|rs(i′) and that Properties 5 and 3 hold, we can construct the same request steps tc as in the original trace (t[Σi′][i′ ↦ i′′]) using the step relation, ending up with a trace t2 ∈ ⟦P⟧ such that:

t2 = t1.tc.Respi(v, i′′)

Since CacheHiti(ca, v) ≃ Respi(v, i′′) and read-only requests do not modify the state, we are done with this case.

Write exists. We now focus on the case where a write Write(k, v1) with k ∈ rs(i′) exists between Reqi′(ca) and CacheHiti(ca, v). We first show that the write lies between Save(nS, i′, v) and CacheHiti(ca, v): if it were earlier, it would have been processed by the cache manager, prohibiting Save(nS, i′, v) from happening. However, there could be an invalidation between the cache save and the cache-hit that originated from an earlier write in another service, between Reqi′(ca) and Save(nS, i′, v). We can now use Property 7 to move all writes together with their dependencies to the end of the trace to get a trace tw.

tw = ···.Save(nS, i′, v)|σ3.···|σn.CacheHiti(ca, v).···.twd

where twd contains all the writes and their dependencies. This is possible because CacheHiti(ca, v) is not a dependency of the writes between the save and the cache-hit; if it were, there must have been a subcall to the service where the write happened, which would have been caught by our dependency tracking (see Section 4). Second, no i′ events are dependencies of the writes between Reqi′(ca) and Save(nS, i′, v), because (1) i′ is read-only (so it cannot have performed those writes or subcalls that performed those writes), and (2) the
FIGURE 20—A bug that would occur if preReqStart does not wait until the Start event is added to the CM workqueue.
We then construct the original trace that caused the save using the step relation (as in the no-write-exists case) to obtain a trace t2. Finally, given that both the prefixes and the states are the same for tw and t2, we can use Property 4 to step all the writes and their dependencies, obtaining exactly the same events as the suffix of tw. This proves that the final states are the same and that the traces of each request are equivalent.
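As a final illustration, the client-observable equivalence guaranteed by Theorem 8 is easy to check on finite traces: for every client request, the cache-enabled trace and the cache-free trace must contain the same client events in the same per-request order. A test-harness-style sketch of this check (ours), where is_client mirrors the client(i) predicate and is an assumed helper:

# Sketch of the check behind Theorem 8: both traces agree on the Req/Ret events of
# every client request (requests from different clients may still be interleaved
# differently). Builds on the Event encoding from the earlier sketches.
from typing import Callable

def client_equivalent(t_cached: Trace, t_plain: Trace,
                      is_client: Callable[[str], bool]) -> bool:
    client_ids = {e.i for e in t_cached + t_plain if is_client(e.i)}
    return all(
        [e for e in t_cached if e.i == i and isinstance(e, (Req, Ret))]
        == [e for e in t_plain if e.i == i and isinstance(e, (Req, Ret))]
        for i in client_ids
    )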