Amazon Aurora: On Avoiding Distributed Consensus For I/Os, Commits, and Membership Changes
1 INTRODUCTION
IT workloads are increasingly moving to public cloud providers such as AWS. Many of these workloads require a relational database. Amazon Relational Database Service (RDS) provides a managed service that automates database provisioning, operating system and database patching, backup, point-in-time restore, storage and compute scaling, instance health monitoring, failover, and other capabilities. Our experience managing hundreds of thousands of

[Figure 1: Why are 6 copies necessary? A volume spread across three Availability Zones (AZ 1, AZ 2, AZ 3), with a 4/6 write quorum and a 3/6 read quorum; the quorum survives the failure of an AZ.]

Quorum models, such as the one used by Aurora, are rarely used in high-performance relational databases, despite the benefits they provide for availability, durability, and the reduction of latency jitter. We believe this is because the underlying distributed algorithms typically used in these systems – two-phase commit (2PC), Paxos commit, Paxos membership changes, and their variants – can be expensive and incur additional network overheads. The commercial systems we have seen built on these algorithms may scale well, but have order-of-magnitude worse cost, performance, and peak-to-average latency than a traditional relational database running on a single node against local disk.
In this paper, we show how Aurora leverages only quorum I/Os, locally observable state, and monotonically increasing log ordering to provide high performance, non-blocking, fault-tolerant I/O, commits, and membership changes. We limit our discussion to single-writer databases with read replicas. The approach described below is extensible to multi-writer databases by ordering writes at database nodes and storage nodes, and by using a journal to order operations that span multiple database instances and multiple storage nodes. We describe the following contributions:

(1) How Aurora performs writes using asynchronous flows, establishes local consistency points, uses consistency points

storage node (3) sorts and groups records, (4) gossips with peers to fill in missing records, (5) coalesces them into data blocks, (6) backs them up to Amazon Simple Storage Service (S3), (7) garbage collects backed-up data that will no longer be referenced by an instance, and (8) periodically scrubs data to ensure checksums continue to match the data on disk.

[Figure: storage node processing – log records flow from the primary instance into the storage node's incoming queue, are acknowledged (ACK), and are later garbage collected (GC).]
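To make the division of labor concrete, the sketch below illustrates the gap-filling part of this pipeline, steps (3) and (4) above. It is our own illustration, not Aurora code: the class and method names are hypothetical, and steps (5) through (8) are only noted in comments.

```python
# Illustrative sketch only, not Aurora code. It shows how a storage node could
# detect and fill holes in its log (steps (3) and (4) above) by gossiping with
# the other segments in its protection group; all names here are hypothetical.

class LogRecord:
    def __init__(self, lsn, prev_lsn_in_segment, payload=b""):
        self.lsn = lsn                                   # log sequence number
        self.prev_lsn_in_segment = prev_lsn_in_segment   # backward link, None for the first record
        self.payload = payload


class StorageNode:
    def __init__(self, peers=()):
        self.records = {}          # lsn -> LogRecord, already persisted and ACKed
        self.peers = list(peers)   # other segments in the protection group

    def receive(self, record):
        self.records[record.lsn] = record

    def missing_lsns(self):
        # Step (3): sort and group what we have, using backward links to spot gaps.
        have = set(self.records)
        return sorted({r.prev_lsn_in_segment for r in self.records.values()
                       if r.prev_lsn_in_segment is not None
                       and r.prev_lsn_in_segment not in have})

    def background_pass(self):
        # Step (4): gossip with peers to fill in any records we never received.
        for lsn in self.missing_lsns():
            for peer in self.peers:
                rec = peer.records.get(lsn)
                if rec is not None:
                    self.receive(rec)
                    break
        # Steps (5)-(8) -- coalescing into data blocks, backup to S3, garbage
        # collection, and checksum scrubbing -- are omitted from this sketch.
```

The point of the sketch is that each storage node can detect and repair its own holes purely from state it already holds, gossiping with its peers rather than involving any coordinator.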
Changes to data blocks modify the image in the Aurora buffer cache and add the corresponding redo record to a log buffer. These are periodically flushed to a storage driver to be made durable. Inside the driver, they are shuffled to individual write buffers for each storage node storing segments for the data volume. The driver asynchronously issues writes, receives acknowledgments, and establishes consistency points.

Each log record stores the LSN of the preceding log record in the volume, the previous LSN for the segment, and the previous LSN for the block being modified. The block chain is used by the storage node to materialize individual blocks on demand. The segment chain is used by each storage node to identify records that it has not received and fill in these holes by gossiping with other storage nodes. The full log chain is not needed by an individual storage node but provides a fallback path to regenerate storage volume metadata in case of a disastrous loss of metadata state.

Many database systems boxcar redo log writes to improve throughput. There is a challenge in deciding, with each record, whether to issue the write, to improve latency, or to wait for subsequent records, to improve write efficiency and throughput. Waiting creates performance jitter since early requests entering the boxcar have to wait for later requests or a timeout to fill the request. Jitter is greatest under low load, when the boxcar times out.

In Aurora, there are many segments partitioning the redo log, and the opportunity to boxcar is lower than with a single unsegmented redo log. Aurora handles this by submitting the asynchronous network operation when it receives the first redo log record in the boxcar but continuing to fill the buffer until the network operation executes. This ensures requests are sent without boxcar latency and jitter while packing records together to minimize network packets.

In Aurora, all log writes, including those for commit redo log records, are sent asynchronously to storage nodes, processed asynchronously at the storage node, and asynchronously acknowledged back to the database instance.

2.3 Storage Consistency Points and Commits

A traditional relational database working with local disk would write a commit redo log record, boxcar commits together using group commit, and flush the log to ensure that it has been made durable. When working with remote storage, it might use a two-phase commit, a Paxos commit, or a variant to establish a consistency point, since there is no individual flush operation across all storage nodes. This is heavyweight and introduces stalls and jitter into the write path. Distributed commit protocols also have failure modalities different from those of quorum writes, making it complex to reason about availability and durability.

As a storage node receives new log records, it may locally advance a Segment Complete LSN (SCL), representing the latest point in time for which it knows it has received all log records. More precisely, SCL is the inclusive upper bound on log records continuously linked through the segment chain without gaps. SCL is used by storage nodes as a compact way to identify missing writes when gossiping with their peers in a protection group. Note that, since any given write may be lost for any reason, we need to tolerate missing writes in the storage nodes.

SCL is sent by the storage node as part of acknowledging a write. Once the database instance observes SCL advance at four of six members of the protection group, it is able to locally advance the Protection Group Complete LSN (PGCL), representing the point at which the protection group has made all writes durable. For example, Figure 3 shows a database with two protection groups, PG1 and PG2, consisting of segments A1-F1 and A2-F2 respectively. In the figure, each solid cell represents a log record acknowledged by a segment, with the odd numbered log records going to PG1 and the even numbered log records going to PG2. Here, PG1's PGCL is 103 because 105 has not met quorum, PG2's PGCL is 104 because 106 has not met quorum, and the database's VCL is 104, which is the highest point at which all previous log records have met quorum.

[Figure 3: Storage Consistency Points]

For a database, it is not enough for individual writes to be made durable; the entire log chain must be complete to ensure recoverability. The database instance also locally advances a Volume Complete LSN (VCL) once there are no pending writes preventing PGCL from advancing for one of its protection groups. No consensus is required to advance SCL, PGCL, or VCL – all that is required is bookkeeping by each individual storage node and local ephemeral state on the database instance based on the communication between the database and storage nodes.

This is possible because storage nodes do not have a vote in determining whether to accept a write; they must do so. Locking, transaction management, deadlocks, constraints, and other conditions that influence whether an operation may proceed are all resolved at the database tier. Processing offloaded to the Aurora storage nodes can progress by executing idempotent operations using local state. This also ensures that failed storage nodes can transparently be repaired without involving the database instance.

A commit is acknowledged by the database to its caller once it is able to affirm that all data modified by the transaction has been durably recorded. A simple way to do so is to ensure that the commit redo record for the transaction, or System Commit Number (SCN), is below VCL. No flush, consensus, or grouping is required.
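The following sketch shows how this bookkeeping might look in code. It is our illustration rather than Aurora's implementation: the function names are hypothetical, and it hard-codes the 4/6 write quorum and the two-protection-group layout from the Figure 3 example above.

```python
# Illustrative sketch, not Aurora's implementation. It derives PGCL and VCL from
# per-segment SCLs and checks whether a commit can be acknowledged; the 4/6
# quorum and the PG1/PG2 layout follow the Figure 3 example above.

WRITE_QUORUM = 4  # out of 6 segments per protection group

def pgcl(segment_scls, group_lsns):
    """Protection Group Complete LSN: highest LSN of this group such that it and
    every earlier LSN of the group has been acknowledged by a write quorum."""
    complete = 0
    for lsn in sorted(group_lsns):
        acks = sum(1 for scl in segment_scls if scl >= lsn)
        if acks < WRITE_QUORUM:
            break
        complete = lsn
    return complete

def vcl(pg_pgcls, pg_lsns):
    """Volume Complete LSN: highest LSN such that all prior log records,
    whichever protection group they were sent to, have met quorum."""
    owner = {lsn: pg for pg, lsns in pg_lsns.items() for lsn in lsns}
    complete = 0
    for lsn in sorted(owner):
        if pg_pgcls[owner[lsn]] < lsn:
            break
        complete = lsn
    return complete

def can_ack_commit(scn, volume_complete_lsn):
    # A commit may be acknowledged once its commit record (SCN) is at or below VCL.
    return scn <= volume_complete_lsn

# Figure 3 example: odd LSNs go to PG1, even LSNs to PG2.
pg_lsns = {"PG1": [101, 103, 105], "PG2": [102, 104, 106]}
scls = {"PG1": [105, 103, 103, 103, 101, 101],   # SCLs reported by segments A1..F1
        "PG2": [104, 104, 104, 104, 102, 102]}   # SCLs reported by segments A2..F2
pg_pgcls = {pg: pgcl(scls[pg], pg_lsns[pg]) for pg in pg_lsns}
print(pg_pgcls)                                     # {'PG1': 103, 'PG2': 104}
print(vcl(pg_pgcls, pg_lsns))                       # 104
print(can_ack_commit(103, vcl(pg_pgcls, pg_lsns)))  # True
```

Running the example reproduces the values in the figure: a PGCL of 103 for PG1 and 104 for PG2, a VCL of 104, and a commit with SCN 103 that is safe to acknowledge.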
Aurora must wait to acknowledge commits until it is able to advance VCL beyond the requesting SCN. Typically, this would require stalling the worker thread acting upon the user request. In Aurora, user sessions are multiplexed to worker threads as requests are received. When a commit is received, the worker thread writes the commit record, puts the transaction on a commit queue, and returns to a common task queue to find the next request to be processed. When a driver thread advances VCL, it wakes up a dedicated commit thread that scans the commit queue for SCNs below the new VCL and sends acknowledgements to the clients waiting for commit. There is no induced latency from group commits and no idle time for worker threads.

2.4 Crash Recovery in Aurora

Aurora is able to avoid distributed consensus during writes and commits by managing consistency points in the database instance rather than establishing consistency across multiple storage nodes. But, instances fail. Customers shut them down, resize them, and restore them to older points in time. The time we save in the normal forward processing of commits using local transient state must be paid back by re-establishing consistency upon crash recovery. This is a trade worth making since commits are many orders of magnitude more common than crashes. Since instance state is ephemeral, the Aurora database instance must be able to construct PGCLs and VCL from local SCL state at storage nodes.

[Figure: crash recovery – at crash, the log contains records and gaps; immediately after crash recovery, the Volume Complete LSN (VCL) has been re-established.]

and writes, Aurora increments an epoch in its storage metadata service and records this volume epoch in a write quorum of each protection group comprising the volume. The volume epoch is provided as part of every read or write request to a storage node. Storage nodes will not accept requests at stale volume epochs. This boxes out old instances with previously open connections from accessing the storage volume after crash recovery has occurred. Some systems use leases to establish short-term entitlements to access the system, but leases introduce latency when one needs to wait for expiry. Aurora, rather than waiting for a lease to expire, just changes the locks on the door.

No redo replay is required as part of crash recovery, since segments are able to generate data blocks on their own. Undo of previously active transactions is required, but can occur after the database has been opened, in parallel with user activity.

3 MAKING READS EFFICIENT

Reads are one of the few operations in Aurora where threads have to wait. Unlike writes, which can stream asynchronously to storage nodes, or commits, where a worker can move on to other work while waiting for storage to acknowledge, a thread needing a block not in cache typically must wait for the read I/O to complete before it can progress.

In a quorum system, the I/O required for a read is amplified by the size of the read quorum. Network traffic is far higher since one is reading full data blocks, unlike writes, where Aurora only ships log records. A buffer cache miss in Aurora's quorum model would seem to require a minimum of three read I/Os, and likely five, to mask outlier latency and intermittent unavailability. Read performance in quorum systems compares poorly to traditional replication models where one writes to all copies, enabling a read from just one, though those models have worse write availability.

or by finding the latest durable version of the block in one of the segments of the protection group that it belongs to.

Aurora does not do quorum reads. Through its bookkeeping of writes and consistency points, the database instance knows which segments have the last durable version of a data block and can request it directly from any of those segments. Avoiding the amplification of read quorums does make Aurora subject to latency when storage nodes are down or jitter when they are busy. We manage this by tracking response time from storage nodes for read requests. The database instance will usually issue a request to the segment with the lowest measured latency, but occasionally also query one of the others in parallel to ensure up-to-date read latency response times. If a request is taking longer than expected, Aurora will issue a read to another storage node and accept whichever one returns first. This caps the latency due to slow or unavailable segments. In an active system, this can be done without request timeouts by inspecting the list of outstanding requests when performing other I/Os.
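A minimal sketch of this read path is shown below. It is our illustration and not Aurora code: the segment client, the rolling latency estimate, and the 50 ms hedging threshold are all hypothetical; the only ideas taken from the text are reading from a single up-to-date segment, preferring the lowest observed latency, and issuing a second read when the first is slow.

```python
# Illustrative sketch, not Aurora code: read a block from one up-to-date segment,
# preferring the lowest observed latency, and hedge with a second request if the
# first is slow. Names and the hedge threshold are hypothetical.
import time
import concurrent.futures as futures

class SegmentClient:
    def __init__(self, name, read_fn):
        self.name = name
        self.read_fn = read_fn          # performs the actual network read
        self.avg_latency = 0.010        # rolling latency estimate, in seconds

    def read(self, block_id):
        start = time.monotonic()
        data = self.read_fn(block_id)
        sample = time.monotonic() - start
        self.avg_latency = 0.8 * self.avg_latency + 0.2 * sample  # track response time
        return data

def read_block(block_id, up_to_date_segments, hedge_after=0.050):
    """Issue the read to the lowest-latency segment; if it has not answered within
    hedge_after seconds, issue a second read and accept whichever returns first."""
    ranked = sorted(up_to_date_segments, key=lambda s: s.avg_latency)
    pool = futures.ThreadPoolExecutor(max_workers=2)
    try:
        pending = {pool.submit(ranked[0].read, block_id)}
        done, pending = futures.wait(pending, timeout=hedge_after)
        if not done:                                   # first choice is slow
            if len(ranked) > 1:
                pending.add(pool.submit(ranked[1].read, block_id))
            done, pending = futures.wait(pending, return_when=futures.FIRST_COMPLETED)
        return next(iter(done)).result()
    finally:
        pool.shutdown(wait=False)                      # do not block on a straggler
```

As the text notes, a production system would piggyback the slow-request check on other I/O activity rather than relying on a timer as this sketch does.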
3.2 Scaling Reads Using Read Replicas

Many database systems scale reads by replicating updates from a writer instance to a set of read replica instances. Typically, this involves transporting either logical statement updates or physical redo log records from the writer to the readers. Replication is done synchronously if the replicas are intended as failover targets without data loss, and asynchronously if replica lag or data loss during failover is acceptable.

Both synchronous and asynchronous replication have undesirable characteristics. Synchronous replication introduces performance jitter and failure modalities in the write path. Asynchronous replication introduces data loss on failure of the writer. In both cases, replication takes time to set up, requiring copying the underlying database volume and catching up on active changes. It is also expensive, since it doubles not only the instance costs, but also storage costs. Much of the throughput of the replica instance goes to replicating write activity, not to scaling reads.

Aurora supports logical replication to communicate with non-Aurora systems and in cases where the application does not want physical consistency – for example, when schemas differ. Internally, within an Aurora cluster, we use physical replication. Aurora read replicas attach to the same storage volume as the writer instance. They receive a physical redo log stream from the writer instance and use this to update only data blocks present in their local caches. Redo records for uncached blocks can be discarded, as they can be read from the shared storage volume.

This approach allows Aurora customers to quickly set up and tear down replicas in response to sharp demand spikes, since durable state is shared. Adding replicas does not change availability or durability characteristics, since durable state is independent from the number of instances accessing that state. There is little latency added to the write path on the writer instance since replication is asynchronous. Since we only update cached data blocks on the replicas, most resources on the replica remain available for read requests. And most importantly, if a commit has been marked durable and acknowledged to the client, there is no data loss when a replica is promoted to a write instance – it only needs to run a local crash recovery to align its in-memory state.

3.3 Structural Consistency in Aurora Replicas

Managing structural consistency with asynchronous operations against shared durable state requires care. A single writer has local state for all writes and can easily coordinate snapshot isolation, consistency points for storage, transaction ordering, and structural atomicity. It is more complex for replicas.

Aurora uses three invariants to manage replicas. First, replica read views must lag durability consistency points at the writer instance. This ensures that the writer and reader need not coordinate cache eviction. Second, structural changes to the database, for example B-Tree splits and merges, must be made visible to the replica atomically. This ensures consistency during block traversals. Third, read views on replicas must be anchorable to equivalent points in time on the writer instance. This ensures that snapshot isolation is preserved across the system.

To understand structural consistency on the replica, let us first examine structural consistency on the writer instance, using Aurora MySQL as an example. Each database transaction in Aurora MySQL is a sequence of ordered mini-transactions (MTRs) that are performed atomically. Each MTR is composed of changes to one or more data blocks, represented as a batch of sequenced redo log records to provide consistency of structural changes, such as those involving B-Tree splits. The database instance acquires latches for each data block, allocates a batch of contiguously ordered LSNs, generates the log records, issues a write, shards them into write buffers for each protection group associated with the blocks, and writes them to the various storage nodes for the segments in the protection group. We use an additional consistency point, the Volume Durable LSN (VDL), to represent the last LSN below VCL representing an MTR completion.

Replicas do not have the benefit of the latching used at the writer instance to prevent read requests from seeing non-atomic structural updates. To create equivalent ordering, we ensure that log records are only shipped from the writer instance in MTR chunks. At the replica, they must be applied in LSN order, applied only if above the VDL in the writer as seen in the replica, and applied atomically in MTR chunks to the subset of blocks in the cache. Read requests are made relative to VDL points to avoid seeing structurally inconsistent data.
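The sketch below illustrates this replica-side application rule. It is our illustration and not Aurora code: the record and cache structures are hypothetical, a single lock stands in for the replica's block latching, and the VDL-based gate from the text is abstracted behind a may_apply callback rather than spelled out.

```python
# Illustrative sketch, not Aurora code: apply redo on a read replica in atomic
# MTR chunks, in LSN order, and only to blocks already present in the cache.
import threading

class RedoRecord:
    def __init__(self, lsn, block_id, apply_fn):
        self.lsn = lsn
        self.block_id = block_id
        self.apply_fn = apply_fn        # mutates a cached block image in place

cache_lock = threading.Lock()           # stand-in for per-block latches

def apply_mtr_chunks(mtr_chunks, block_cache, may_apply, last_applied_lsn=0):
    """mtr_chunks: MTRs in LSN order, each an ordered list of RedoRecords.
    may_apply(mtr): the VDL-based gate described in the text above."""
    for mtr in mtr_chunks:
        if not may_apply(mtr):
            break
        assert mtr[0].lsn > last_applied_lsn, "MTR chunks must be applied in LSN order"
        with cache_lock:                # readers never observe a half-applied MTR
            for rec in mtr:
                block = block_cache.get(rec.block_id)
                if block is not None:   # records for uncached blocks are discarded
                    rec.apply_fn(block)
        last_applied_lsn = mtr[-1].lsn
    return last_applied_lsn
```

Reads on the replica are then anchored to VDL points, as the next subsection describes.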
3.4 Snapshot Isolation and Read View Anchors in Aurora Replicas

Once we have ensured that cached replica state is structurally consistent, allowing traversal of physical data structures, we must also ensure it is logically consistent using snapshot isolation.

The redo log seen by a read replica does not carry the state needed to establish SCL, PGCL, VCL, or VDL consistency points. Nor is the read replica in the communication path between the writer and storage nodes to establish this state on its own. Note that VDL advances based on acknowledgements from storage nodes, not redo issuance from the writer. The writer instance sends VDL update control records as part of its replication stream. Although the active transaction list can be reconstructed at the replica using redo records and VDL advancement, for efficiency reasons we ship commit notifications and maintain transaction commit history. Read views at the replica are built based on these VDL points and transaction commit history. Replicas revert active transactions for MVCC using undo, just as on the writer instance.

Since VDL on the replica may lag the writer, Aurora storage nodes must ensure that past values are available to be read. Aurora blocks are written out-of-place and non-destructively. Older versions are not garbage collected until we can assure that neither the writer instance nor any replica might need to access them. We do this by maintaining a Protection Group Minimum Read Point LSN (PGMRPL), representing the lowest LSN read point for any active request on that database instance. A storage node may only advance its garbage collection point once PGMRPL has advanced for all instances that have opened the volume. The storage nodes will only accept read requests between PGMRPL and SCL.

ensuring each transition is reversible. Each membership change to a protection group is associated with a membership epoch, which is monotonically incremented with each change. Membership changes do not block either reads or writes.

Each read or write request from an instance and each gossip request from a peer segment passes in an epoch based on the caller's current understanding of quorum membership. As with volume epochs, clients with stale membership epochs have their requests rejected and must update membership information. An epoch increment requires a write quorum to be met, just as any other write does. The request to increment the membership epoch must pass in the correct membership epoch, just as any other request does. As with our other epochs, membership epochs ensure we can update membership without complex consensus, fence out others without waiting for lease expiry, and operate using the same failure tolerance as quorum reads and writes themselves.

[Figure: membership epochs for segments A-F; Epoch 1: all nodes healthy.]
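A minimal sketch of the epoch check is shown below. It is our illustration rather than Aurora code: the class, the exception, and the way membership is stored are all hypothetical; the behavior it captures is simply that every request carries the caller's epoch, stale epochs are rejected, and an epoch increment is itself an epoch-checked write.

```python
# Illustrative sketch, not Aurora code: membership-epoch fencing at a segment.

class StaleEpoch(Exception):
    """The caller must refresh its view of quorum membership and retry."""

class Segment:
    def __init__(self, members=("A", "B", "C", "D", "E", "F")):
        self.membership_epoch = 1
        self.members = list(members)

    def _check(self, request_epoch):
        if request_epoch < self.membership_epoch:
            raise StaleEpoch(self.membership_epoch)   # no waiting on lease expiry

    def read(self, block_id, request_epoch):
        self._check(request_epoch)
        return ("block", block_id)                    # placeholder for the real read

    def change_membership(self, new_members, request_epoch):
        # The increment is itself an ordinary epoch-checked write; in the real
        # system it must also reach a write quorum of the protection group.
        self._check(request_epoch)
        self.membership_epoch += 1
        self.members = list(new_members)
        return self.membership_epoch

seg = Segment()
new_epoch = seg.change_membership(["A", "B", "C", "D", "E", "G"], request_epoch=1)
try:
    seg.read("block-7", request_epoch=1)              # stale caller is boxed out
except StaleEpoch:
    seg.read("block-7", request_epoch=new_epoch)      # refresh membership and retry
```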
sequence of errors and repairs may be. Transitions require only the single epoch update to the write quorum of a protection group. Updates of stale state are similarly simple, requiring just one additional request past the one rejected.

We also use epochs to manage volume growth, using a volume geometry epoch that increments with each protection group added to the volume. This can also be used to change the quorum model itself, for example, when moving from a 4/6 write quorum to 3/4 to handle the extended loss of an AZ.

4.2 Using Quorum Sets to Reduce Costs

Quorums are generally thought of as a collection of like members, grouped together to transparently handle failures. However, there is nothing in the quorum model to prevent unlike members with differing latency, cost, or durability characteristics.

In Aurora, a protection group is composed of three full segments, which store both redo log records and materialized data blocks, and three tail segments, which contain redo log records alone. Since most databases use much more space for data blocks than for redo logs, this yields a cost amplification closer to three copies of the data rather than a full six, while satisfying our requirement to support AZ+1 failures.

The use of full and tail segments changes how we construct our read and write sets. Our write quorum is 4/6 of any segment OR 3/3 of full segments. Our read quorum is therefore 3/6 of any segment AND 1/3 of full segments. In practice, this means that we write log records to the same 4/6 quorum as we did previously. At least one of these log records arrives at a full segment and generates a data block. We read data from our full segments, using the optimization described earlier to avoid quorum reads.

Repairing a tail segment simply requires reading from the other members of the protection group, using our SCL to determine and fill in the gaps from other quorum members with SCLs higher than our own. Repairing a full segment is a bit more complex, since the segment being repaired may have been the only full segment that saw the last write to the protection group.

Even so, we must have at least one other full segment from which we can read data blocks, even if it has not seen the most recent write. We have enough copies of the redo log record so that we can rebuild a full segment and be up to date. We also gossip between the segments of a quorum to ensure that any missing writes are quickly filled in. This reduces the probability that we need to rebuild a full segment without adding a performance burden to our write path. Once we have our full segment baseline, we can obtain redo log records from other segments using our SCL, in the same manner as tail segments.

There are many options available once one moves to quorum sets of unlike members. One can combine local disks to reduce latency and remote disks for durability and availability. One can combine SSDs for performance and HDDs for cost. One can span quorums across regions to improve disaster recovery. There are numerous moving parts that one needs to get right, but the payoffs can be significant. For Aurora, the quorum set model described earlier lets us achieve storage prices comparable to low-cost alternatives, while providing high durability, availability, and performance.

5 RELATED WORK

In this section we discuss other contributions and how they relate to the techniques used in Aurora and discussed in this paper.

Consensus and Distributed Transactions. Distributed systems rely on consensus to allow a group of processes to agree on a single value and tolerate faults in one or more of its members. Some notable consensus algorithms include Paxos and variants [4, 5], Raft [9], and Viewstamped Replication [8]. A distributed database requires a commit protocol that enforces that all processes start out in a "working" state and all either end in an "aborted" or "committed" state. Distributed commit may be implemented using consensus protocols such as Paxos or other approaches like 2-phase commit, and can incur considerable network overheads. Another recent system that avoids the use of distributed commit is Calvin [11], which implements a transaction scheduling and data replication layer that uses a deterministic ordering guarantee. Since all nodes reach an agreement regarding what transactions to attempt and in what order, Calvin is able to completely avoid distributed commit protocols, reducing the contention footprints of distributed transactions.

Quorums. Quorum-based approaches have been used for distributed commit protocols [10] as well as for replicating data [3].

Distributed SQL Databases. Google Cloud Spanner [1] is a SQL database on a quorum replicated system, using Multi-Paxos to establish consensus for every write, providing strong consistency guarantees. Cloud Spanner enables clustering of tables to reduce the participants in distributed transactions.

Replication. Traditional database replication techniques consume a physical or logical log that represents changes made in the database and replicate these changes in a completely independent database. For example, Liu et al. [6] describe how DB2 implements transactional replication from a partitioned database system by combining the physical write-ahead log from each node. Oracle uses physical replication via Data Guard [2] to provide high availability and disaster recovery. Some database systems like MySQL support logical replication [7] using command/statement logging [13].

6 CONCLUSIONS

Aurora avoids considerable network, storage, and database processing by leveraging a few simple techniques to avoid complex, brittle, and expensive consensus protocols. Most distributed consensus algorithms abhor state and establish their baseline from first principles. But, databases are all about the management of state. Why not use it for our own benefit?

Aurora is able to avoid much of the work of consensus by recognizing that, during normal forward processing of a system, there are local oases of consistency. Using backward chaining of redo records, a storage node can tell if it is missing data and gossip with its peers to fill in gaps. Using the advancement of segment chains, a database instance can determine whether it can advance durable points and reply to clients requesting commits. Coordination and