Azure Cosmos DB
Microsoft’s globally distributed database service
Agenda
1. Introduction
2. System model
3. Global distribution
4. Resource governance
5. Conclusion
1. Introduction
Introduction
• Started in 2010 as Project Florence
• Generally available since 2017
• Ring-0, foundational Azure service, operating in all (54+) Azure regions and all
Azure clouds (DoD, sovereign, etc.), serving tens of trillions of requests per day
• Ubiquitous inside Microsoft and one of the fastest-growing Azure services
Database designed for the cloud
1. Global distribution
2. Elastic and unlimited scalability: from hundreds of transactions/sec and gigabytes of data to millions of transactions/sec and petabytes of data
3. Cost efficiencies with fine-grained multi-tenancy
Design goals
1. Elastically scale throughput on-demand across any number of Azure regions
around the world, within 5s at the 99th percentile
2. Deliver <10ms end-to-end client-server read/write latencies at the 99th
percentile, in any Azure region
3. Offer 99.999% read/write high availability
4. Provide tunable consistency models for developers
5. Operate at a very low cost
6. Provide strict performance isolation between transactional and analytical
workloads
7. Build a schema-agnostic engine to support unbounded schema versioning
8. Support multiple data models and multiple popular OSS database APIs, all
operating on the same underlying data
Azure Cosmos DB – a globally distributed, multi-model database service
2. System model
Logical system model
• Using their Azure subscription, tenants create
one or more Cosmos accounts
• Customers can associate one or more Azure
regions with their Cosmos account and specify
the API for interacting with their data.
• Currently supported APIs – MongoDB,
Cassandra, SQL, Gremlin, Etcd, Spark, etc.
• Cosmos account manages one or more
databases
• Depending on the API, a database manages
one or more tables (or collections or graphs)
• Tables, Collections and Graphs get translated
to Containers
• Containers are schema-agnostic repositories of
Items (see the SDK sketch after this list)
• Containers are horizontally-partitioned based
on throughput and storage
• A partition consists of a replica set and is
strongly consistent and highly available
• A partition-set consists of partitions globally
distributed across multiple Azure regions
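
To make the account, database, container, item hierarchy above concrete, here is a minimal sketch assuming the azure-cosmos Python SDK (v4); the endpoint, key, database/container names, and the /city partition key path are placeholder values, not ones from the deck.

```python
from azure.cosmos import CosmosClient, PartitionKey

# Placeholder endpoint and key for a Cosmos account (normally read from configuration).
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")

# A Cosmos account manages one or more databases.
database = client.create_database_if_not_exists(id="demo-db")

# Depending on the API, tables/collections/graphs map onto containers,
# which are horizontally partitioned by a partition key.
container = database.create_container_if_not_exists(
    id="demo-container",
    partition_key=PartitionKey(path="/city"),
    offer_throughput=400,  # provisioned throughput in Request Units per second
)

# Containers are schema-agnostic repositories of items.
container.upsert_item({"id": "1", "city": "Seattle", "temperature": 17})
```

Each logical partition of the container is backed by a replica set, and the partition-set spans whichever Azure regions are associated with the account.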
Physical system model
• Cosmos DB service is
bootstrapped in a DC via the
CCP (Cosmic Control Plane)
• CCP is fully decentralized, with its state
replicated using another internal instance of
Cosmos DB
• Each DC hosts multiple Cosmos compute and
storage clusters
• Each cluster is spread across 10-20 fault
domains and 3 AZs
• Continuous capacity management with
dynamic, predictive cross-cluster and cross-DC
load balancing
• Each machine hosts replicas
belonging to 200-500 tenants
• Each replica hosts an instance
of Cosmos database engine
Database Engine – key ideas
1. Materializing ingested content as a tree blurs the boundary between schema and instance values
2. Schemas are speculatively/dynamically inferred from the content
3. The index is a union of the materialized trees, giving automatic indexing (a toy flattening sketch follows below)
(Figure: ingestion of formats such as Avro and Parquet into the materialized tree)
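
As a toy illustration of ideas 1-3 (not the actual engine code), the sketch below materializes a JSON-like document as a tree of (path, value) pairs; the union of such terms across documents is what an automatically maintained, schema-agnostic index can be built from.

```python
from typing import Any, Iterator, Tuple

def materialize(doc: Any, path: Tuple[str, ...] = ()) -> Iterator[Tuple[Tuple[str, ...], Any]]:
    """Flatten a JSON-like document into (path, value) pairs, a toy stand-in
    for materializing ingested content as a tree."""
    if isinstance(doc, dict):
        for key, value in doc.items():
            yield from materialize(value, path + (key,))
    elif isinstance(doc, list):
        for i, value in enumerate(doc):
            yield from materialize(value, path + (str(i),))
    else:
        yield path, doc

doc = {"id": "1", "address": {"city": "Seattle", "zip": "98052"}, "tags": ["db", "cloud"]}

# No schema is declared anywhere: the "schema" is whatever paths the content exhibits,
# and the union of these (path, value) terms over all documents forms the index.
index_terms = {("/".join(p), v) for p, v in materialize(doc)}
print(sorted(index_terms))
```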
Database Engine – key ideas
4. Schema-agnostic and version-resilient storage encoding and data model for content (JSON, BSON, CQL, SQL, ProtoBuf, Thrift, Parquet, ... all map onto the data model of an item), enabling a multi-model engine
5. The index is decoupled from the content; content is stored separately in a row store (committed updates on local SSDs) and a column store (column-compressed, append-only redo log plus invalidation bitmaps on inexpensive, off-cluster storage), with tiering between the two; all stores are fully resource governed, log structured, and latch-free
6. User-specified TTL policies for each store; strict performance isolation between OLTP and analytical workloads
Database Engine – key ideas
7. Multiple APIs get mapped onto a common Query IL
8. Blind writes via commutative merge (illustrated below)
9. Logical snapshots over the redo log support both time-travel queries and point-in-time restore (PITR)
10. Fully resource governed; capable of operating within frugal resource budgets
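
A small illustration of idea 8 (hypothetical, not the engine's actual merge operator): blind writes can be folded into a store with a merge function that is commutative and associative, so the merged state is independent of arrival order; here the merge is a per-key union of posting sets.

```python
from functools import reduce

def merge(a: dict, b: dict) -> dict:
    """Commutative, associative merge of two partial updates:
    a per-key union of posting sets."""
    out = {k: set(v) for k, v in a.items()}
    for k, v in b.items():
        out.setdefault(k, set()).update(v)
    return out

u1 = {"city/Seattle": {"doc1"}}
u2 = {"city/Seattle": {"doc2"}, "tag/db": {"doc2"}}
u3 = {"tag/db": {"doc3"}}

# Any arrival order of the blind writes yields the same merged state.
assert reduce(merge, [u1, u2, u3]) == reduce(merge, [u3, u1, u2])
print(reduce(merge, [u1, u2, u3]))
```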
3. Global distribution
Global distribution is turnkey
Partition as a Lego block for coordination
• Partition is used as a strongly consistent, highly
available, resource-governed, reusable building block
to solve many coordination problems in the system
• Global replication
• Inter-cluster partition load balancing
• Partition management operations including split, merge,
clone
• On-demand streaming
• Nested consensus
• Two reconfiguration state machines with failure detectors
composing with each other
• Dynamically self-adjusting membership and quorums (see the quorum sketch below)
• Built-in resource governance
• Leaders and followers are allotted different budgets
• Leaders are unburdened from serving reads
(Figure: partition split, merge, and global replication)
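
A very small sketch (my own illustration, not the Cosmos DB reconfiguration protocol) of what dynamically self-adjusting membership and quorums can look like: as the failure detector removes or re-adds replicas, the majority write quorum is recomputed from the current membership size.

```python
def write_quorum(membership_size: int) -> int:
    """Majority quorum for the current replica-set membership."""
    return membership_size // 2 + 1

membership = {"r1", "r2", "r3", "r4"}
print(write_quorum(len(membership)))   # 3 out of 4 replicas must ack

# A failure detector suspects r3 and reconfigures it out; the quorum shrinks.
membership.discard("r3")
print(write_quorum(len(membership)))   # 2 out of 3

# r3 is rebuilt and added back; the quorum grows again.
membership.add("r3")
print(write_quorum(len(membership)))   # 3 out of 4
```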
Global distribution
• The distribution protocol propagates writes within a partition-set, which typically
spans multiple regions
• Instead of stretching a replica-set across all regions, the protocol maintains a nested
replica-set in each region
• Significantly improves fault tolerance and avoids incast bottlenecks thanks to hierarchical
aggregation of acks
• Massive, dynamic connected graph of partitions stretching across Azure datacenters
worldwide
• Control-plane ensures strong routing consistency despite continuous addition and
removal of regions, partition load balancing, splits, merges, etc.
• Topology state is made highly available directly inside the data plane
• Guaranteed low-latency, highly available reads and writes with multi-master
replication
• Reads and writes are always local to the region
• Guaranteed RPO and RTO of 0
• Multi-homing with location-aware routing; programmatic support to add/remove
regions, simulate regional disasters, trigger failover, etc. (see the routing sketch below)
• Five precisely defined consistency levels: strong, bounded staleness, session, consistent
prefix, and eventual
• The selected consistency level determines the read quorum size within the replica-set in
the local region
• The distribution protocol touches each write only once, committing it immediately
within a region upon receipt
• The protocol remains scale-independent and latency-insensitive as the number of regions N grows
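
A minimal, hypothetical sketch of multi-homing with location-aware routing (the region names and failover policy are illustrative, not the actual Cosmos DB client logic): a request is routed to the nearest preferred region that is currently available, and fails over down the preference list when a region is unreachable.

```python
from typing import Iterable, Optional

def pick_region(preferred: Iterable[str], available: set) -> Optional[str]:
    """Return the first preferred region that is currently available."""
    for region in preferred:
        if region in available:
            return region
    return None  # no region available: surface an error to the caller

preferred_regions = ["West US", "East US", "North Europe"]  # ordered by proximity

# Normal operation: route to the closest region.
print(pick_region(preferred_regions, {"West US", "East US", "North Europe"}))  # West US

# Simulated regional disaster: West US is unreachable, traffic fails over to East US.
print(pick_region(preferred_regions, {"East US", "North Europe"}))             # East US
```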
Consensus within a replica set
1. Leader: receive request and start txn; validate and analyze content; generate schemas; generate index terms
2. Leader: update index store, update content store
3. Leader: append to the redo log
4. Leader: replicate to followers
5. Follower / xp-leader: start txn, apply updates to index and content stores, update redo log, commit txn; the xp-leader replicates to remote partitions
6. Follower: send ack to the leader
7. Leader: receive the quorum ack and commit the txn
8. Leader: columnar compression of redo records, apply invalidations, group commit to the column store
9. Leader: send response to the client (the full path is sketched below)
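
A condensed, self-contained rendering of the replica-set write path above as toy Python (the stores, log, and quorum check are simplified placeholders, not the actual engine interfaces).

```python
class Replica:
    """Toy follower: appends the redo record and acks (steps 5 and 6)."""
    def __init__(self):
        self.redo_log = []

    def replicate(self, record) -> bool:
        self.redo_log.append(record)  # apply updates and extend the redo log (simplified)
        return True                   # ack back to the leader

def leader_write(item, index_store, content_store, redo_log, followers, column_store):
    # 1. Receive the request, "analyze" the content and generate index terms (toy version).
    terms = [(key, item[key]) for key in item]
    # 2. Update the index store and the content store.
    for term in terms:
        index_store.setdefault(term, set()).add(item["id"])
    content_store[item["id"]] = item
    # 3. Append to the redo log, then 4. replicate the record to the followers.
    record = ("write", item)
    redo_log.append(record)
    acks = sum(f.replicate(record) for f in followers)
    # 7. On receiving a quorum of acks, commit; 8. group-commit to the column store
    #    (done asynchronously in the real engine); 9. respond to the client.
    if acks + 1 > (len(followers) + 1) // 2:  # the leader counts toward the majority
        column_store.append(record)
        return "committed"
    return "retry"

index_store, content_store, redo_log, column_store = {}, {}, [], []
followers = [Replica() for _ in range(3)]
print(leader_write({"id": "1", "city": "Seattle"}, index_store, content_store,
                   redo_log, followers, column_store))
```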
Consensus within a partition set
• All writes are durably committed in the local quorum
• Writes are visible immediately in the local region
• A change log of tentative writes is shipped for merge via anti-entropy
• Each region maintains a sequence number for its tentative writes
• A version vector clock captures the global progress
• Record-level conflicts are detected using vector clock comparison
• Conflict resolution is performed at a dynamically picked arbiter
• Results of conflict resolution are shipped to all regions to ensure guaranteed convergence

Example with three regional partitions P1, P2, P3 (see the vector clock sketch below):
1. P1 writes [1,0,0], P2 writes [0,a,0], and P3 writes [0,0,x]
2. P1 applies [1,0,0] locally and xp-replicates [1,0,0] to P2 and P3
3. P2 applies [0,a,0] and [1,0,0] locally; P3 applies [0,0,x] and [1,0,0] locally
4. P2 merges [0,a,0]; P3 merges [0,0,x]
5. P1 merges [0,a,0] and [0,0,x]
6. P1 resolves the conflict, producing [1,a,x]
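
A self-contained toy illustration of record-level conflict detection with vector clocks (not the actual Cosmos DB protocol): two versions conflict when neither clock dominates the other, in which case the record is handed to the arbiter for resolution, mirroring the P1/P2/P3 example above.

```python
def dominates(a, b):
    """True if vector clock a has observed everything that b has observed."""
    return all(ai >= bi for ai, bi in zip(a, b))

def detect(a, b):
    if dominates(a, b):
        return "a supersedes b"
    if dominates(b, a):
        return "b supersedes a"
    return "concurrent -> send to arbiter for conflict resolution"

# Concurrent writes from two regions (vector positions correspond to P1, P2, P3).
v_p1 = [1, 0, 0]
v_p2 = [0, 1, 0]
print(detect(v_p1, v_p2))              # concurrent -> send to arbiter

# After P2 merges P1's write, its clock dominates P1's original version.
v_p2_merged = [1, 1, 0]
print(detect(v_p2_merged, v_p1))       # a supersedes b
```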
Precisely defined consistency levels
Strong, Bounded staleness, Session, Consistent prefix, Eventual
Continuous consistency checking over live data
• A Consistency Checker Service continuously consumes request logs from the Cosmos DB cluster and builds a dependency graph of customer requests (write and read operations), which it checks against a linearizable history
• The resulting consistency metrics flow into a metrics DB and are surfaced through the Azure Portal
• Observed consistency is reported to customers along with the Probabilistically Bounded Staleness (PBS) metric (a toy PBS-style simulation follows below)
(Figure: P(consistency) versus time after commit in ms, 0 to 2500, for US West, US East, and North Europe)
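
A rough Monte Carlo sketch in the spirit of the PBS curve in the figure (purely illustrative; the PBS model of Bailis et al. and the deck's measured data are different): estimate the probability that a read issued t milliseconds after a commit observes that commit, under a simple exponential replication-lag assumption.

```python
import random

def p_consistent(t_ms: float, trials: int = 10_000, mean_lag_ms: float = 300.0) -> float:
    """Fraction of simulated reads that see the committed write by time t_ms,
    with replication lag drawn from an exponential distribution (an assumption,
    standing in for measured per-region lag)."""
    hits = sum(random.expovariate(1.0 / mean_lag_ms) <= t_ms for _ in range(trials))
    return hits / trials

for t in (0, 250, 500, 1000, 2000):
    print(f"P(consistent read) {t:>4} ms after commit: {p_consistent(t):.2f}")
```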
4. Resource governance
Fully resource governed stack
• Stateless communication protocols everywhere with
fixed upper bound on processing latency
• Capacity management, COGS and SLAs all depend on
stringent resource governance across the entire stack
• Request Unit (RU)
  • Rate-based currency (/sec, /min, /hr)
  • Normalized across various database operations
  • ML pipeline to calculate the query charges across different datasets and query patterns
  • Charges need to remain consistent across hardware generations
• Automated perf and resource-governance (RG) tracking every four hours to
detect code regressions
• All engine micro-operations are finely calibrated to live
within fixed budgets of system resources (a toy RU rate limiter is sketched below)
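
Because the RU is a rate-based currency, admission control can be pictured as a per-second budget. The sketch below is a deliberately simplified rate limiter (my illustration, not the actual Cosmos DB throttling logic) that rejects a request once the provisioned RU/s for the current one-second window is exhausted.

```python
import time

class RuBudget:
    """Toy per-second Request Unit budget: refill every second, charge per operation."""
    def __init__(self, provisioned_ru_per_sec: float):
        self.limit = provisioned_ru_per_sec
        self.remaining = provisioned_ru_per_sec
        self.window_start = time.monotonic()

    def try_charge(self, request_charge: float) -> bool:
        now = time.monotonic()
        if now - self.window_start >= 1.0:   # start a new one-second window
            self.remaining = self.limit
            self.window_start = now
        if request_charge <= self.remaining:
            self.remaining -= request_charge
            return True
        return False                         # caller should back off and retry

budget = RuBudget(provisioned_ru_per_sec=400)
print(budget.try_charge(10.5))   # True:  a 10.5 RU query fits in this second's budget
print(budget.try_charge(395))    # False: would exceed the remaining 389.5 RUs
```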
(Figure: consumed RUs, sum_TotalRequestCharge on a log scale, broken down by Azure region for Microsoft internal workloads, 9/7/2017-9/10/2017: 3 trillion transactions in 3 days across 20 Azure regions, with the highest throughput consumption in Central US)
(Figure: zooming into one of the clusters in Central US; node N-102 is serving more RUs than its peers, broken down by the tenants on N-102 in Central US)
Multi-tenancy and horizontal partitioning
Multi-tenancy and global distribution
5. Conclusion
Conclusion
• Cosmos is Microsoft’s globally distributed database service
• Foundational Azure service on which other Azure services depend
• Battle-tested by mission-critical, globally distributed workloads at massive scale
• Multi-master replication, global distribution, fine-grained resource
governance and responsive partition management are central to its
architecture
• Multiple precisely defined consistency levels with clear tradeoffs
• Comprehensive and stringent SLAs spanning consistency, latency,
throughput and availability
6. References
References
• Azure Cosmos DB, http://cosmosdb.com
• Consistency Levels in Azure Cosmos DB, https://docs.microsoft.com/en-us/azure/cosmos-db/consistency-levels
• "Schema-Agnostic Indexing with Azure DocumentDB," PVLDB 8(12), 1668-1679, 2015
• D. Terry, "Replicated Data Consistency Explained Through Baseball," Microsoft Technical Report MSR-TR-2011-137, October 2011; also in Communications of the ACM, December 2013
• L. Glendenning, I. Beschastnikh, A. Krishnamurthy, T. Anderson, "Scalable Consistency in Scatter," SOSP 2011, Cascais, Portugal, October 2011
• M. Naor, A. Wool, "The Load, Capacity, and Availability of Quorum Systems," SIAM Journal on Computing 27(2), 423-447, April 1998
• D. B. Terry, A. J. Demers, K. Petersen, M. Spreitzer, M. Theimer, B. B. Welch, "Session Guarantees for Weakly Consistent Replicated Data," PDIS 1994, 140-149
• P. Bailis, S. Venkataraman, M. J. Franklin, J. M. Hellerstein, I. Stoica, "Probabilistically Bounded Staleness for Practical Partial Quorums," PVLDB 5(8), 776-787, April 2012
http://joincosmosdb.com