Azure Cosmos DB
Microsoft’s globally distributed database service
Agenda
1. Introduction
2. System model
3. Global distribution
4. Resource governance
5. Conclusion
1. Introduction
Introduction
• Started in 2010 as Project Florence
• Generally available since 2017
• Ring-0, foundational Azure service, operating in all (54+) Azure regions and all
Azure clouds (DoD, sovereign, etc.), serving tens of trillions of requests per day
• Ubiquitous inside Microsoft and one of the fastest-growing Azure services
Database designed for the cloud
1. Global distribution
2. Elastic and unlimited scalability: from hundreds of transactions/sec and gigabytes of data to millions of transactions/sec and petabytes of data
3. Cost efficiencies with fine-grained multi-tenancy
Design goals
1. Elastically scale throughput on-demand across any number of Azure regions
around the world, within 5s at the 99th percentile
2. Deliver <10ms end-to-end client-server read/write latencies at the 99th
percentile, in any Azure region
3. Offer 99.999% read/write high availability
4. Provide tunable consistency models for developers
5. Operate at a very low cost
6. Provide strict performance isolation between transactional and analytical
workloads
7. Build a schema-agnostic engine to support unbounded schema versioning
8. Support multiple data models and multiple popular OSS database APIs, all
operating on the same underlying data
Azure Cosmos DB – a globally distributed, multi-model database service
2. System model
Logical system model
• Using their Azure subscription, tenants create
one or more Cosmos accounts
• Customers can associate one or more Azure
regions with their Cosmos account and specify
the API for interacting with their data.
• Currently supported APIs – MongoDB,
Cassandra, SQL, Gremlin, Etcd, Spark, etc.
• Cosmos account manages one or more
databases
• Depending on the API, a database manages
one or more tables (or collections or graphs)
• Tables, Collections and Graphs get translated
to Containers
• Containers are schema-agnostic repositories of
Items (see the SDK sketch after this list)
• Containers are horizontally-partitioned based
on throughput and storage
• A partition consists of a replica set and is
strongly consistent and highly available
• A partition-set consists of partitions globally
distributed across multiple Azure regions
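
To make the account, database, container, item hierarchy above concrete, here is a minimal sketch assuming the azure-cosmos Python SDK (v4); the endpoint, key, database/container names, and the /city partition key path are placeholder values, not ones from the deck.

```python
from azure.cosmos import CosmosClient, PartitionKey

# Placeholder endpoint and key for a Cosmos account (normally read from configuration).
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")

# A Cosmos account manages one or more databases.
database = client.create_database_if_not_exists(id="demo-db")

# Depending on the API, tables/collections/graphs map onto containers,
# which are horizontally partitioned by a partition key.
container = database.create_container_if_not_exists(
    id="demo-container",
    partition_key=PartitionKey(path="/city"),
    offer_throughput=400,  # provisioned throughput in Request Units per second
)

# Containers are schema-agnostic repositories of items.
container.upsert_item({"id": "1", "city": "Seattle", "temperature": 17})
```

Each logical partition of the container is backed by a replica set, and the partition-set spans whichever Azure regions are associated with the account.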
Physical system model
• Cosmos DB service is
bootstrapped in a DC via the
CCP (Cosmic Control Plane)
• CCP is fully decentralized, with its state
replicated using another internal instance of
Cosmos DB
• Each DC hosts multiple Cosmos compute and
storage clusters
• Each cluster is spread across 10-20 fault
domains and 3 AZs
• Continuous capacity management with
dynamic, predictive cross-cluster and cross-DC
load balancing
• Each machine hosts replicas
belonging to 200-500 tenants
• Each replica hosts an instance
of Cosmos database engine
Database Engine – key ideas
1. Materializing ingested content as a tree blurs the boundary between schema and instance values
2. Schemas are speculatively/dynamically inferred from the content
3. The index is a union of the materialized trees, giving automatic indexing (a toy flattening sketch follows below)
(Figure: ingestion of formats such as Avro and Parquet into the materialized tree)
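
As a toy illustration of ideas 1-3 (not the actual engine code), the sketch below materializes a JSON-like document as a tree of (path, value) pairs; the union of such terms across documents is what an automatically maintained, schema-agnostic index can be built from.

```python
from typing import Any, Iterator, Tuple

def materialize(doc: Any, path: Tuple[str, ...] = ()) -> Iterator[Tuple[Tuple[str, ...], Any]]:
    """Flatten a JSON-like document into (path, value) pairs, a toy stand-in
    for materializing ingested content as a tree."""
    if isinstance(doc, dict):
        for key, value in doc.items():
            yield from materialize(value, path + (key,))
    elif isinstance(doc, list):
        for i, value in enumerate(doc):
            yield from materialize(value, path + (str(i),))
    else:
        yield path, doc

doc = {"id": "1", "address": {"city": "Seattle", "zip": "98052"}, "tags": ["db", "cloud"]}

# No schema is declared anywhere: the "schema" is whatever paths the content exhibits,
# and the union of these (path, value) terms over all documents forms the index.
index_terms = {("/".join(p), v) for p, v in materialize(doc)}
print(sorted(index_terms))
```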
Database Engine – key ideas
4. Schema-agnostic and version-resilient storage encoding and data model for content (JSON, BSON, CQL, SQL, ProtoBuf, Thrift, Parquet, ... all map onto the data model of an item), enabling a multi-model engine
5. The index is decoupled from the content; content is stored separately in a row store (committed updates on local SSDs) and a column store (column-compressed, append-only redo log plus invalidation bitmaps on inexpensive, off-cluster storage), with tiering between the two; all stores are fully resource governed, log structured, and latch-free
6. User-specified TTL policies for each store; strict performance isolation between OLTP and analytical workloads
Database Engine – key ideas
7. Multiple APIs get mapped onto a common Query IL
8. Blind writes via commutative merge (illustrated below)
9. Logical snapshots over the redo log support both time-travel queries and point-in-time restore (PITR)
10. Fully resource governed; capable of operating within frugal resource budgets
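
A small illustration of idea 8 (hypothetical, not the engine's actual merge operator): blind writes can be folded into a store with a merge function that is commutative and associative, so the merged state is independent of arrival order; here the merge is a per-key union of posting sets.

```python
from functools import reduce

def merge(a: dict, b: dict) -> dict:
    """Commutative, associative merge of two partial updates:
    a per-key union of posting sets."""
    out = {k: set(v) for k, v in a.items()}
    for k, v in b.items():
        out.setdefault(k, set()).update(v)
    return out

u1 = {"city/Seattle": {"doc1"}}
u2 = {"city/Seattle": {"doc2"}, "tag/db": {"doc2"}}
u3 = {"tag/db": {"doc3"}}

# Any arrival order of the blind writes yields the same merged state.
assert reduce(merge, [u1, u2, u3]) == reduce(merge, [u3, u1, u2])
print(reduce(merge, [u1, u2, u3]))
```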
3. Global distribution
Global distribution is turnkey
Partition as a Lego block for coordination
• Partition is used as a strongly consistent, highly
available, resource-governed, reusable building block
to solve many coordination problems in the system
• Global replication
• Inter-cluster partition load balancing
• Partition management operations including split, merge,
clone
• On-demand streaming
• Nested consensus
• Two reconfiguration state machines with failure detectors
composing with each other
• Dynamically self-adjusting membership and quorums (see the quorum sketch below)
• Built-in resource governance
• Leaders and followers are allotted different budgets
• Leaders are unburdened from serving reads
(Figure: partition split, merge, and global replication)
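
A very small sketch (my own illustration, not the Cosmos DB reconfiguration protocol) of what dynamically self-adjusting membership and quorums can look like: as the failure detector removes or re-adds replicas, the majority write quorum is recomputed from the current membership size.

```python
def write_quorum(membership_size: int) -> int:
    """Majority quorum for the current replica-set membership."""
    return membership_size // 2 + 1

membership = {"r1", "r2", "r3", "r4"}
print(write_quorum(len(membership)))   # 3 out of 4 replicas must ack

# A failure detector suspects r3 and reconfigures it out; the quorum shrinks.
membership.discard("r3")
print(write_quorum(len(membership)))   # 2 out of 3

# r3 is rebuilt and added back; the quorum grows again.
membership.add("r3")
print(write_quorum(len(membership)))   # 3 out of 4
```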
Global distribution
• The distribution protocol propagates writes within a partition-set, which typically
spans multiple regions
• Instead of stretching a replica-set across all regions, the protocol maintains a nested
replica-set in each region
• Significantly improves fault tolerance and avoids incast bottlenecks thanks to hierarchical
aggregation of acks
• Massive, dynamic connected graph of partitions stretching across Azure datacenters
worldwide
• Control-plane ensures strong routing consistency despite continuous addition and
removal of regions, partition load balancing, splits, merges, etc.
• Topology state is made highly available directly inside the data plane
• Guaranteed low-latency, highly available reads and writes with multi-master
replication
• Reads and writes are always local to the region
• Guaranteed RPO and RTO of 0
• Multi-homing with location-aware routing; programmatic support to add/remove
regions, simulate regional disasters, trigger failover, etc. (see the routing sketch below)
• Five precisely defined consistency levels: strong, bounded staleness, session, consistent
prefix, and eventual
• The selected consistency level determines the read quorum size within the replica-set in
the local region
• The distribution protocol touches each write only once, committing it immediately
within a region upon receipt
• The protocol remains scale-independent and latency-insensitive as the number of regions N grows
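
A minimal, hypothetical sketch of multi-homing with location-aware routing (the region names and failover policy are illustrative, not the actual Cosmos DB client logic): a request is routed to the nearest preferred region that is currently available, and fails over down the preference list when a region is unreachable.

```python
from typing import Iterable, Optional

def pick_region(preferred: Iterable[str], available: set) -> Optional[str]:
    """Return the first preferred region that is currently available."""
    for region in preferred:
        if region in available:
            return region
    return None  # no region available: surface an error to the caller

preferred_regions = ["West US", "East US", "North Europe"]  # ordered by proximity

# Normal operation: route to the closest region.
print(pick_region(preferred_regions, {"West US", "East US", "North Europe"}))  # West US

# Simulated regional disaster: West US is unreachable, traffic fails over to East US.
print(pick_region(preferred_regions, {"East US", "North Europe"}))             # East US
```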
Consensus within a replica set
1. Leader: receive request and start txn; validate and analyze content; generate schemas; generate index terms
2. Leader: update index store, update content store
3. Leader: append to the redo log
4. Leader: replicate to followers
5. Follower / xp-leader: start txn, apply updates to index and content stores, update redo log, commit txn; the xp-leader replicates to remote partitions
6. Follower: send ack to the leader
7. Leader: receive the quorum ack and commit the txn
8. Leader: columnar compression of redo records, apply invalidations, group commit to the column store
9. Leader: send response to the client (the full path is sketched below)
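
A condensed, self-contained rendering of the replica-set write path above as toy Python (the stores, log, and quorum check are simplified placeholders, not the actual engine interfaces).

```python
class Replica:
    """Toy follower: appends the redo record and acks (steps 5 and 6)."""
    def __init__(self):
        self.redo_log = []

    def replicate(self, record) -> bool:
        self.redo_log.append(record)  # apply updates and extend the redo log (simplified)
        return True                   # ack back to the leader

def leader_write(item, index_store, content_store, redo_log, followers, column_store):
    # 1. Receive the request, "analyze" the content and generate index terms (toy version).
    terms = [(key, item[key]) for key in item]
    # 2. Update the index store and the content store.
    for term in terms:
        index_store.setdefault(term, set()).add(item["id"])
    content_store[item["id"]] = item
    # 3. Append to the redo log, then 4. replicate the record to the followers.
    record = ("write", item)
    redo_log.append(record)
    acks = sum(f.replicate(record) for f in followers)
    # 7. On receiving a quorum of acks, commit; 8. group-commit to the column store
    #    (done asynchronously in the real engine); 9. respond to the client.
    if acks + 1 > (len(followers) + 1) // 2:  # the leader counts toward the majority
        column_store.append(record)
        return "committed"
    return "retry"

index_store, content_store, redo_log, column_store = {}, {}, [], []
followers = [Replica() for _ in range(3)]
print(leader_write({"id": "1", "city": "Seattle"}, index_store, content_store,
                   redo_log, followers, column_store))
```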
Consensus within a partition set
• All writes are durably committed in the local quorum
• Writes are visible immediately in the local region
• A change log of tentative writes is shipped for merge via anti-entropy
• Each region maintains a sequence number for its tentative writes
• A version vector clock captures the global progress
• Record-level conflicts are detected using vector clock comparison
• Conflict resolution is performed at a dynamically picked arbiter
• Results of conflict resolution are shipped to all regions to ensure guaranteed convergence

Example with three regional partitions P1, P2, P3 (see the vector clock sketch below):
1. P1 writes [1,0,0], P2 writes [0,a,0], and P3 writes [0,0,x]
2. P1 applies [1,0,0] locally and xp-replicates [1,0,0] to P2 and P3
3. P2 applies [0,a,0] and [1,0,0] locally; P3 applies [0,0,x] and [1,0,0] locally
4. P2 merges [0,a,0]; P3 merges [0,0,x]
5. P1 merges [0,a,0] and [0,0,x]
6. P1 resolves the conflict, producing [1,a,x]
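
A self-contained toy illustration of record-level conflict detection with vector clocks (not the actual Cosmos DB protocol): two versions conflict when neither clock dominates the other, in which case the record is handed to the arbiter for resolution, mirroring the P1/P2/P3 example above.

```python
def dominates(a, b):
    """True if vector clock a has observed everything that b has observed."""
    return all(ai >= bi for ai, bi in zip(a, b))

def detect(a, b):
    if dominates(a, b):
        return "a supersedes b"
    if dominates(b, a):
        return "b supersedes a"
    return "concurrent -> send to arbiter for conflict resolution"

# Concurrent writes from two regions (vector positions correspond to P1, P2, P3).
v_p1 = [1, 0, 0]
v_p2 = [0, 1, 0]
print(detect(v_p1, v_p2))              # concurrent -> send to arbiter

# After P2 merges P1's write, its clock dominates P1's original version.
v_p2_merged = [1, 1, 0]
print(detect(v_p2_merged, v_p1))       # a supersedes b
```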
Precisely defined consistency levels
Strong, Bounded staleness, Session, Consistent prefix, Eventual
Continuous consistency checking over live data
• A Consistency Checker Service continuously consumes request logs from the Cosmos DB cluster and builds a dependency graph of customer requests (write and read operations), which it checks against a linearizable history
• The resulting consistency metrics flow into a metrics DB and are surfaced through the Azure Portal
• Observed consistency is reported to customers along with the Probabilistically Bounded Staleness (PBS) metric (a toy PBS-style simulation follows below)
(Figure: P(consistency) versus time after commit in ms, 0 to 2500, for US West, US East, and North Europe)
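
A rough Monte Carlo sketch in the spirit of the PBS curve in the figure (purely illustrative; the PBS model of Bailis et al. and the deck's measured data are different): estimate the probability that a read issued t milliseconds after a commit observes that commit, under a simple exponential replication-lag assumption.

```python
import random

def p_consistent(t_ms: float, trials: int = 10_000, mean_lag_ms: float = 300.0) -> float:
    """Fraction of simulated reads that see the committed write by time t_ms,
    with replication lag drawn from an exponential distribution (an assumption,
    standing in for measured per-region lag)."""
    hits = sum(random.expovariate(1.0 / mean_lag_ms) <= t_ms for _ in range(trials))
    return hits / trials

for t in (0, 250, 500, 1000, 2000):
    print(f"P(consistent read) {t:>4} ms after commit: {p_consistent(t):.2f}")
```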
4. Resource governance
Fully resource governed stack
• Stateless communication protocols everywhere with
fixed upper bound on processing latency
• Capacity management, COGS and SLAs all depend on
stringent resource governance across the entire stack
• Request Unit (RU)
  • Rate-based currency (/sec, /min, /hr)
  • Normalized across various database operations
  • ML pipeline to calculate the query charges across different datasets and query patterns
  • Charges need to remain consistent across hardware generations
• Automated perf and resource-governance (RG) tracking every four hours to
detect code regressions
• All engine micro-operations are finely calibrated to live
within fixed budgets of system resources (a toy RU rate limiter is sketched below)
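
Because the RU is a rate-based currency, admission control can be pictured as a per-second budget. The sketch below is a deliberately simplified rate limiter (my illustration, not the actual Cosmos DB throttling logic) that rejects a request once the provisioned RU/s for the current one-second window is exhausted.

```python
import time

class RuBudget:
    """Toy per-second Request Unit budget: refill every second, charge per operation."""
    def __init__(self, provisioned_ru_per_sec: float):
        self.limit = provisioned_ru_per_sec
        self.remaining = provisioned_ru_per_sec
        self.window_start = time.monotonic()

    def try_charge(self, request_charge: float) -> bool:
        now = time.monotonic()
        if now - self.window_start >= 1.0:   # start a new one-second window
            self.remaining = self.limit
            self.window_start = now
        if request_charge <= self.remaining:
            self.remaining -= request_charge
            return True
        return False                         # caller should back off and retry

budget = RuBudget(provisioned_ru_per_sec=400)
print(budget.try_charge(10.5))   # True:  a 10.5 RU query fits in this second's budget
print(budget.try_charge(395))    # False: would exceed the remaining 389.5 RUs
```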
(Figure: consumed RUs, sum_TotalRequestCharge on a log scale, broken down by Azure region for Microsoft internal workloads, 9/7/2017-9/10/2017: 3 trillion transactions in 3 days across 20 Azure regions, with the highest throughput consumption in Central US)
(Figure: zooming into one of the clusters in Central US; node N-102 is serving more RUs than its peers, broken down by the tenants on N-102 in Central US)
Multi-tenancy and horizontal partitioning
Multi-tenancy and global distribution
5. Conclusion
Conclusion
• Cosmos is Microsoft’s globally distributed database service
• Foundational Azure service on which other Azure services depend
• Battle-tested by mission-critical, globally distributed workloads at massive scale
• Multi-master replication, global distribution, fine-grained resource
governance and responsive partition management are central to its
architecture
• Multiple precisely defined consistency levels with clear tradeoffs
• Comprehensive and stringent SLAs spanning consistency, latency,
throughput and availability
6. References
References
• Azure Cosmos DB, http://cosmosdb.com
• Consistency Levels in Azure Cosmos DB, https://docs.microsoft.com/en-us/azure/cosmos-db/consistency-levels
• "Schema-Agnostic Indexing with Azure DocumentDB," PVLDB 8(12), 1668-1679, 2015
• D. Terry, "Replicated Data Consistency Explained Through Baseball," Microsoft Technical Report MSR-TR-2011-137, October 2011; also in Communications of the ACM, December 2013
• L. Glendenning, I. Beschastnikh, A. Krishnamurthy, T. Anderson, "Scalable Consistency in Scatter," SOSP 2011, Cascais, Portugal, October 2011
• M. Naor, A. Wool, "The Load, Capacity, and Availability of Quorum Systems," SIAM Journal on Computing 27(2), 423-447, April 1998
• D. B. Terry, A. J. Demers, K. Petersen, M. Spreitzer, M. Theimer, B. B. Welch, "Session Guarantees for Weakly Consistent Replicated Data," PDIS 1994, 140-149
• P. Bailis, S. Venkataraman, M. J. Franklin, J. M. Hellerstein, I. Stoica, "Probabilistically Bounded Staleness for Practical Partial Quorums," PVLDB 5(8), 776-787, April 2012
http://joincosmosdb.com