How Microsoft built
and scaled Cosmos
Cosmos
— Cosmos is a large-scale data processing system
— In use by thousands of internal users at Microsoft
— Distributed filesystem contains exabytes of data
— High-level SQL-like language to run jobs
processing up to petabytes at a time
Outline
— What made Cosmos successful
— Language
— Data sharing
— Technical Challenges
— Scalability challenges and architecture
— Supporting lower-latency workloads
— Conclusion
Language: Scope
— SQL-Like language
— Supports structured and unstructured data
— Easy to use and learn
Q = SSTREAM "queries.ss";
U = SSTREAM "users.ss";
J = SELECT *, Math.Round(Q.latency) AS l
    FROM Q, U WHERE Q.uid == U.uid;
OUTPUT J TO "output.txt";
“SCOPE: Parallel Databases Meet MapReduce” Jingren Zhou, Nicolas Bruno, Ming-chuan Wu, Paul Larson, Ronnie Chaiken, Darren Shakib,
The VLDB Journal, 2012
Scope
— C# extensibility
— Supports user defined objects
input = EXTRACT user, session, blob
        FROM "log_%n.txt?n=1...10"
        USING DefaultTextExtractor;
SELECT user, session,
       new RequestInfo(blob) AS request
FROM input
WHERE request.Browser.IsChrome();
“SCOPE: Parallel Databases Meet MapReduce” Jingren Zhou, Nicolas Bruno, Ming-chuan Wu, Paul Larson, Ronnie Chaiken, Darren Shakib,
The VLDB Journal, 2012
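For illustration, a user-defined object such as RequestInfo above is just an ordinary C# class. The sketch below is a hypothetical reconstruction, assuming a tab-separated log blob with the user agent in the first field; the real type is internal to Microsoft, so every name and parsing detail here is an assumption.

// Hypothetical sketch of a SCOPE user-defined object; not the real RequestInfo.
public class BrowserInfo
{
    private readonly string userAgent;
    public BrowserInfo(string userAgent) { this.userAgent = userAgent; }

    // SCOPE predicates can call ordinary C# methods like this one.
    public bool IsChrome() { return userAgent.Contains("Chrome"); }
}

public class RequestInfo
{
    public BrowserInfo Browser { get; private set; }

    public RequestInfo(string blob)
    {
        // Assumed format: tab-separated fields with the user agent first.
        string[] fields = blob.Split('\t');
        Browser = new BrowserInfo(fields.Length > 0 ? fields[0] : "");
    }
}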
Scope Distributed Execution
— Queries are parsed into a logical operator tree
— The optimizer transforms the query into a physical
operator graph, which is then compiled into
binaries
— The physical operator graph and binaries are
handed to a scheduler for execution
“Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing” Eric Boutin, Jaliya Ekanayake, Wei Lin, Bing Shi, Jingren Zhou, Zhengping
Qian, Ming Wu, and Lidong Zhou, in Proc. of the 2014 OSDI Conference (OSDI'14)
Scope Optimizer
Data Sharing
— Users share data by reference
— Teams put their data in Cosmos because that is
where the data they want to join against is
— Skype, Windows, Xbox, Bing, Ads, Office, and
more
http://research.microsoft.com/en-us/events/fs2011/helland_cosmos_big_data_and_big_challenges.pdf
https://azure.microsoft.com/en-us/blog/behind-the-scenes-of-azure-data-lake-bringing-microsoft-s-big-data-experience-to-hadoop/
“Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing” Eric Boutin, Jaliya Ekanayake, Wei Lin, Bing Shi, Jingren Zhou, Zhengping
Qian, Ming Wu, and Lidong Zhou, in Proc. of the 2014 OSDI Conference (OSDI'14)
Network Effect
— Teams put their data in Cosmos because that is
where the data they want to join against is
http://research.microsoft.com/en-us/events/fs2011/helland_cosmos_big_data_and_big_challenges.pdf
JETS operates a high-scale, modern data pipeline for Office.
Telemetry data from clients and services are combined into both
custom (app-domain-specific) and common System Health data
sets in Cosmos.
organization reports to surface [..] release risks and
telemetry information using map/reduce COSMOS
The WSD organization is responsible for delivering security and
non-security fixes to Windows OSes to billions of customers, every
month on Patch Tuesday
Are you interested in building the BI platform for Bing Ads?
Experience working with C#, C++, or Java, and Cosmos, is highly
desirable.
Are you excited about delivering the next generation personal
assistant, Cortana, to millions of people using Windows worldwide?
Experience with "Big Data" technologies like Cosmos
Data Sharing
— Users share data by reference
— Teams put their data in Cosmos because that is
where the data they want to join against is
— Skype, Windows, Xbox, Bing, Ads, Office, and
more
— This drives huge scalability requirements
— Cluster sizes exceed 50,000 servers
http://research.microsoft.com/en-us/events/fs2011/helland_cosmos_big_data_and_big_challenges.pdf
https://azure.microsoft.com/en-us/blog/behind-the-scenes-of-azure-data-lake-bringing-microsoft-s-big-data-experience-to-hadoop/
“Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing” Eric Boutin, Jaliya Ekanayake, Wei Lin, Bing Shi, Jingren Zhou, Zhengping
Qian, Ming Wu, and Lidong Zhou, in Proc. of the 2014 OSDI Conference (OSDI'14)
Outline
— What made Cosmos successful
— Language
— Data sharing
— Technical Challenges
— Scalability challenges and architecture
— Supporting lower-latency workloads
— Conclusion
Plan Optimizations
— At large scale, query plan manipulations are
required to improve the efficiency of sorts,
aggregations, and broadcasts (partial aggregation
is sketched below)
Aggregation
Broadcast Joins
Parallel Sort
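To make the aggregation manipulation concrete, here is a minimal C# sketch of partial (local) aggregation: each task pre-aggregates its own partition so only small partial sums cross the network before a final global merge. This illustrates the idea only; it is not SCOPE optimizer internals, and the key/value sum shape is an assumption.

using System.Collections.Generic;

static class PartialAggregation
{
    // Local step, run independently on each task's partition; the output is
    // far smaller than the raw rows it summarizes.
    public static Dictionary<string, long> LocalSum(
        IEnumerable<(string Key, long Value)> partition)
    {
        var partial = new Dictionary<string, long>();
        foreach (var (key, value) in partition)
            partial[key] = partial.TryGetValue(key, out var sum) ? sum + value : value;
        return partial;
    }

    // Global step: merge the partial results shipped from every task.
    public static Dictionary<string, long> Merge(
        IEnumerable<Dictionary<string, long>> partials)
    {
        var total = new Dictionary<string, long>();
        foreach (var partial in partials)
            foreach (var (key, sum) in partial)
                total[key] = total.TryGetValue(key, out var t) ? t + sum : sum;
        return total;
    }
}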
Scaling the Execution:
Apollo (OSDI’14)
— A large number of users share execution resources
for data locality
— How to minimize latency while maximizing cluster
utilization?
— Challenges:
— Scale
— Heterogeneous workload
— Maximizing utilization
“Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing” Eric Boutin, Jaliya Ekanayake, Wei Lin, Bing Shi, Jingren Zhou, Zhengping
Qian, Ming Wu, and Lidong Zhou, in Proc. of the 2014 OSDI Conference (OSDI'14)
Heterogeneous Workload
Dynamic Workload
How to effectively use resources while maintaining
performance guarantees with a dynamic workload?
Architecture
— For scalability, the architecture adopts a fully
decentralized control plane
— Each job has its own scheduler instance
— Each scheduler makes independent decisions
informed by global information
Architecture
Scheduler:
There is one scheduler per job for
scalability
The scheduler makes local
decisions and directly dispatches
tasks to process nodes
Architecture
Process Nodes:
Execute tasks on behalf of job
managers
Provide local resource isolation
Send status updates, aggregated
by a resource monitor
Architecture
Resource Monitor:
Aggregates status information
from process nodes
Provides cluster load information
to schedulers to inform future
scheduling
Architecture
The queue at the process node (PN) allows the scheduler
to reason about future resource availability
Representing Load
— How to concisely represent load?
— Represents the expected wait time to
acquire resources
— Integrated into a scheduler cost model
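A minimal sketch of that representation, in the spirit of Apollo's wait-time matrix: a node estimates how long a task needing a given number of cores would wait, based on its queue of running tasks. The types, the single-resource (cores only) simplification, and the finish-in-order assumption are all illustrative, not actual Cosmos code.

using System.Collections.Generic;
using System.Linq;

record QueuedTask(double RemainingSeconds, int Cores);

static class LoadModel
{
    // Expected wait before a task needing coresNeeded cores could start on a
    // node with totalCores, assuming running tasks free cores as they finish.
    public static double ExpectedWaitSeconds(
        List<QueuedTask> queue, int totalCores, int coresNeeded)
    {
        int free = totalCores - queue.Sum(t => t.Cores);
        double wait = 0;
        foreach (var task in queue.OrderBy(t => t.RemainingSeconds))
        {
            if (free >= coresNeeded) break;
            wait = task.RemainingSeconds; // this task's completion frees its cores
            free += task.Cores;
        }
        return wait;
    }
}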
Optimizing for various
factors
— To make optimal scheduling decisions, multiple factors have to be
considered at the same time (folded into a single estimate in the sketch below)
— Input location
— Network topology
— Wait time
— Initialization time
— Machine health and probability of failure
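A hedged sketch of how these factors might fold into one per-server score, loosely following the task-completion-time estimate in the Apollo paper; the field names and the failure-probability weighting below are assumptions, and the real cost model is richer.

// Illustrative only: estimate completion time for one task on one candidate server.
record Candidate(
    double ExpectedWaitSeconds,    // from the wait-time load representation
    double InitSeconds,            // deploying binaries, warming up
    double RemoteInputBytes,       // input that is not local to this server
    double NetworkBytesPerSecond,  // effective bandwidth, shaped by topology
    double FailureProbability);    // from machine health monitoring

static class CostModel
{
    public static double EstimatedCompletionSeconds(Candidate c, double runSeconds)
    {
        double transfer = c.RemoteInputBytes / c.NetworkBytesPerSecond;
        double expected = c.ExpectedWaitSeconds + c.InitSeconds + transfer + runSeconds;
        return expected / (1.0 - c.FailureProbability); // penalize machines likely to fail
    }
}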
Scheduler Performance
[Chart: job performance for the baseline scheduler and Apollo against two ideal trace-driven schedulers, one capacity-constrained and one with infinite capacity]
The Cosmos scheduler performs within 5% of the
ideal trace-driven scheduler
Utilization
Cosmos maintains a median utilization above 80%
on weekdays while supporting latency-sensitive
workloads
More in the paper
— Scheduler cost model
— Opportunistic scheduling
— Stable matching
“Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing”
Eric Boutin, Jaliya Ekanayake, Wei Lin, Bing Shi, Jingren Zhou, Zhengping Qian,
Ming Wu, and Lidong Zhou, in Proc. of the 2014 OSDI Conference (OSDI'14)
Outline
— What made Cosmos successful
— Language
— Data sharing
— Technical Challenges
— Scalability challenges and architecture
— Supporting lower-latency workloads
— Conclusion
Supporting lower-latency workloads
— As the customer base increased, the workload
diversified
— Users requested the ability to get interactive
latencies on the same data
— While Apollo can scale to jobs processing
petabytes of data, it has undesirable overhead for
smaller jobs
Supporting lower-latency workloads
— How to provide interactive latencies at cloud scale?
— How to provide fault tolerance in an interactive
context?
JetScope (VLDB ’15)
— Provides interactive capabilities on Cosmos &
Scope
— Paradigm shift in the execution model:
— Stream intermediate results
— Gang scheduling
Intermediate Results
Streaming
— JetScope avoids materializing intermediate results to disk
— Tasks write to a local service, StreamNet, which
manages communication (a toy sketch follows this list)
— Challenges:
— Deadlock on ordered merge when using finite
communication buffers
— Too many connections
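A toy stand-in for this streaming model using .NET channels; StreamNet itself is not public, so everything here is an assumption. The bounded (finite) buffer makes the writer block when the reader falls behind, which is exactly why an ordered merge over several such streams can deadlock.

using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

static class StreamingSketch
{
    // Finite communication buffer, as on the slide.
    static readonly Channel<string> channel = Channel.CreateBounded<string>(1024);

    // Upstream task: push rows as they are produced instead of writing them to disk.
    public static async Task ProduceAsync(IEnumerable<string> rows)
    {
        foreach (var row in rows)
            await channel.Writer.WriteAsync(row); // blocks while the buffer is full
        channel.Writer.Complete();
    }

    // Downstream task: consume rows as they arrive, overlapping with the producer.
    public static async Task ConsumeAsync()
    {
        await foreach (var row in channel.Reader.ReadAllAsync())
            System.Console.WriteLine(row); // stand-in for the downstream operator
    }
}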
Gang Scheduling
— To achieve minimal latency, JetScope starts all
tasks at the same time (gang scheduling)
— Overlapping task execution increases
parallelism
— Challenge: Scheduler deadlock
— Two schedulers incrementally acquire resources
— Resources run out, and neither job can execute
— Solution: Admission control (sketched below)
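A minimal sketch of the admission-control idea under gang scheduling, assuming a single capacity counter: a job is admitted only if slots for its entire gang can be reserved atomically, so two jobs can never deadlock by each holding part of the cluster. Names are placeholders, not JetScope code.

static class AdmissionControl
{
    static readonly object gate = new object();
    static int availableSlots = 10_000; // assumed cluster capacity in task slots

    // Reserve the whole gang at once, or not at all.
    public static bool TryAdmit(int gangSize)
    {
        lock (gate)
        {
            if (availableSlots < gangSize)
                return false; // the job queues instead of partially acquiring resources
            availableSlots -= gangSize;
            return true;
        }
    }

    public static void Release(int gangSize)
    {
        lock (gate) { availableSlots += gangSize; }
    }
}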
Fault Tolerance
— Chance of failure increases with the number of
servers touched
— A job could fail repeatedly and never complete
— We need a fault tolerance mechanism that
doesn't impact performance
— Details are in the paper
How does JetScope scale?
[Chart: latency in seconds (0 to 50) for queries Q1, Q4, Q6, Q12, and Q15, comparing 1 TB on 20 servers with 10 TB on 200 servers]
Similar latency after a 10x scale increase
Conclusion
— Cosmos is a large-scale distributed data processing
system
— Stores exabytes of data across many clusters, which
can contain over 50,000 servers
— Provides both batch processing and interactive
processing
— Has a fully decentralized control plane for scalability
— Operates at high utilization to maintain low query
cost