How Microsoft built
and scaled Cosmos
Cosmos
— Cosmos is a large-scale data processing system
— In use by thousands of internal users at Microsoft
— Distributed filesystem contains exabytes of data
— High-level SQL-like language to run jobs
processing up to petabytes at a time
Outline
— What made Cosmos successful
— Language
— Data sharing
— Technical Challenges
— Scalability challenges and architecture
— Supporting lower-latency workloads
— Conclusion
Language: Scope
— SQL-Like language
— Supports structured and unstructured data
— Easy to use and learn
Q = SSTREAM "queries.ss";
U = SSTREAM "users.ss";
J = SELECT *, Math.Round(Q.latency) AS l
    FROM Q, U WHERE Q.uid == U.uid;
OUTPUT J TO "output.txt";
“SCOPE: Parallel Databases Meet MapReduce” Jingren Zhou, Nicolas Bruno, Ming-chuan Wu, Paul Larson, Ronnie Chaiken, Darren Shakib,
The VLDB Journal, 2012
Scope
— C# extensibility
— Supports user defined objects
input = EXTRACT user, session, blob
        FROM "log_%n.txt?n=1...10"
        USING DefaultTextExtractor;
SELECT user, session,
       new RequestInfo(blob) AS request
FROM input
WHERE request.Browser.IsChrome();
“SCOPE: Parallel Databases Meet MapReduce” Jingren Zhou, Nicolas Bruno, Ming-chuan Wu, Paul Larson, Ronnie Chaiken, Darren Shakib,
The VLDB Journal, 2012
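For illustration, a user-defined object such as RequestInfo above is just an ordinary C# class. The sketch below is a hypothetical reconstruction, assuming a tab-separated log blob with the user agent in the first field; the real type is internal to Microsoft, so every name and parsing detail here is an assumption.

// Hypothetical sketch of a SCOPE user-defined object; not the real RequestInfo.
public class BrowserInfo
{
    private readonly string userAgent;
    public BrowserInfo(string userAgent) { this.userAgent = userAgent; }

    // SCOPE predicates can call ordinary C# methods like this one.
    public bool IsChrome() { return userAgent.Contains("Chrome"); }
}

public class RequestInfo
{
    public BrowserInfo Browser { get; private set; }

    public RequestInfo(string blob)
    {
        // Assumed format: tab-separated fields with the user agent first.
        string[] fields = blob.Split('\t');
        Browser = new BrowserInfo(fields.Length > 0 ? fields[0] : "");
    }
}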
Scope Distributed Execution
— Queries are parsed into a logical operator tree
— The optimizer transforms the query into a physical
operator graph, which is then compiled into
binaries
— The physical operator graph and binaries are
handed to a scheduler for execution
“Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing” Eric Boutin, Jaliya Ekanayake, Wei Lin, Bing Shi, Jingren Zhou, Zhengping
Qian, Ming Wu, and Lidong Zhou, in Proc. of the 2014 OSDI Conference (OSDI'14)
Scope Optimizer
Data Sharing
— Users share data by reference
— Teams put their data in Cosmos because that is
where the data they want to join against is
— Skype, Windows, Xbox, Bing, Ads, Office, and
more
http://research.microsoft.com/en-us/events/fs2011/helland_cosmos_big_data_and_big_challenges.pdf
https://azure.microsoft.com/en-us/blog/behind-the-scenes-of-azure-data-lake-bringing-microsoft-s-big-data-experience-to-hadoop/
“Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing” Eric Boutin, Jaliya Ekanayake, Wei Lin, Bing Shi, Jingren Zhou, Zhengping
Qian, Ming Wu, and Lidong Zhou, in Proc. of the 2014 OSDI Conference (OSDI'14)
Network Effect
— Teams put their data in Cosmos because that is
where the data they want to join against is
http://research.microsoft.com/en-us/events/fs2011/helland_cosmos_big_data_and_big_challenges.pdf
JETS operates a high-scale, modern data pipeline for Office.
Telemetry data from clients and services are combined into both
custom (app-domain-specific) and common System Health data
sets in Cosmos.
organization reports to surface [..] release risks and
telemetry information using map/reduce COSMOS
The WSD organization is responsible for delivering security and
non-security fixes to Windows OSes to billions of customers, every
month on Patch Tuesday
Are you interested in building the BI platform for Bing Ads?
Experience working with C#, C++, or Java, and Cosmos, is highly
desirable.
Are you excited about delivering the next generation personal
assistant, Cortana, to millions of people using Windows worldwide?
Experience with "Big Data" technologies like Cosmos
Data Sharing
— Users share data by reference
— Teams put their data in Cosmos because that is
where the data they want to join against is
— Skype, Windows, Xbox, Bing, Ads, Office, and
more
— This drives huge scalability requirements
— Cluster sizes exceed 50,000 servers
http://research.microsoft.com/en-us/events/fs2011/helland_cosmos_big_data_and_big_challenges.pdf
https://azure.microsoft.com/en-us/blog/behind-the-scenes-of-azure-data-lake-bringing-microsoft-s-big-data-experience-to-hadoop/
“Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing” Eric Boutin, Jaliya Ekanayake, Wei Lin, Bing Shi, Jingren Zhou, Zhengping
Qian, Ming Wu, and Lidong Zhou, in Proc. of the 2014 OSDI Conference (OSDI'14)
Outline
— What made Cosmos successful
— Language
— Data sharing
— Technical Challenges
— Scalability challenges and architecture
— Supporting lower-latency workloads
— Conclusion
Plan Optimizations
— At large scale, query plan manipulations are
required to improve the efficiency of sorts,
aggregations, and broadcasts (partial aggregation
is sketched below)
Aggregation
Broadcast Joins
Parallel Sort
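To make the aggregation manipulation concrete, here is a minimal C# sketch of partial (local) aggregation: each task pre-aggregates its own partition so only small partial sums cross the network before a final global merge. This illustrates the idea only; it is not SCOPE optimizer internals, and the key/value sum shape is an assumption.

using System.Collections.Generic;

static class PartialAggregation
{
    // Local step, run independently on each task's partition; the output is
    // far smaller than the raw rows it summarizes.
    public static Dictionary<string, long> LocalSum(
        IEnumerable<(string Key, long Value)> partition)
    {
        var partial = new Dictionary<string, long>();
        foreach (var (key, value) in partition)
            partial[key] = partial.TryGetValue(key, out var sum) ? sum + value : value;
        return partial;
    }

    // Global step: merge the partial results shipped from every task.
    public static Dictionary<string, long> Merge(
        IEnumerable<Dictionary<string, long>> partials)
    {
        var total = new Dictionary<string, long>();
        foreach (var partial in partials)
            foreach (var (key, sum) in partial)
                total[key] = total.TryGetValue(key, out var t) ? t + sum : sum;
        return total;
    }
}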
Scaling the Execution:
Apollo (OSDI’14)
— A large number of users share execution resources
for data locality
— How to minimize latency while maximizing cluster
utilization?
— Challenges:
— Scale
— Heterogeneous workload
— Maximizing utilization
“Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing” Eric Boutin, Jaliya Ekanayake, Wei Lin, Bing Shi, Jingren Zhou, Zhengping
Qian, Ming Wu, and Lidong Zhou, in Proc. of the 2014 OSDI Conference (OSDI'14)
Heterogeneous Workload
Dynamic Workload
How to effectively use resources while maintaining
performance guarantees with a dynamic workload?
Architecture
— For scalability, the architecture adopts a fully
decentralized control plane
— Each job has its own scheduler instance
— Each scheduler makes independent decisions
informed by global information
Architecture
Scheduler:
There is one scheduler per job for
scalability
The scheduler makes local
decisions and directly dispatches
tasks to process nodes
Architecture
Process Nodes:
Execute tasks on behalf of job
managers
Provide local resource isolation
Send status updates, aggregated
by a resource monitor
Architecture
Resource Monitor:
Aggregates status information
from process nodes
Provides cluster load information
to schedulers to inform future
scheduling
Architecture
The queue at the process node (PN) allows the scheduler
to reason about future resource availability
Representing Load
— How to concisely represent load?
— Represents the expected wait time to
acquire resources
— Integrated into a scheduler cost model
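A minimal sketch of that representation, in the spirit of Apollo's wait-time matrix: a node estimates how long a task needing a given number of cores would wait, based on its queue of running tasks. The types, the single-resource (cores only) simplification, and the finish-in-order assumption are all illustrative, not actual Cosmos code.

using System.Collections.Generic;
using System.Linq;

record QueuedTask(double RemainingSeconds, int Cores);

static class LoadModel
{
    // Expected wait before a task needing coresNeeded cores could start on a
    // node with totalCores, assuming running tasks free cores as they finish.
    public static double ExpectedWaitSeconds(
        List<QueuedTask> queue, int totalCores, int coresNeeded)
    {
        int free = totalCores - queue.Sum(t => t.Cores);
        double wait = 0;
        foreach (var task in queue.OrderBy(t => t.RemainingSeconds))
        {
            if (free >= coresNeeded) break;
            wait = task.RemainingSeconds; // this task's completion frees its cores
            free += task.Cores;
        }
        return wait;
    }
}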
Optimizing for various
factors
— To make optimal scheduling decisions, multiple factors have to be
considered at the same time (folded into a single estimate in the sketch below)
— Input location
— Network topology
— Wait time
— Initialization time
— Machine health and probability of failure
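A hedged sketch of how these factors might fold into one per-server score, loosely following the task-completion-time estimate in the Apollo paper; the field names and the failure-probability weighting below are assumptions, and the real cost model is richer.

// Illustrative only: estimate completion time for one task on one candidate server.
record Candidate(
    double ExpectedWaitSeconds,    // from the wait-time load representation
    double InitSeconds,            // deploying binaries, warming up
    double RemoteInputBytes,       // input that is not local to this server
    double NetworkBytesPerSecond,  // effective bandwidth, shaped by topology
    double FailureProbability);    // from machine health monitoring

static class CostModel
{
    public static double EstimatedCompletionSeconds(Candidate c, double runSeconds)
    {
        double transfer = c.RemoteInputBytes / c.NetworkBytesPerSecond;
        double expected = c.ExpectedWaitSeconds + c.InitSeconds + transfer + runSeconds;
        return expected / (1.0 - c.FailureProbability); // penalize machines likely to fail
    }
}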
Scheduler Performance
[Chart: job performance for the baseline scheduler and Apollo against two ideal trace-driven schedulers, one capacity-constrained and one with infinite capacity]
The Cosmos scheduler performs within 5% of the
ideal trace-driven scheduler
Utilization
Cosmos maintains a median utilization above 80%
on weekdays while supporting latency-sensitive
workloads
More in the paper
— Scheduler cost model
— Opportunistic scheduling
— Stable matching
“Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing”
Eric Boutin, Jaliya Ekanayake, Wei Lin, Bing Shi, Jingren Zhou, Zhengping Qian,
Ming Wu, and Lidong Zhou, in Proc. of the 2014 OSDI Conference (OSDI'14)
Outline
— What made Cosmos successful
— Language
— Data sharing
— Technical Challenges
— Scalability challenges and architecture
— Supporting lower-latency workloads
— Conclusion
Supporting lower-latency workloads
— As the customer base increased, the workload
diversified
— Users requested the ability to get interactive
latencies on the same data
— While Apollo can scale to jobs processing
petabytes of data, it has undesirable overhead for
smaller jobs
Supporting lower-latency workloads
— How to provide interactive latencies at cloud scale?
— How to provide fault tolerance in an interactive
context?
JetScope (VLDB ’15)
— Provides interactive capabilities on Cosmos &
Scope
— Paradigm shift in the execution model:
— Stream intermediate results
— Gang scheduling
Intermediate Results
Streaming
— JetScope avoids materializing intermediate results to disk
— Tasks write to a local service, StreamNet, which
manages communication (a toy sketch follows this list)
— Challenges:
— Deadlock on ordered merge when using finite
communication buffers
— Too many connections
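A toy stand-in for this streaming model using .NET channels; StreamNet itself is not public, so everything here is an assumption. The bounded (finite) buffer makes the writer block when the reader falls behind, which is exactly why an ordered merge over several such streams can deadlock.

using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

static class StreamingSketch
{
    // Finite communication buffer, as on the slide.
    static readonly Channel<string> channel = Channel.CreateBounded<string>(1024);

    // Upstream task: push rows as they are produced instead of writing them to disk.
    public static async Task ProduceAsync(IEnumerable<string> rows)
    {
        foreach (var row in rows)
            await channel.Writer.WriteAsync(row); // blocks while the buffer is full
        channel.Writer.Complete();
    }

    // Downstream task: consume rows as they arrive, overlapping with the producer.
    public static async Task ConsumeAsync()
    {
        await foreach (var row in channel.Reader.ReadAllAsync())
            System.Console.WriteLine(row); // stand-in for the downstream operator
    }
}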
Gang Scheduling
— To achieve minimal latency, JetScope starts all
tasks at the same time (gang scheduling)
— Overlapping task execution increases
parallelism
— Challenge: Scheduler deadlock
— Two schedulers incrementally acquire resources
— Resources run out, and neither job can execute
— Solution: Admission control (sketched below)
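A minimal sketch of the admission-control idea under gang scheduling, assuming a single capacity counter: a job is admitted only if slots for its entire gang can be reserved atomically, so two jobs can never deadlock by each holding part of the cluster. Names are placeholders, not JetScope code.

static class AdmissionControl
{
    static readonly object gate = new object();
    static int availableSlots = 10_000; // assumed cluster capacity in task slots

    // Reserve the whole gang at once, or not at all.
    public static bool TryAdmit(int gangSize)
    {
        lock (gate)
        {
            if (availableSlots < gangSize)
                return false; // the job queues instead of partially acquiring resources
            availableSlots -= gangSize;
            return true;
        }
    }

    public static void Release(int gangSize)
    {
        lock (gate) { availableSlots += gangSize; }
    }
}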
Fault Tolerance
— Chance of failure increases with the number of
servers touched
— A job could fail repeatedly and never complete
— We need a fault tolerance mechanism that
doesn't impact performance
— Details are in the paper
How does JetScope scale?
[Chart: latency in seconds (0 to 50) for queries Q1, Q4, Q6, Q12, and Q15, comparing 1 TB on 20 servers with 10 TB on 200 servers]
Similar latency after a 10x scale increase
Conclusion
— Cosmos is a large-scale distributed data processing
system
— Stores exabytes of data across many clusters, which
can contain over 50,000 servers
— Provides both batch processing and interactive
processing
— Has a fully decentralized control plane for scalability
— Operates at high utilization to maintain low query
cost