
Understanding

SharePoint Online

An Engineer’s Guide, Refreshed

2023 Edition



Chapter 1 Introduction
  1.1 Service is a Verb
  1.2 Glossary
  1.3 Service Capabilities
  1.4 Dependencies
  1.5 Adding New features
Chapter 2 Service Design Principles
  2.1 Secure
  2.2 Reliable
  2.3 Deployable
  2.4 Manageable
  2.5 Testable
  2.6 Debuggable
  2.7 Scalable
  2.8 Cost Efficient
  2.9 Highly Available
Chapter 3 SPO Engineering
  3.1 Coding
  3.2 Build
  3.3 Validation
  3.4 Sandbox
  3.5 Pull Request
  3.6 Continuous Integration Pipelines
  3.7 Shipping changes safely
Chapter 4 Topology
  4.1 Physical Topology
  4.2 Service Topology
  4.3 Virtual Machine Topology
  4.4 Stamp Topology
Chapter 5 Grid Manager
  5.1 Grid Manager
  5.2 Deployments
Chapter 6 Traffic Routing
  6.1 SPO DNS Structure
  6.2 Inbound Routing
  6.3 Outbound Routing
  6.4 Azure Front Door
  6.5 Global Traffic Management
  6.6 Microservice Traffic Routing
Chapter 7 Request Processing
  7.1 Request Process Pipeline
  7.2 Request Attribution
  7.3 Service Health Protection Throttling
  7.4 Quota Rate Limits
  7.5 Circuit Breaker
Chapter 8 Identity and Compliance
  8.1 User Sign-In
  8.2 Service-to-Service Calls
  8.3 Resilience to AAD Outage
  8.4 Continuous Access Enforcement (CAE)
  8.5 Compliance & Policies
  8.6 OneDrive Consumer Stack (Within SPO)
  8.7 OneDrive Consumer Stack (Within Consumer Farms)
  8.8 SharePoint Online Directory Service (SPODS)
Chapter 9 Storage
  9.1 Blobs
  9.2 Databases
Chapter 10 Internal Workloads
  10.1 Need for a Separate VM Role
  10.2 Timer Jobs
  10.3 Content Push Service / Search Crawl
Chapter 11 Tenant Life Cycle
  11.1 Forward Sync from MSODS
  11.2 Provisioning Tenants and Users
  11.3 Multi-Geo
Chapter 12 Capacity Management and Disaster Recovery
  12.1 Compute Capacity Management
  12.2 Active-Active
  12.3 Disaster Recovery
Chapter 13 Distributed Cache
  13.1 Azure Redis Cache
  13.2 Azure Redis Persisted Cache
  13.3 SSDCache
Chapter 14 Telemetry Ecosystem – Instrumentation, Monitoring and Alerting
  14.1 Primary Azure Tools Used for the SPO Instrumentation, Monitoring and Alerting Ecosystem
  14.2 SPO-Specific Technology and Tools That in Some Cases Augment Geneva
Chapter 15 SPO Security
  15.1 Security Fundamentals
  15.2 Security Investments at Every Layer
  15.3 Security Monitoring
Chapter 16 SPO SRE
  16.1 Our Vision & Mission
  16.2 Team Structure
  16.3 Incident Response – Follow the Sun
  16.4 Responsibilities and Managing the Service
  16.5 SRE Tooling
Chapter 17 Acknowledgement



Chapter 1 Introduction
This document provides a primer for engineers working on SharePoint Online and OneDrive for Business
(SPO and ODB). It gives an overview of the major subsystems of the service and how they relate, as of
2022 unless noted otherwise. The SharePoint Online service debuted more than one decade ago in June
2011. It is a component of Office 365 which consists of Exchange Online, SharePoint Online, Teams,
Office Client Subscription, the Online Platform services (admin and billing) and many others. Customers
purchase tenancies through Office 365, which enables them to grant licenses to a set of users for a
monthly fee per user. Once users are licensed, they are authorized to access sites in SharePoint Online.

SharePoint Online and OneDrive for Business are a living system, and as such, our documentation lives as
well. We intend to refresh the content from time to time. Any errors, omissions, and suggestions for
future content should be directed to [email protected] or [email protected].

Much more documentation about ODSP can be found at:

http://aka.ms/odspwiki - All-up Wiki for ODSP

http://aka.ms/spowiki – SharePoint Online Wiki

In addition, the authors have included links to specific areas throughout this document, so viewing it
online has an advantage over a printed copy.



1.1 Service is a Verb
A service is something you do, not something you ship. It is an ongoing activity and responsibility
because customers rely on it to always be available. Our customers are in every region on the globe, so
it is always “business hours” for our customers. SPO stores many exabytes (EB) of user data. This is
business-critical data entrusted to us by our customers. It is paramount that the service gives them
access to their data whenever they need it, and that the service protects their data from anyone who
would misappropriate or misuse it. While SPO has a great SRE team as the first line of defense against
any problem in the service, it really takes every team, across every discipline, to take this responsibility
to heart, and do the best we can to ensure the service is available to customers whenever they need to
use it.

1.2 Glossary
The following are a few terms that appear frequently in this document. The brief definitions give the
reader a quick idea of what they are and make the context more meaningful. The relevant chapters
will provide deeper descriptions of these concepts.

Grid Manager - the central controller responsible for deploying and maintaining the SPO service.

Farm Stamp - a unit of SPO deployment and management.

EDog, SPDF, Prod Bubble, Prod – Deployment rings starting from the innermost one

Tenant – Represents a customer in SPO, usually a company with a varying number of users.

Content Database – A type of database which stores the metadata of customer content.

OneDrive for Business – A type of SharePoint site specially designed for personal use.

SharePoint Team Site - A type of SharePoint site specially designed for team collaboration.

Sandbox – Test VM(s) which run most of the SPO components.

COSMOS, MDM, MDS and Geneva – Telemetry systems used by SPO

1.3 Service Capabilities


SPO provides a rich set of functionalities over the internet, including document storage and
management, team collaboration sites, file synchronization via a sync client, tight integration with
Microsoft Office clients, and more. It is also increasingly becoming a platform for developers and
partners to build more sophisticated solutions. From a customer’s point of view, an SPO tenancy
provides many advantages over an on-premises installation. First, it is worry free. Customers do not
need to worry about installing and managing any hardware or software. All the functionality comes with
the service in a seamless way. Second, it is cutting edge. When it comes to security improvements, or
new functionality, or backend hardware, Microsoft engineers will ensure the best practices are being
followed, and latest software improvements are rolled out to the customers without any action required
from them. Finally, it provides peace of mind. Customers can count on the support from the SPO team in
case their site or data runs into any problem.

As of September 2022, the SPO service runs from many data centers across the globe. It stores more than
2 EB of user content and serves over 250 million monthly active users, with a peak daily RPS of over 4 million.



SPO has different types of licenses to suit customers of different sizes. Large corporations can set up
their tenancy such that multiple subsidiaries are placed in different geo locations to meet data
sovereignty requirements. They can even link their local on-premises SharePoint installation with their
cloud-based tenancy in SPO, giving their users a good cloud experience while keeping tight control of
the critical data.

Tenancies typically have a domain name of the form “companyname.sharepoint.com”. Tenants can also
associate site collections with “vanity” domain names, so that public facing site collections can be
addressed via either “companyname.sharepoint.com” or simply “companyname.com”.

The scale unit of SharePoint content is called a site collection. Each tenancy can have up to 2,000,000
site collections with some exceptions. Tenant administrators can assign users varying levels of access to
each site collection individually.

The Service Level Agreement (SLA) promises 99.9% uptime to customers, with financial penalties if the
service falls below that level of reliability. “Uptime” is essentially defined as the service being fully
accessible with no reduction in functionality. Read-only time is considered downtime. SPO has established a
dedicated SRE team as the first line of defense against any service instability. If there is anything the SRE
team cannot resolve on its own, the engineering team area experts will be engaged. SPO also has
monitors which constantly try to access synthetic heartbeat tenants in every database and will page OCE
when any farm’s availability dips below 98%.

The SPO service builds in many layers of redundancy. Two identical copies of user content are saved to
the Azure blob service in different geo locations. Databases are also replicated to different Azure
availability zones to ensure minimal loss in the event of a disaster. The service builds deployment units
in pairs so that user traffic can be switched to a different set of servers when problems occur in one data
center. The service also ensures that data copies are available to allow customers to request restoration
of accidentally deleted data to any point in time in the recent past.

SPO is also working to obtain various certifications, including FISMA, HIPAA, ISO27001, and SAS70. As
part of these certifications, the service must be able to provide audit logs to customers of everyone who
has accessed customer data or Personally Identifiable Information within their tenancy.

1.4 Dependencies
SPO has dependencies on many partner Office 365 and Azure services:

1 Azure Active Directory. AAD is the central Office 365 repository for tenant and user identities. All
information in SPO about tenancies and users is replicated from AAD. AAD also provides
authentication for SPO users. The “Identity” Chapter has more information about this.
2 Azure Front Door. AFD handles chatty HTTP traffic from a server that is very close to the end
user, which greatly boosts network performance and reduces latency. The “Routing” chapter has
more information about this.
3 SQL Azure DB. SPO stores user content meta data in SQL databases. The “Storage” chapter has more
details on how SPO utilizes this service.



4 Azure Storage Block Blob. SPO stores user content in ABS. The “Storage” chapter has more details
on how SPO uses this service.
5 Azure Redis. The “Storage” chapter has more details on how various SPO components use this
network-based cache service and how it is managed.
6 Azure DNS and Traffic Management. The “sharepoint.com” zone is one of the largest DNS domains on
earth, and it is hosted by Azure DNS. SPO also uses the traffic management functionality to route
services like lists.microsoft.com to the SPO service. The “Routing” chapter has more details on these
topics.
7 Substrate. Substrate is a storage service provided by Exchange Online. The FAST service provides
search-related functionality for SPO customers; its data, such as the index catalogs, is stored in Substrate.
8 Cosmos and Geneva. These are the two main services SPO uses to store its telemetry data for both
diagnosis and data mining.
9 Azure VMSS. The majority of SPO's compute nodes run on its own bare-metal servers. It has
started running a very limited number of compute nodes on Azure VMs, and the number is expected
to grow over the next few years.

1.5 Adding New features


Each feature that is added to SharePoint Online must conform to the design of the service. In particular,
the feature must support all the horizontal attributes of the service described in this document, such as
flighting, dark deployment, build-to-build upgrade, and disaster recovery.

It is particularly important that engineers understand the telemetry necessary for their features. As
described in later chapters, engineers are generally prohibited from attaching debuggers in the
production environment. Therefore, all tuning and debugging must be done via telemetry and logging.

For each of the horizontals, designs that conform to existing idioms in the service can be expected to
rely on existing mechanisms. Designs that break existing idioms must account for implementing the
necessary functionality to satisfy the horizontal requirement.



Chapter 2 Service Design Principles
We design SPO to be secure, reliable, scalable, highly available, manageable, testable, debuggable, and
cost-efficient. The following design principles support these goals. Note that these goals can be
contradictory to each other at times. Making things more secure may make it less cost efficient, or less
debuggable, etc. The general priority order we adhere to in SPO is Security, Availability, Reliability and
Performance, Cost.

2.1 Secure
Design Point: Assume malicious users. SPO must defend our infrastructure and our customers from
malicious tenants, malicious users, and malicious engineers. We must guarantee to our customers that
no one can access content they don’t have rights to, whether the attacker is an authenticated user or an
unauthenticated user. This must be true within a given tenant and between tenants.

Design Point: Assume that security measures will fail. Malicious users will figure out how to execute
code on front-end servers. Therefore, we must impede the attacker’s progress by running local services
with least-privileges and limit the attacker’s impact by ensuring that the server’s identity can impact
fewer than 1% of tenants. No component of the system should ever fully trust another component in
the system.

Design point: Assume that Microsoft engineers are malicious. Engineers must never have access to
Customer Content without manual approval from ODSP leaders. When approval is granted, it is on a
tenant-by-tenant basis. Customer Content must be encrypted before storage such that engineers with
access to the storage location cannot view clear-text content without going through SPO’s gated access
mechanism. This must be true of all services that handle SPO Customer Content, including partner
teams and Azure dependencies.

Design Point: It is impossible to parse binary files without introducing exploits. All code that parses
binary files must run in a sandbox and should “gatekeep” by rejecting invalid or malformed input where
possible.

Design Point: Never allow users to consume infinite resources. Users may accidentally or intentionally
attempt to consume datacenter resources to induce a denial-of-service condition. No user should ever
be allowed to consume sufficient resources to degrade the experience of other users.

Design Point: Audit everything, both Microsoft actions as well as customer actions. This allows SPO to
meet our compliance requirements and enables customers to perform self-service investigations
without engaging Support. Operations taken by engineers in our system must be approved by a peer or
leader and are always logged and audited.

Design point: Features must meet SPO’s security, privacy, and compliance requirements from day one.
These properties must be designed-in from the start by the feature owner, not bolted-on after shipping
or outsourced to another team.

Design point: The buck stops with us. While we take a One Microsoft approach to leveraging the best
people, processes, and technology to achieve our security goals, the ultimate accountability for
protecting Customer Content rests with us and cannot be outsourced or delegated. Our aim is for SPO to
be the safest cloud for the world’s most valuable data.



2.2 Reliable
Design point: expect failure. Hardware will fail daily. Partner services will go down or give incorrect
output. Processes will crash. SQL servers will corrupt data.

Design point: all management operations must be transactional or idempotent. Usually, you cannot
make any action involving more than one machine truly transactional. Therefore prefer to make
management code idempotent, meaning that assuming x is the state of the system, then fn(x) =
fn( fn(x) ).
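
To make the idempotency property concrete, here is a minimal illustrative sketch in C# (the scenario is invented for this example and is not actual SPO management code): the gesture is phrased as "ensure the end state", so running it once, twice, or concurrently converges on the same result.

```csharp
using System.IO;

// Illustrative only: an idempotent management gesture phrased as "ensure X exists"
// rather than "create X", so that fn(x) == fn(fn(x)).
public static class LogDirectorySetup
{
    public static void EnsureLogDirectory(string path)
    {
        // Directory.CreateDirectory is a no-op when the directory already exists,
        // which is exactly the property we want from management code that may be
        // retried or run on multiple machines.
        Directory.CreateDirectory(path);
    }
}
```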

Design point: Devise a strategy to deal with failures and auto mitigate wherever possible. Retry is often
used as a mitigation, but one must not retry forever. Most failures are intermittent, so a retry is
frequently successful. However, infinitely retrying is always a bad idea; it creates a hang, consumes
resources, and can lead to denial-of-service attacks.
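
As a minimal sketch of this principle (illustrative only, not actual SPO retry infrastructure), a bounded retry with exponential backoff retries intermittent failures a few times and then surfaces the error instead of looping forever:

```csharp
using System;
using System.Threading.Tasks;

public static class BoundedRetry
{
    // Retries an intermittent operation up to maxAttempts times with exponential backoff.
    // After the final attempt the exception propagates to the caller; we never retry forever.
    public static async Task<T> RunAsync<T>(Func<Task<T>> operation, int maxAttempts = 4)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // Back off 200 ms, 400 ms, 800 ms, ... to avoid hammering a struggling dependency.
                await Task.Delay(TimeSpan.FromMilliseconds(200 * Math.Pow(2, attempt - 1)));
            }
        }
    }
}
```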

Design point: be humble about writing automatic failure handling code. Handling failure conditions is
incredibly hard to get right, and typically even harder to test. Come up with a failure handling strategy
for your feature at the design phase. And all strategies must be as simple as is feasible. Complex failure
handling strategies often end up introducing more potential failure points. For major failures that can
impact the service, don't be afraid to inject a human into the loop where judgment calls are necessary.

2.3 Deployable
Design Point: Never hardcode configuration settings. The service is deployed in many different
environments, from local-on-machine Sandboxes to EDog to Production to Sovereign clouds. Farm
stamps are different in each environment and will change over time. All configuration-specific
parameters must be read from a deployment configuration file and must never be hardcoded in scripts
or Grid Manager.
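
As a hedged illustration of this principle (the file name, schema, and types below are invented for the example; the real SPO deployment configuration format is different), environment-specific values are read from a configuration file deployed alongside the code rather than being baked into scripts or binaries:

```csharp
using System.IO;
using System.Text.Json;

// Hypothetical settings shape; the actual SPO deployment configuration schema differs.
public sealed record DeploymentSettings(string FarmStampName, string GridManagerEndpoint, int FrontEndCount);

public static class DeploymentConfig
{
    // Loads environment-specific values (Sandbox, EDog, Prod, Sovereign, ...) at run time,
    // so the same code can run unchanged in every environment.
    public static DeploymentSettings Load(string path = "deployment.settings.json")
    {
        string json = File.ReadAllText(path);
        return JsonSerializer.Deserialize<DeploymentSettings>(json)
               ?? throw new InvalidDataException($"Could not parse {path}");
    }
}
```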

Design Point: Write managed jobs wherever possible and keep PowerShell code to a minimum,
deprecating it where you can. PowerShell is a scripting language and does not have the support
infrastructure of a full-fledged programming language. It can be useful at times due to its late binding
but becomes cumbersome when complex logic is needed.

Design Point: Upgrades must involve zero service downtime. We will not take the service offline when
upgrading components, though we do allow for limited read-only windows (a few minutes each month).
All components must be capable of upgrading without going offline. All components must deal with
other activity in the service during their upgrade.

Design Point: Upgrade must be fully parallel. It must never be the case that only one of an entity (farm,
VM, network, etc.) can be upgraded at once. We will soon have too many farms in the service to do
anything one at a time.

Design Point: Mixed versions of components are normal. Every component must deal with multiple
versions of related components running at once. Grid Manager and Farm stamps are upgraded at
different times. The farms within the same farm stamps are also upgraded with an intentional time lag
in between. This process can take days to weeks. Every component will run most of the time with mixed
versions of other components in the system.



Design Point: All features must deal with read-only time gracefully. Because we will have read-only time
every month, all features must deal with it gracefully and fully automatically. Components must not lose
customer data on the transition from read write to read-only time.
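
A minimal sketch of what read-only awareness in feature code might look like (the interface and class here are hypothetical, not actual SPO APIs): the write path checks the database state and degrades gracefully instead of failing deep in the stack or losing data.

```csharp
// Hypothetical abstraction over "is my content database currently in its read-only window?".
public interface IDatabaseStateProvider
{
    bool IsReadOnly { get; }
}

public sealed class ListItemWriter
{
    private readonly IDatabaseStateProvider _state;

    public ListItemWriter(IDatabaseStateProvider state) => _state = state;

    public bool TrySaveComment(string comment)
    {
        // During read-only time, fail the write gracefully so the caller can show
        // a friendly "try again later" message instead of dropping the customer's data.
        if (_state.IsReadOnly)
        {
            return false;
        }

        // ... persist the comment to the content database here ...
        return true;
    }
}
```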

Design Point: All deployments must follow Safe Deployment Policies.

2.4 Manageable
Design Point: Lights-out management. Do not assume that engineers can intervene in management
activities in the normal course of events. Do not assume any live person is actively monitoring capacity
or reliability. Assume command and control are down during serious incidents. Design mitigations to be
automatic and self-contained.

Design Point: No gestures should be made machine-specific, farm-specific, or network-specific. No
management gesture, deployment gesture, or telemetry can be farm-specific. All actions must fan out
over all necessary machines and farms automatically.

Design Point: Codify and test service cross-version compatibility contracts. Services are never upgraded
in lockstep, therefore cross-version compatibility must be perfect. The contract must be codified, and a
test architecture must be used to enforce the contract.

Design Point: Critical situations need to raise alerts with appropriate severity levels. Pageable alerts
need a high-quality TSG (troubleshooting guide).

Design Point: Use telemetry to understand and optimize the service. SPO has a rich set of telemetry and
logs saved in COSMOS, MDM and MDS. Study both the synthetic test traffic and real user activities.

2.5 Testable
See the Validation section in the SPO Engineering chapter.

2.6 Debuggable
Design Point: Assume all debugging will occur via log inspection. In general developers cannot get
access to machines containing customer data in production. Design your logs such that you will have all
the information you need to debug remotely using only logs.

Design Point: Support dynamic logging levels. When debugging issues, you usually have too much log
data or not enough. Your life will be easier if you can dynamically enable or disable specific component
logging or verbosity without deploying new code.
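
The following is an illustrative sketch (hypothetical types, not SPO's actual logging stack) of a dynamic logging level: verbosity is held in mutable state that can be changed at run time, so turning detailed logging on or off requires no new deployment.

```csharp
using System;
using System.Threading;

public enum LogLevel { Error = 0, Warning = 1, Info = 2, Verbose = 3 }

public static class DynamicLogger
{
    // Current verbosity; can be flipped at run time (for example from a config poll or an
    // admin command) without deploying new code.
    private static int _currentLevel = (int)LogLevel.Info;

    public static void SetLevel(LogLevel level) => Interlocked.Exchange(ref _currentLevel, (int)level);

    public static void Log(LogLevel level, string component, string message)
    {
        // Messages more verbose than the current level are skipped cheaply.
        if ((int)level > Volatile.Read(ref _currentLevel))
        {
            return;
        }

        Console.WriteLine($"{DateTime.UtcNow:o} [{level}] {component}: {message}");
    }
}
```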

Design Point: Utilize the sandbox, especially the local sandbox which lives on an engineer's own machine.

2.7 Scalable
Design Point: Use the content DB. The content DB has a pre-built solution for high availability,
backup/restore, and disaster recovery, along with a solid partitioning story that will work well in SQL
Azure. Do not add a new database role without an overwhelmingly strong reason.

Design Point: Grid Manager must not be critical path for user latency. No code should call into the Grid
Manager in a way that impacts the end user’s perceived latency. Assume Grid Manager is very far away
with high latency (as is the case in all datacenters outside North America).



Design Point: Replication is (probably) not the answer. Replicating large data sets from one place to
another is expensive in terms of I/O, network bandwidth, and disk space. Furthermore, it limits
scalability whenever a full re-sync may be required -- we can only scale up in so far as the full replication
time doesn't become untenable.

Design Point: Tenants and data move. As SQL servers, farms, and networks become unbalanced,
tenants’ data will be moved to restore balance. Do not assume data will stay where it is originally
placed, and do not preclude quickly moving data from one place to another. Do not introduce cross-
service dependencies that complicate moving data from one database to another.

Design Point: Be careful with local optimization. Frequently optimizing one part of the service can cause
bottlenecks in other parts of the pipeline. Be sure to keep the overall design of the service in mind when
optimizing particular components.

Design Point: When adding a new subsystem, consider designing for large scale and reducing resource
waste due to overly granular partitions.

2.8 Cost Efficient


Design Point: Storage is the most expensive part of the service. Azure Storage and Azure SQL together
account for more than two-thirds of the SPO service's annual spending. Storing metadata (SQL) is
several orders of magnitude more expensive than storing the actual user content (Blob). Be efficient
about the amount of data stored, especially metadata.

Design Point: I/O operations are the most expensive part of storage. Data at rest is significantly cheaper
than data in motion. Reduce SQL I/O to the bare minimum.

Design Point: Don’t raise the cost per user. No single feature may raise the per-user cost of running the
service.

Design Point: Zero cost for inactive tenants. Many of the tenants in the system will do little with it. Drive
the cost of those tenants as close to zero as possible by driving their resource consumption as close to
zero as possible.

2.9 Highly Available


Design Point: Load-balance compute. Ensure redundancy exists for the metadata and blob stores.

Design Point: Prefer failover. If the system is misbehaving, failover by default to a known working
system.

Design Point: Keep the service up at all times. An error response is better than no response. A successful
response is better than an error response. Those signals prove useful in debugging what is wrong with a
service.

Design Point: Dependencies that are key to the operation of the system need a failsafe, e.g., AAD or
DNS. SPO should not go down hard if a partner dependency goes down hard. For dependencies that are
ancillary to the system’s functionality, SPO should continue to work if they go down.



Chapter 3 SPO Engineering
SharePoint’s Engineering System is based on a combination of One Engineering System (1ES), Office
Engineering (OE) and our own custom code. SharePoint originated as a part of the Office repository, and
while we now build a service, not boxed software, we still have remnants of the legacy client builds.

This chapter does not aim to be a comprehensive manual or background on all things Engineering. I
encourage you to bookmark https://aka.ms/spogithelp as your starting point for all documentation on
the SharePoint Online Engineering system as well as how to get started. Along with that, please make
sure you join ODSP Engineering on Teams (https://aka.ms/odspeng), and contact us in the SPO
Troubleshooting channel.

Your primary point of entry for any SPO-specific operations within the codebase should be the dev
command. Run dev without any parameters to see a list of what it can do, similar to git.

In ODSP, we are committed to building a highly trusted product: one that is highly reliable,
scalable, and cost-effective. To that end, we have policies in place to ensure that your changes are
healthy, and we have mechanisms to control the rollout speed of changes. Read on to learn more about
these.

Finally, the Engineering Systems in ODSP are supported by the Engineering Fundamentals team. If you
have other questions, feel free to reach out to us at [email protected].

3.1 Coding
SPO code is in Git hosted in AzureDevOps. Given the size of our repository, we have the luxury of not
needing to use add-ons such as VFS for Git or Scalar. We just use vanilla Git and as such you can use
whatever Git compatible tooling you want.

SPO is built using a mixture of languages: C#/C++/PowerShell for the product; C#, PowerShell, and some Perl
for internal tooling.

SPO has first-class support for Visual Studio Enterprise and the CMD and PowerShell command shells.
You are, of course, free to use whatever text editor or Git tools you like.

All coding must be done in user-created topic branches and pushed via a Pull Request.

3.2 Build
3.2.1 Last-Known Good Builds
SPO has a notion of “Known Good” builds. It is recommended to branch from a Known Good (KG)
branch for your build to succeed, while main HEAD is also buildable in most scenarios (If main HEAD is
not buildable in your scenario, please branch from a KG).

Known Good (KG) Git branches are created several times every day, and are named:
origin/build/main/<checkpoint version>

The Last Known Good (LKG) branch is always named origin/build/main/latest. You can list the current
LKG and previous KG branches by running dev lkg list.



3.2.2 How SPO.Core is Componentized
The SPO.Core repository is split into individual projects; sometimes called “Office Projects” due to their
legacy with the Office build system. In our previous source control system, you could map certain
projects to your local enlistment, but with Git, the only impact is on build speed. Each project can either
be built locally, or have its build outputs imported (i.e. downloaded) from the cloud. Building all projects
is possible but not recommended as it takes 12+ hours. It is generally most efficient to build only the
projects you expect to make changes in.

To take advantage of this, run dev scope add to specify which projects are built locally; more
projects are also built locally if they have local changes or changes between your local branch and the
nearest KG; other projects will be imported. You will want to check with your team to understand which
projects to pull in.

If you would like to only build projects added by dev scope add, run dev scope set AddStaleProjects
false. If you would like to know the status of each project, run dev scope status -g and/or dev scope
blame.

3.2.3 Build Logic


SharePoint is ultimately built with an engine called BuildXL (or Domino – the original codename). The
internal script format that Domino uses is not exposed to the user for various reasons. SPO exposes
NMAKE-based files instead.

So, What happens when I build SharePoint Online?


Good question; when you build SharePoint Online, a process called MetaBuild is run before the build
actually starts. This converts the NMAKE files to DScript (DominoScript) as well as generates CSPROJ files
that can be opened in Visual Studio. Once that is done, the BuildXL engine runs on the DScript files.

This does imply that if you change anything in the NMAKE files, you must re-generate the DScript and
CSPROJ files.

How to build?
Easy!

Build your scoped projects: dev Build

Build your current folder and upstreams: dev BuildIncremental

Ok, so what does this all really do under the covers?


Curious about what happens under the covers? If so, read on.

Note to the reader: this section is pretty dense and contains a lot of jargon. We attempt to define terms
here:

PIP - A PIP (Primitive Indivisible Process) is the smallest unit of work tracked by BuildXL's
dependency graph. Generally these are process invocations but may also include other
primitives like WriteFile or CopyFile pips, and Service and IPC pips.



BuildXL - BuildXL (Microsoft Build Accelerator) is a build engine originally developed for large
internal teams at Microsoft. It leverages distribution to thousands of data center machines
and petabytes of source code, package, and build output caching. Thousands of developers
use BuildXL on their desktops for faster builds.

DScript - The DScript scripting language is used by BuildXL as one of its front-end build specification
languages. DScript is based on TypeScript.

MSBuild - MSBuild (Microsoft Build) is a platform for building applications. MSBuild provides an XML
schema for a project file that controls how the build platform processes and builds
software. Visual Studio uses MSBuild to load and build managed projects, but MSBuild
doesn't depend on Visual Studio. The project files in Visual Studio contain MSBuild XML
code that executes when you build a project by using the IDE.

DBuild's job is to write DScript (new name for DominoScript) and to call BuildXL (new name for Domino)
to run that DScript. Initially, dbuild does not know anything about your enlistment, including which
projects you are enlisted in. Turns out that trying to scan all of your source tree to determine what to
build is expensive. So there is a process to do this for us (metagraphcreator.exe) which scans your
enlistment, finds which projects you are enlisted in, and writes the DScript needed to parse the
NMake/Sources files for those projects. Since metagraphcreator takes time to run, DBuild writes dscript
to call metagraphcreator, and calls BuildXL on that Dscript. Therefore the speed to determine what you
are enlisted in in the median build case is the speed at which BuildXL can figure out something has
changed.

Once we have called BuildXL for the Enlist Build, we have Dscript for each project which knows how to
run nmake/msbuild. However, we still don't know what the inputs and outputs are for each compile, link,
etc. So we run nmake/msbuild in a special mode in the metabuild to read the make logic but not
execute it. Instead, it writes out what it would do to a file which we call the Static Graph, and we convert
the static graph into DScript. That DScript is then executed under a third call to BuildXL, called the Product
Build.

The end goal is to build just the PIPs that you have selected by giving arguments to DBuild. Each PIP has
tags on it corresponding to the build directory it is a part of and the project it belongs to. BuildXL allows
you to filter to specific specs using a boolean string filter. For example, when calling “dev bi”, we will
filter to 'dir:d:\SPO\dev\sts' if you build from the d:\SPO\dev\sts directory. Any pip in the d:\SPO\dev\
sts\stsom directory would be tagged with 'dir:d:\SPO\dev\sts\stsom' as well as 'dir:d:\SPO\dev\sts' and
'dir:d:\SPO\dev', so building in any of those directories would pick up the pip.

In order to build a PIP in the Product build, we have to ensure that the DScript is available to read and
work from. This means filtering to the pip which produces it in the metabuild, and any DScript it might
need. Metabuild pips are grouped by project, so if you want to compile one file in the STS project, you
need the Dscript for the STS project, as well as any upstream projects STS might import from. Therefore,
the metabuild is filtered to the projects you care about + upstream projects. Notice that if you are
enlisted in 100 projects but choose to only build the top projects, it will never even look at the
Nmake/Msbuild build logic for the other parts.



Build Step: Enlist Build Prep
  What does it do? Creates the DScript needed to run the enlist build (enlistbuildrunner.dsc).

Build Step: Enlist Build
  What does it do? Determines which projects are in your enlistment.
  Scope of the Pip Graph: Each pip works on the entire enlistment.
  Scope of what is filtered to: There are no filters used.

Build Step: Metabuild Prep
  What does it do? Reads the outputs of the Enlist Build to determine which upstream projects need to be filtered in the metabuild.

Build Step: Metabuild
  What does it do? Reads the make logic without executing the build graph; writes DScript that contains the compiles/links/etc. and their inputs/outputs.
  Scope of the Pip Graph: The platform/flavor you are building (e.g. x64/debug), all projects for that platform/flavor.
  Scope of what is filtered to: The project you want to build and all upstream projects.

Build Step: Product Build
  What does it do? Runs compiles, links, csc, etc.
  Scope of the Pip Graph: The project you want to build and all upstream projects.
  Scope of what is filtered to: The PIPs you want, as filtered to the current working directory or other filters.

More information about BuildXL (Domino)

Internal official doc - https://dev.azure.com/mseng/_git/Domino?path=/README.md&_a=preview

Ben Witman gave a talk describing the internals of BuildXL including how cache lookups and sandboxing
is done. Recording is here - https://msit.microsoftstream.com/video/9fc52d3a-390d-4e77-b201-
e382cfa55408?list=user&userId=2672a5d1-cf09-45ae-96d5-e97a23a2edcd

3.3 Validation
SharePoint Online leverages a “shift left” philosophy (with “left” being closer to the moment of change
and “right” being closer to the end of deployment) with the goal of running most validation as a gate in
the Pull Request in order to ensure the main branch is always clean. We currently (as of Oct 2022) have
about 3500 functional tests and 43000 unit tests running in each Pull Request.

Figure 3-1 Validation Loop

SPO.Core supports 3 main classes of tests. Lower classes of tests are more reliable and faster. Having
said that, we acknowledge that lower classes of tests may be harder to write. Given the number of tests
we have and growth we anticipate, this speed and reliability at run-time is far more impactful than the
added time it often takes to write them. Please try to write as much as possible as unit tests. If you have
questions on how to do this, or your area presently does not have any, reach out to the ODSP
Engineering Fundamentals team.

The following is our organization’s official terminology:

L0 Tests Tests that run rapidly, make calls directly against compiled assemblies, do not require
deployment/installation, and have no external dependencies or state. Also commonly
referred to as “unit tests”. In SPO, these run in QTest as part of the build. The test harness
is MSTest. (This will move to MSTestV2 in CY2023.)

L1 Tests Tests that execute against compiled assemblies and require SQL state. These do not take
external dependencies or require the product to be running. For these tests in SPO, we
execute a full dev deployment, because this is how SPO SB state is deployed. These run in
CloudTest using MSTestV2 with VSTest.

L2 Tests Tests that execute against web APIs. These require a full dev deployment and require that
the product is running. These may take limited external dependencies. These also run in
CloudTest using MSTestV2 with VSTest.
L3 Tests Tests that execute with full dependencies and a full environment. In SPO these would run
against SPDF or MSIT. EFun currently has no support for these, though some teams do
run L3 tests that they manage.

L0 tests can be run directly on your build machine. L1 and L2 tests can be run against a deployed
Sandbox.
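
For illustration, an L0 test is just a plain MSTest unit test that runs in-process against compiled assemblies; the class under test and the assertion below are made-up examples, not actual SPO tests.

```csharp
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class UrlNormalizerTests
{
    // L0: no deployment, no SQL state, no external dependencies.
    [TestMethod]
    public void Normalize_RemovesTrailingSlash()
    {
        string normalized = UrlNormalizer.Normalize("https://contoso.sharepoint.com/sites/team/");

        Assert.AreEqual("https://contoso.sharepoint.com/sites/team", normalized);
    }
}

// Hypothetical production type under test, defined here only to keep the sample self-contained.
public static class UrlNormalizer
{
    public static string Normalize(string url) => url.TrimEnd('/');
}
```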



Primary guidance and principles for testing in SPO are:

1. Code must be tested. Plan for testing as part of each feature. Do this hand-in-hand with the
feature design rather than after the fact to ensure that your design is testable and that testing is
funded. Grafting on testability later is generally considerably more expensive.
2. Tests are treated as product code and should get the same level of care as any other product
code. Tests must be reliable, must be correct, and must be high quality code. Tests must also
have clear owners.
3. Ensure that others can test on top of your feature. If you are building a feature in SPO, you need
to test your own feature, but you must also ensure that consumers of your feature can test
reliably.
4. Prefer smaller / simpler tests. In general, L0 is better than L1 is better than L2. Smaller tests are
faster and more reliable. Cover most functionality with L0 whenever possible. Fill in gaps and
cover more complex integrations with L1 and L2 tests.
5. Make the product testable. Quality product engineering accrues to testability. Tightly coupled
code, long dependency chains, and unstated assumptions/contracts all make it much more
complex to test. Testable code is a feature on its own, but also generally indicates better-engineered code.

Testing is key for product quality, and test reliability is key for engineering experience. Low
reliability tests create pain for every engineer who runs them. EFun tracks reliability and is actively
driving up reliability through flaky test management (including disabling low-reliability tests).

More details and guidance are in the validation wiki. There is also an SPO test authoring training
available, including a recorded presentation.

The SPO Test Authoring channel in the ODSP Engineering Team is also an excellent resource for
writing and supporting tests.

3.4 Sandbox
Sandbox is an internal-facing Azure service that hosts ad-hoc integrated testing environments for over
750 ODSP engineers who depend on it daily to increase their productivity.

Sandbox is a highly available, performant, worldwide service. It is deployed in Azure Functions, built on
top of Azure DevTest Lab and tightly integrated with Azure DevOps. It provides safe environments with
isolated test deployments of services, allowing engineers to develop, debug, and perform integration
testing in a simulated datacenter.

With a Sandbox checked out from Azure DevOps UI, engineers can:

- Access the integrated testing environment immediately, no waiting time!


- Patch their local code changes to Sandbox via 1-Click Dev Patching Visual Studio Extension
- Local debugging inside sandbox via 1-Click Any Source Visual Studio Extension
- Remote debugging from your own dev box.



Sandbox Image Service takes daily SPO builds and produces daily images in Azure Compute Gallery.
Sandbox Provision Service then re-stamps images into DTL running VMs, enriches the VMs with various
development capabilities, and makes the VMs ready for instant checkout for engineers.

Figure 3-2 Sandbox Architecture

The Sandbox service now provides more than 10 different configs, including basic SPO-OneBox (all SPO
server roles packed in a single VM), Multi-Geo Sandbox, SPO-OneBox with Prod AAD, etc., and several
latest versions for each config. Additionally, Sandbox service provides customization capabilities for
engineers to customize by provisioning self-defined or Sandbox-defined artifacts upon checkout.

For quick start and user guide, see Sandbox Wiki.

For the latest Sandbox features, see Sandbox news.



3.5 Pull Request
The goal of the Pull Request is to execute a comprehensive set of tests and security policies.

We recommend using dev pr, which makes a Draft PR first. You can then publish it from AzureDevOps.

If you want to create a “buddy build” which doesn’t ask for reviewer sign-off, but runs the policies, you
can make a Draft PR.

3.6 Continuous Integration Pipelines


We run several continuous build and test pipelines for SPO. We have rolling builds that run L0/unit tests
as well as continuous integration testing (CIT) runs that run L1 and L2 tests.

In addition to direct Pull Request monitoring, EFun also monitors rolling builds and CIT runs for quality.
These provide visibility into engineering blockers and also help pinpoint new issues when they are
discovered. Below are the primary builds and CIT runs:

• YAML CI Build - OfficialCITReporting – This pipeline matches the build in PR. It runs the full
build matrix for SPO.Core, including SideVar. This also runs L0/unit tests in QTest. This is a
batched build on the main branch that runs multiple times per hour.
• Git CIT PR Validation (Official) - BASELINE – This release matches the CloudTest runs in PR. It
runs all L1/L2 tests that are part of PR. This runs every hour using the latest rolling build.

You can find the full list of CIT releases in the Azure DevOps Releases page. Search for “CIT”. There are
numerous CIT runs that cover the Sidevars, search, project, etc.

Currently Dogfood (SPDF) releases are not gated on the rolling build or CIT. They serve exclusively as
advisory build/test runs.

3.7 Shipping changes safely


After an engineer checks in a change, it is shipped via the rings to our customers or applied to the
SharePoint infrastructure. These changes are deployed via the SPO monolithic train or other deployment
orchestrators (for example: the Monolith train aka GU / Security Patch train, job packages, Grid Manager,
Hotfix, Web, Sync client, etc.). The principles for safe change management are:

• All rollouts follow ring-based deployment (the rollout audience progresses from Office dogfood to Microsoft to several waves in production)
• Rollout is gradual, starting with smaller slices at the beginning (1%, 10%, ...) and ramping up as we deploy more broadly (see the illustrative sketch after this list)
• The rollout policy is enforced in the code. A standard rollout must take a minimum of 14 days for worldwide (WW) rollout, and a hotfix rollout must take at least 48 hours
• No deployment action may impact more than 20% of the ODSP footprint (fault domain)
• Engineers can leverage the override workflow and get GEM approval to deploy faster through the rings
• Rollout from 0% dogfood to 100% Prod takes 14 days at the least (no deployment on weekends)
• Changes deploy from cold to hot instances
• Leverage QoS signals to auto-stop deployment, where available
• There is visibility into changes deployed in central tools to help SRE and OCE (e.g. the ICM change card)


3.7.1 Mechanics of Rings
ODSP continuously rolls out exciting new features, updates, and security patches to keep our customers
happy and help keep the service running healthily. ODSP changes are rolled out using different ship
vehicles, and it is critical that all changes are deployed safely without impacting the customer
experience.

ODSP changes to the Monolith, Web, Sync Client, Mobile Clients, Grid Manager, and microservices use
different shipping mechanisms, but they all follow the same Safe Deployment Principles.

Rings of Deployments – ODSP's phased rollout process starts deployments with smaller audiences before
deploying broadly.

Dogfood Ring – All M365 employees experience the changes.

Microsoft Ring – All Microsoft employees experience the changes.

World Wide Production – M365 Public, Sovereign and Gov customers experience the changes. This ring is further sliced into different stages (aka waves) to reduce the impact of any regression. ODSP customers can select individual users/tenants to get early access to new features via the M365 Targeted Release program.

Figure 3-3 Ring definition for services

3.7.2 Shipping to Dogfood (SPDF)


SharePoint.Core changes are deployed to SPDF up to 8 times every day via the oloop builds. Daily refreshes are measured via the ‘Dogfood freshness’ metric; the current target is that the 80th percentile of SPO changes show up in SPDF in under 24 hours.



3.7.3 Shipping to Production (& MSIT)
The SharePoint.Core codebase ships weekly. The main branch is used as the basis for a weekly production fork which is created at 7PM Redmond time every Tuesday. Even during the change freeze at the end of the year, a weekly production fork is created, although the code is shipped only as far as MSIT and no further. The weekly fork is termed a “GU”, which stands for “Gemini Update” – a name borrowed from planned twin releases of Office back in 2012 – which is one of several vestiges of SharePoint’s deep ties with the Office engineering system before its code moved to git and Azure DevOps.

Tuesday was chosen as ship day because it is in a sweet spot during the week. Most of the weekend
code changes have been committed and have had some bake time in SPDF. Also there is enough of the
work week remaining after the production fork is created on Tuesday to be able to handle any hot Stop-
The-Train type issues for the production fork before the weekend. Tuesday is also traditionally the day
with the lowest number of code changes committed to main, reducing the chance of merge conflicts or
other entropies getting introduced in the fork.

Components such as Grid Manager similarly ship to production every week, aligning with, but shipping
independently of, the monolith. There is a desire, for the sake of simplicity and determinism, to ensure
that all components that collectively make up the OneDrive/SharePoint Online service ship weekly, in
alignment with the monolith. This does not include hotfixes – fixes in response to some hot customer
need – which ship faster so that critical incidents can be fixed rapidly.

3.7.4 ODSP Safe Deployment Policy


ODSP follows policies to deploy changes, and the policies are based on risk; for standard changes they can range from 14 days (low risk) to months (high risk). Hotfixes and rollbacks also follow the safe change policy, which takes a minimum of 48 hours for WW changes. ODSP does have an emergency process to deploy changes ASAP, which can be used in critical situations like a zero-day exploit. More details on the rollout policy are here.



Figure 3-4 Safe Deployment Policy

3.7.5 Tracking Developer changes: FindMyChange (FMC)


FMC is a feature written as an Azure DevOps extension which allows engineers to see the deployment
status of their completed PRs.

FMC is accessed through the PR view in Azure DevOps as illustrated here.



Once in the view, it’s relatively self-explanatory; there are a few key things worth calling out:

1. SPO has a LOT of farms. You can filter to a specific farm or set of farms.
2. Status icons – green checkmark = done, blue spinner = in progress
3. Direct link to the AzDO pipeline containing artifacts being deployed.
4. Due to the way SPO ships, it’s not guaranteed that a single build is the first one deployed to a location that contains the relevant commit – this manifests as a dropdown that lets you view all of them.

3.7.6 Kill-Switch
KillSwitch is a lightweight mechanism that can be used to roll back a code change if a regression is found. It addresses a very specific scenario and is not intended to replace the SharePoint flighting system.

KillSwitches have the following properties:

• There is no manual deployment workflow; by default, the new code is executed for every
environment.
• If a regression is found, a KillSwitch can be "activated". This causes the old code to execute, which
we assume will revert to a "known good" state. (The older a KillSwitch is, the less valid this
assumption will be.)
• A KillSwitch can be activated at farm level, environment level (all farms within a given environment),
or globally (all farms in all environments). No other granularity (e.g. per tenant) is available.
• After a killswitch has been activated/deactivated at a higher scope, it cannot be updated at a smaller scope; e.g., if you have activated it globally, you cannot deactivate it in the EDog environment.
• KillSwitches must NOT stay activated indefinitely; they are a short-term mitigation until a root cause
bug is resolved.

In general, new features and significant code changes should use the flighting infrastructure to gradually roll out. Code changes made with high confidence (and thus able to be undark by default) but which need to be guarded against unexpected consequences should use a KillSwitch. More details at aka.ms/killswitch
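The coding pattern is essentially an inverted feature flag: the new code path runs by default, and the old path is taken only if the switch has been activated. The snippet below is a minimal sketch of that shape; the KillSwitch name and the IsActivated helper are placeholders, not the real SPO API (see aka.ms/killswitch for the actual mechanism).

using System;

// Minimal sketch of the KillSwitch pattern (placeholder API, not the actual SPO implementation).
public static class KillSwitchExample
{
    // In the real system this would consult the KillSwitch store at the farm/environment/global scope.
    private static bool IsActivated(string killSwitchName) => false; // default: new code runs everywhere

    public static string RenderFileSize(long bytes)
    {
        if (IsActivated("UseLegacyFileSizeFormat"))
        {
            // Old, known-good behavior: only taken if the switch is activated after a regression.
            return bytes + " bytes";
        }

        // New behavior runs by default; no deployment workflow is needed to enable it.
        return bytes >= 1_048_576 ? $"{bytes / 1_048_576.0:F1} MB" : $"{bytes / 1024.0:F1} KB";
    }

    public static void Main() => Console.WriteLine(RenderFileSize(5_242_880)); // prints "5.0 MB"
}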



3.7.7 Flighting
ODSP leverages Flight On Rails for dark deployment of changes. The Flight on Rails system is designed to enable engineers to ship their code dark and enable the feature using an on/off switch.

Flighting is recommended for shipping new features dark and can be used to light up a feature at SPO User, SPO Tenant, or SPO Farm scope. Flighting follows all safe deployment principles and leverages the same policies and procedures. Flighting APIs are available in the Monolith, Web, Grid Manager, and other repos across the board, and the Flighting UX and Merlin interface are available for engineers to manage the lifecycle of a flight.
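Conceptually, code shipped dark sits behind a flight check that is evaluated against the current user, tenant, or farm scope, and the feature lights up only where the flight is turned on. The sketch below illustrates that shape with invented names; the actual Flight On Rails APIs differ per repo.

using System;

// Sketch of the dark-ship pattern behind a flight check (invented API, not Flight On Rails itself).
public interface IFlightProvider
{
    // The scope could be a user, tenant, or farm identifier depending on how the flight is defined.
    bool IsEnabled(string flightName, string scopeId);
}

public sealed class NewSharingDialog
{
    private readonly IFlightProvider _flights;
    public NewSharingDialog(IFlightProvider flights) => _flights = flights;

    public string Render(string tenantId)
    {
        // The feature ships dark: the code is deployed everywhere but only lights up where the flight is on.
        if (_flights.IsEnabled("NewSharingDialog", tenantId))
            return "<new-sharing-dialog/>";

        return "<legacy-sharing-dialog/>"; // existing behavior remains the default
    }
}

public static class FlightDemo
{
    private sealed class StubFlights : IFlightProvider
    {
        public bool IsEnabled(string flightName, string scopeId) => scopeId == "contoso"; // flight on for one tenant
    }

    public static void Main()
    {
        var dialog = new NewSharingDialog(new StubFlights());
        Console.WriteLine(dialog.Render("contoso"));  // new experience
        Console.WriteLine(dialog.Render("fabrikam")); // legacy experience
    }
}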

3.7.8 Config
SPO also uses a centralized config system for deploying configs across the worldwide footprint of the service, following the Safe Deployment Policies. Typical configs span from simple name-value pairs to schematized collections of metadata such as background job registration and scheduling, traffic identification hints, or SKU classifications.

3.7.9 SideVar
SideVar is a mechanism that SPO uses to deploy product-wide changes, including but not limited to: large code refactors, .NET framework upgrades, and compiler upgrades. Broad changes like this cannot be put behind a killswitch or a flight; we need a separate mechanism both to roll out these crucial changes and to allow rapid recovery in case of failure. SideVar currently has 4 pre-defined slots for distinct payloads (in addition to the baseline, so 5 total). Payloads are restricted to things that can be modified in the product build itself (i.e. not OS variants or drivers) by changing a build flag. Payloads are then deployed across a thin, but growing, slice of production farms. At any stage in the deployment policy, a payload can be graduated and folded back into the baseline as a regular commit. SideVar includes API monitoring and alerting; any API on a payload that shows a QoS deviation from the baseline of more than 5% gets a Sev3 alert, which goes to the team that owns the relevant payload.

3.7.10 InfraVar
Today a large category of infrastructure or platform changes (scoped to the VM level) is rolled out manually due to the lack of an automated solution. In addition, no good solution existed to deploy changes to the network driver and TCP/IP stack at the virtual machine level. Examples include domainless changes to the SharePoint topology, VHDs, updating the OS of the fleet, and deploying No GACs. InfraVariant is the technology to support VM-based flighting for 100% of SP VMs for Grid and SharePoint. More details are in InfraVariant RunBook.docx

3.7.11 Auto change correlation for OCE/SRE


One of the goals of safe change management is to improve the debugging experience for on-call engineers and IMs by having all ODSP changes viewable and queryable in a central location (ICM & UI). We leverage 1ES tools for that outcome:
• Federate Change Management (Central Change store)
• Service Tree (Logical service dependency map)
• Azure Graph (Physical Infrastructure object map across services)



How we leverage Azure Federated Change Management:

We created a fixed schema set for ODSP changes and mapped it to FCM’s fixed schema in the SDK. Changes are automatically ingested from registered change sources using the SPOFCM NuGet package.

Figure 3-5 Auto change correlation architecture (changes flow from Grid Changes and SafeRollout through the SPOFCM SDK into FCM, are correlated against Azure Graph objects based on the impacted object, and surface via Kusto, the FCM UX, Geneva, Lens Explorer, PowerBI, and the FCM Change Card in ICM)

How does auto change correlation work for ODSP?

High level:

• Each ICM Owning Team is mapped to a Service Tree.
• Changes are sent to a service tree.
• A dependency mapping needs to be created between your service tree and other service trees.

Progress:
• We can correlate 70% of ODSP changes that cause regressions and display them in the ICM Change card (Site down)
• 95% of changes causing regressions are displayed in the central Power BI dashboard in < 15 mins (aka.ms/changeoptics)
• We can correlate 3 partner changes: SQL Azure T-Train, Phynet and TOR
• Our learnings were presented to the Azure LT (Jason Zander’s team) and we are working with the FCM team to drive this to be #OneMicrosoft tech (pushing this to be part of the M365 Unified Fabric Effort (UFF) workstream)



Figure 3-6 Auto change correlation in SPO



Chapter 4 Topology
SharePoint Online exists as a set of services grouped into server roles that run on virtual machines. The
virtual machines in turn run on a set of physical machines in datacenters worldwide. Thus, it is important
to understand the physical topology, the virtual machine topology, and the services topology. The layers
operate independently but must be cognizant of each other’s performance characteristics.

4.1 Physical Topology


SharePoint Online runs on hundreds of thousands of bare-metal servers hosted in racks in datacenters around the world. A small set of these rack spaces is leased, but the vast majority are owned by Microsoft. In most cases the racks are dedicated to the SPO service, but sometimes SPO must share racks with other services as well. The physical machines are a mixture of HP Gen9, WCS 5, WCS 6 and WCS 7 models. The following table shows the hardware configuration of these machines. More detailed SKU information can be found in this document.

Model             WCS5                               WCS 6/7
CPU               Intel Xeon E5-2673 v4 @ 2.30GHz    Intel Xeon Platinum 8171M @ 2.6GHz
Logical Cores     80                                 104
Memory (GB)       256                                384
Disk (SSD GB)     6x960 (Raid0) + OS Disk            6x960 (Raid0) + OS Disk
Racks in a Zone   8                                  8
Servers per Rack  36                                 30

Racks are purchased as part of a zone. A zone usually consists of a fixed number of racks, connected via an Agg router. Zones are clustered into colocation centers (colos) within the datacenter. A colo is a set of racks that can be physically cabled together to provide connectivity without going through datacenter switches or routers; intra-colo connectivity is significantly better than inter-colo, which in turn is better than inter-DC. Therefore, the SPO service requires its racks to live in the same DC as the Azure services it depends on. The datacenters which have the largest SPO footprint include San Antonio, Dublin, Amsterdam, Sydney and many more. SPO is also starting to have a presence in small regional datacenters such as South Korea, Qatar, Switzerland, etc.

Every machine is connected to top-of-rack switches (TORs). Racks are networked together by redundant L3 aggregation switches. Load balancing is provided by third-party hardware load balancers running in active/passive pairs. The devices are primarily used to direct inbound connections from customers to the correct SharePoint farm.

Machines are assigned to virtual LANs (VLANs). Machines from the same zone can be assigned to one or multiple VLANs. Physical machines and the VMs they host must be on the same VLAN.



When SPO debuted 10 years ago, machines were networked in the Default LAN Architecture (DLA). We have recently retired all DLA zones and replaced them with the more efficient Collapsed L3 Agg (L3Agg) architecture. In this architecture, each zone has two Agg routers which serve as the default gateway for all VLANs and provide L3 connectivity to DCFX/Fabric and L2 non-blocking connectivity between all devices.

Latency within the datacenter is largely sub-millisecond. However, bandwidth decreases as the network
distance between machines increases. Machines on the same rack have the highest bandwidth
connection, then machines on different racks but in the same colo, then machines in the same
datacenter but different colos, and lastly machines in different datacenters have the lowest bandwidth
connections.

More information can be seen here: Network architecture.

4.2 Service Topology


Customers access the SPO service in a few different ways. Users can access their files via a browser or via applications such as the OneDrive sync client. Third parties can also develop applications, such as backup and migration solutions, that access SPO via APIs. Lastly, tenant administrators can access SPO via PowerShell as well.

Most of the time, user requests go through the Microsoft Azure Front Door service before reaching the SPO web servers. AFD provides a performance boost for the user request. Once it reaches the SPO web servers, the user request is processed by a set of dedicated servers called USRs. These web servers call into other dependent services such as Azure Redis Cache, SQL Azure or Azure Storage to obtain the information necessary to serve the user request. Once a USR gathers all the information required, it assembles it into an HTTP response and returns it to the customer.

Once user content is stored in the SPO service, many internal workloads process that content. For example, the Search service extracts keywords and builds search catalogs out of it. The MeTA service builds image thumbnails and video previews for customers to consume in the future. Various jobs run on the BOT machines at preset time intervals.



4.3 Virtual Machine Topology
SPO workloads run on virtual machines, which are hosted on the physical machines described in previous sections. Virtual machines are deployed via Grid Manager jobs and placed on physical machines using Hyper-V, a Microsoft hardware virtualization technology. The deployment logic strives to ensure redundancy across physical machines and racks.

All physical machines and virtual machines are joined to a few Active Directory domains, such as the YLO001 domain. These AD domains are different from the AD forests used to store tenant accounts, which are called SPODS domains. Two-way trust is established between the resource domain AD servers and each SPODS AD server.

A VM role refers to a type of VM in SPO which performs a certain set of tasks. Usually, multiple VMs of
the same role are provisioned in a farm as a scale out measure and to provide redundancy. The most
common roles in SPO are listed below.

VM Role Description
USR Responsible for serving customer requests
BOT Responsible for running internal workloads such as search crawl and timer jobs
GFE Front end server of the Grid manager service, also runs jobs
SPD AD machine, part of the SPODS AD forest
DSE Local DNS server
DFR DFSR server for infrastructure content replication
WSU Windows Update server
WDS Windows Deployment Server, used for PMachine imaging and IP address allocation
(DHCP - Dynamic Host Configuration Protocol)
MOR Debug boxes
TMT The VM which facilitates communication from Merlin



The following diagram shows the VMs hosted in a typical SPO rack, which hosts about 30 physical servers. Note that the different resources allocated to VM roles are reflected by their different sizes.

4.4 Stamp Topology


Farms are a unit of deployment and management: a set of virtual machines performing certain roles
that get deployed and managed as a unit. Farms are tracked by Grid Manager. The following farm types
exist in the service:

Farm Role name Description

Content Primarily serves user requests to the content DBs.

Federated Search Provide Search related functionalities

InfraCore Infrastructure support such as local DNS

SPODS Active Directory farm that SharePoint authenticates and authorizes against.

GridManager Grid manager service

Stamps are the atomic unit of scalability in SPO. Each stamp consists of a content farm and its associated
SPODS farm. SPO strives to create uniform sized stamps with a few exceptions. A standard stamp
typically consists of 160 USR, 65 BOT and 5 SPODS. Those numbers are subject to change as the
hardware SKU changes.

A zone is the SPO’s hardware purchase unit. A typical zone contains 8 racks and can fit 6 standard sized
stamps in it. A large data center can host tens of SPO stamps while a small data center may only contain
one SPO stamp.

A typical SPO stamp can host more than 10,000 tenants. A stamp can scale out its storage to as much as the Azure platform can support. However, it can run out of web server processing power due to the fixed size of the standard farm configuration and the physical limits of the zone. When a stamp is close to full, SPO starts moving tenants out of that stamp into another.



The following diagram shows a typical SPO zone with various farms hosted in it.



Chapter 5 Grid manager
5.1 Grid Manager
The Grid Manager is the central controller responsible for deploying and maintaining the SPO service. This includes deploying other Grid Manager farms (called Regional Managers). Each Grid Manager uses SQL Azure as a highly available backend service to store data.

Bearing the multi-regional high availability and scalability design principles in mind, Grid Manager has the high-level architecture illustrated in the diagram below. Using DNS and a load balancer, the Grid Manager frontend consists of a few active-active compute-only farms that serve traffic across multiple regions. Behind the backend DNS (SQL failover group), the Grid Manager backend comprises the primary, disaster recovery (DR) and standby SQL farms. At any given point in time, the read-write primary SQL serves most of the traffic to guarantee strong data consistency, and the DR SQL serves only low-priority read-only traffic. Continuous geo-replication is set up from the primary to the DR and standby SQLs to provide the geo redundancy required for business continuity; the SLA is backed by Azure SQL RTO and RPO.

Each Grid Manager has several primary components including:

• Work Engine
o Job Agent
o Managed Jobs
o Job Scripts
• Infrastructure Managers
o Topology Manager
o Machine Manager
o Database Manager
• Secrets Manager
• Tenant Information and allocation
• Change Tracking



The Grid Manager exposes an ASP.NET ASMX SOAP web service on a central bank of front-end machines
(called GFEs) that connect to SQL azure servers for storage (called GBEs). A single instance of the Grid
Manager manages the world-wide service in all datacenters. The GridManager APIs are designed to
work well in the context of a massively scalable global service. They fundamentally assume that any
network request might fail and/or hang in transit. The Grid Manager itself is designed to do very little
processing inside its own code before returning a response to any given request–usually less than 50
milliseconds, ideally less than 10 milliseconds. Typical APIs involve updating records in Grid Manager
databases and little else. Any operation that would take longer is scheduled as an asynchronous job.

All Grid Manager APIs should be idempotent. For example, there is no AddTenant API in the Tenant
Manager. Instead, there is an UpdateTenant API. The first time UpdateTenant is called for a given
tenant, the web service “updates” the tenant into existence. Subsequent calls change the properties of
the tenant. Therefore, multiple calls to UpdateTenant produce a converging result, regardless of success
or failure of individual calls.

This infrastructure provides good flexibility and cross-version capability, easy ability to interoperate with
all kinds of other services and calling agents, and easy debugging. The various web methods in the APIs
follow two main patterns: Gets and Updates. Gets fetch information. Updates can be more complicated
since the work to carry them out can take some time. Most Update methods take an object as the input
and return the same object as the output. The output object would return the current state of the
underlying object in the database, potentially differing from the input if the Update changed some
properties. For example, record IDs calculated by the GridManager will be filled in. Update methods are
used for both initial object creation and for subsequent updates. Therefore, callers can simply specify
the final state they want the object to be in and don’t need to keep track of whether the object already
exists or not.
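As a rough illustration of this update-into-existence convergence (illustrative types only, not the Grid Manager contract), an UpdateTenant-style method can be modeled as an upsert that always returns the stored state:

using System;
using System.Collections.Concurrent;
using System.Threading;

// Sketch of the idempotent Update pattern described above (invented types, not the Grid Manager API).
public sealed record TenantRecord(string Name, string Region, int? Id = null);

public sealed class TenantStore
{
    private readonly ConcurrentDictionary<string, TenantRecord> _rows = new();
    private int _nextId;

    // UpdateTenant "updates the tenant into existence": the first call creates it, later calls
    // converge it to the requested state. The stored state is always returned, so server-calculated
    // fields (such as the record ID) are filled in for the caller.
    public TenantRecord UpdateTenant(TenantRecord requested)
    {
        return _rows.AddOrUpdate(
            requested.Name,
            _ => requested with { Id = Interlocked.Increment(ref _nextId) },
            (_, existing) => requested with { Id = existing.Id });
    }
}

public static class TenantDemo
{
    public static void Main()
    {
        var store = new TenantStore();
        var first = store.UpdateTenant(new TenantRecord("contoso", "NAM"));
        var retry = store.UpdateTenant(new TenantRecord("contoso", "EUR")); // retrying or changing is safe
        Console.WriteLine($"{first.Id} {retry.Id} {retry.Region}");          // same ID, converged region
    }
}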

5.1.1 Work Manager


The Work Manager is a simple job scheduling and execution system. It allows long running and
complicated actions, provides a persistent record of jobs executed, allows automatic retries on failure
and supports scheduling jobs to run on a schedule. The majority of jobs executed in the Work Manager
are related to tenants (external customers), SharePoint deployment, SharePoint upgrade, machine
management, and diagnostics/joblets.

The service schedules jobs using an algorithm that provides a limited degree of fairness between
networks. Because most jobs scheduled for execution in the Work Manager require resources within an
individual farm, ensuring effective concurrency across all networks leads to an improvement in service
efficiency. The Work Manager creates a thread pool in the W3WP.EXE process on each Grid Manager front-end (GFE). Each thread sits in a loop where it dequeues a job, finds an acceptable (target) machine to run it on, and assigns the job to that machine. The job agent service on the target machine acknowledges the assignment, downloads all files needed to execute the job, creates a new process for the job, and executes the job. The job agent maintains a thread on the target machine that monitors the local job until it finishes.
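A stripped-down sketch of that dequeue-and-assign loop is shown below; the job and target types are invented for illustration and omit all of the fairness, retry, and monitoring logic described above.

using System;
using System.Collections.Concurrent;
using System.Threading;

// Illustrative skeleton of a Work Manager style dequeue/assign loop (not the real scheduler).
public sealed record GridJob(Guid Id, string JobType, string TargetFarm);

public static class WorkerLoopSketch
{
    private static readonly BlockingCollection<GridJob> Queue = new();

    public static void Main()
    {
        // A small pool of threads on the front end, each pulling jobs and assigning them to targets.
        for (int i = 0; i < 2; i++)
            new Thread(WorkerLoop) { IsBackground = true }.Start();

        Queue.Add(new GridJob(Guid.NewGuid(), "UpgradeFarm", "Farm82037"));
        Queue.Add(new GridJob(Guid.NewGuid(), "ProvisionTenant", "Farm82137"));
        Queue.CompleteAdding();
        Thread.Sleep(500); // let the background workers drain the queue in this demo
    }

    private static void WorkerLoop()
    {
        foreach (var job in Queue.GetConsumingEnumerable())
        {
            // 1. Pick an acceptable target machine for the job (the farm name stands in for that logic here).
            string target = PickTargetMachine(job.TargetFarm);

            // 2. Hand the job to the job agent on that machine; the agent downloads the job package,
            //    starts a new process, and a monitoring thread watches it until completion.
            Console.WriteLine($"Assigned {job.JobType} ({job.Id}) to {target}");
        }
    }

    private static string PickTargetMachine(string farm) => farm + "-VM-01"; // placeholder selection
}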



5.1.2 Jobscon
Most actions carried out for deployment and provisioning are implemented either as PowerShell scripts
or managed code (C#) invoked by jobs dequeued by the Work Manager and executed on the target
resource via job agent.

Jobs are associated with a specific version of a job package. Therefore, per build copies of the scripts are
deployed in side-by-side directories in each datacenter. Most jobs will run the latest installed version
(Get-GridJobPackageVersionMap -LatestVersionOnly) but jobs in the LegacySPOJobs package commonly
pick the version of the script to execute based on the associated VM.

When a job fails, or the allotted time for the job expires without completion, the Work Manager will
typically retry the job, starting the job on the last step that was run. The Work Manager therefore relies
on jobs being idempotent. If the job repeatedly fails over an extended period of time, it will be
suspended. When a job is suspended it is no longer executed and is put on a list of jobs for the
engineering teams to examine.

Each job has an associated job type, which is a 16-character maximum name. For Powershell jobs this
matches the name of the script file. For managed jobs this is the name of the class or can be overridden
in ManagedJobOptionsAttribute. If the length exceeds 16 characters, it is truncated. Joblets are the third
type of Grid job. Internally this is a special type of Powershell job (InvokeJoblet) that hosts the Joblet
framework and exposes its own job-like API.

Jobs run 100% in job agent mode, meaning every job must have a target that identifies how to run the
job. In many cases the job system can infer the target based on the input parameters if the target is not
specified explicitly. There are three types of execution targets:

1. Machine. These always target a single machine (physical machine or virtual machine). The job
will always run on the specified machine. These types of jobs generally do things like system
configuration, troubleshooting/debugging, and installing or upgrading a component on that
machine.
2. Logical resources. These jobs operate against a particular resource, for example database,
tenant, or farm. The job framework will automatically select the appropriate machine targets
and load balance.
3. GFE. Jobs here generally fall into two categories: Orchestration jobs, which simply stated are
jobs that run a train or operation on a particular farm or zone resource. They create child jobs
on the child targets and monitor their progress. These orchestration jobs generally do not
interact with anything other than the GM web services and do not need to run on a specific
target. In a sense the GFE target functions as a default host for jobs. The second category is
high-privileged jobs. Jobs running on GFEs can access sensitive secrets that are not accessible
from other targets. For example, Azure resources are generally provisioned from the context of
a GFE so that the Azure admin credentials are not exposed to the content farm.

Jobs always run in domainless mode. In domainless mode, jobs run as a virtual account that is an administrator on the local box. The job will use certificate identities when communicating with GridManager (GM) and SharePoint. In domainless mode it is still possible for the machine to be domain joined, in which case the computer account (ComputerName$) is used for off-box calls while interacting with domain resources.



When a job is suspended the on-call team should receive an alert that includes the job type, which then allows the on-call team to execute a troubleshooting guide for that job type in order to determine why the job is repeatedly failing. Jobs that are not investigated in a timely manner will have their data lost and the job type disabled from being created.

Best Practices for Grid Manager Jobs

Sample PowerShell job scripts can be found at https://aka.ms/GridMgr?path=/src/grid/Scripts/GridManager/FrontEnd/Job .

A sample managed job can be found at https://aka.ms/GridMgr?path=/src/grid/JobPackages/GridMgrJobs

There are several best practices for writing Grid Manager jobs:

• Avoid writing new Powershell jobs if possible.


• Update the contact information (#FILE, #DEV, #TEST, #SUMMARY).
• Jobs must be idempotent.
• Design with the principles of recovery-oriented computing in mind as espoused by Dave
Patterson.
• Implement job steps, include appropriate sleep intervals between steps, and refine the sleep
interval over time.
• Monitor the effect the job has on the system including:
o Monitor the job creation, retry, and suspension rate of jobs to ensure it is healthy and
uses expected resources.
o Monitor Geneva dashboards for SQL and Networking load.
o Distinguish between transient errors where the Work Manager should retry the job, and
critical errors that mean the Work Manager should immediately suspend the job.
o Provide a troubleshooting guide for every job type so that the on-call engineer (OCE) can resolve issues for that job type without involving the engineering teams.
• Jobs can ask for more time to execute. Long running operations or steps must ask the Work
Manager for more time or risk termination. It is not recommended to ask for a large amount of
time upfront. If the job unexpectedly hangs the job system will not realize the job is stuck until it
expires. Use KeepAlive instead.
• Jobs may obtain locks on objects to ensure that no other script block that needs the same lock will execute concurrently.
• Be careful about locking objects. Consider whether checking object status could accomplish the
same goal.
• Ensure all locks are released in the jobs’ exception handler. Failing to release a lock will prevent
future jobs from accessing the locked entity.
• Ensure all objects and PowerShell sessions are released in order to avoid resource leakage.
• Do not call Set-GridJob directly. Instead use functions defined in GridJob.ps1 such as
RetryGridJob and AdvanceGridJob
• Be aware that every time a job retries or transfers it is run in a new process. Process state is not maintained between retries, and a job in the suspended state does not imply the process is physically suspended in memory. To save state between retries use the job property bag (see the sketch after this list). Do not put initialization code in step 1 (unless the job has just one step); if the job advances past this step and retries, the initialization code will not run again.
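To make the step and property-bag guidance concrete, here is a rough skeleton of a managed job written against a hypothetical framework; the base class, KeepAlive call, and step loop are placeholders that mirror the concepts above, not the real GridMgrJobs contract (see the sample links for that).

using System;
using System.Collections.Generic;

// Hypothetical framework types standing in for the real managed-job contract.
public abstract class ManagedJobBase
{
    public IDictionary<string, string> PropertyBag { get; } = new Dictionary<string, string>();
    protected void KeepAlive() { /* ask the Work Manager for more time during long steps */ }
    public abstract void ExecuteStep(int step);
}

// Sketch of an idempotent multi-step job. Each step can be retried in a brand new process,
// so any state that must survive a retry goes into the property bag, never into plain fields.
public sealed class ProvisionDatabaseJob : ManagedJobBase
{
    public override void ExecuteStep(int step)
    {
        switch (step)
        {
            case 1:
                // Idempotent: re-running this step reuses the name recorded in the property bag.
                if (!PropertyBag.ContainsKey("DbName"))
                    PropertyBag["DbName"] = "ContentDb_" + Guid.NewGuid().ToString("N");
                break;
            case 2:
                // Long-running work periodically asks for more time instead of a huge upfront budget.
                KeepAlive();
                Console.WriteLine($"Applying schema to {PropertyBag["DbName"]}");
                break;
            case 3:
                Console.WriteLine("Verifying, and releasing any locks in the exception/finally path.");
                break;
        }
    }
}

public static class JobDemo
{
    public static void Main()
    {
        var job = new ProvisionDatabaseJob();
        for (int step = 1; step <= 3; step++) job.ExecuteStep(step); // the real framework drives this loop
    }
}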

5.1.3 Infrastructure Manager


The infrastructure is split into three main components: the Topology, Machine and Database (DB) managers. All of these managers store their data in the GridMachine database.

The Topology Manager is used to keep track of the high-level logical records of the service. This includes
the region or geography of the hardware, the datacenter, the zones and networks, and the farm objects.
Farms belong to a network; networks belong in zones and zones reside in a datacenter, and each
datacenter lives in a region.

The Machine Manager keeps track of the machines, both physical and logical. It records the physical machines and their physical groupings in datacenters, the groupings of virtual machines into farms and farms into stamps. The Machine Manager typically does not have an understanding of the specific functions of each machine in the topology beyond its role; see section 4.3 (Virtual Machine Topology) for more information about roles. With a few exceptions, virtual machine configuration scripts are the only entities in the service that understand what services each role actually runs.

During deployment, the Machine Manager is responsible for picking the p-machine to host a new v-machine. The v-machine placement logic picks the host by first applying a set of constraints to reduce the set of possible p-machines, and then applying a ranking order to pick the optimal one. The logic emphasizes the high availability of v-machines by ensuring that any two VMs of the same role are spread across failure domains, typically server racks. This ensures that when all machines in a single failure domain fail, a farm never loses all machines of a given role. The logic also attempts to maintain fast network connections amongst the v-machines in the farm.
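A toy version of that constrain-then-rank selection is sketched below; the constraints (capacity and spreading a role across racks) and the ranking (most free memory) are simplified stand-ins for the real placement logic.

using System;
using System.Collections.Generic;
using System.Linq;

// Toy constrain-then-rank placement sketch (simplified stand-in for the Machine Manager's logic).
public sealed record PMachine(string Name, string Rack, int FreeMemoryGb, List<string> HostedRoles);

public static class PlacementSketch
{
    public static PMachine PickHost(IReadOnlyCollection<PMachine> candidates, string newVmRole, int requiredMemoryGb)
    {
        // Anti-affinity constraint: spread VMs of the same role across failure domains (racks).
        var racksAlreadyHostingRole = candidates
            .Where(p => p.HostedRoles.Contains(newVmRole))
            .Select(p => p.Rack)
            .ToHashSet();

        return candidates
            .Where(p => p.FreeMemoryGb >= requiredMemoryGb)          // capacity constraint
            .Where(p => !racksAlreadyHostingRole.Contains(p.Rack))   // failure-domain constraint
            .OrderByDescending(p => p.FreeMemoryGb)                  // ranking: most headroom first
            .FirstOrDefault()
            ?? throw new InvalidOperationException("No physical machine satisfies the placement constraints.");
    }

    public static void Main()
    {
        var fleet = new List<PMachine>
        {
            new("PM01", "Rack1", 64,  new List<string> { "USR" }),
            new("PM02", "Rack2", 128, new List<string> { "BOT" }),
        };
        Console.WriteLine(PickHost(fleet, "USR", 32).Name); // PM02, because Rack1 already hosts a USR
    }
}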

In a similar fashion the Database Manager picks the appropriate SQL azure location for a new database.
This logic is based on the utilization of the SQL servers in the stamp and the expected IO and capacity
needs of the new database.

5.1.4 Tenant Manager


While the Work Manager and Infrastructure have relatively little logic specific to SharePoint, the Tenant
Manager contains significant amounts of this. The Tenant Manager is responsible for placement of
tenants into SPODS farms, content farms, and content databases. This information is stored in the
GridTenant database.

When tenants are created, they are placed in datacenters in the region requested by the user when
creating the tenant as part of the Office 365 offering. Within the datacenter they are placed on any farm
that is marked in the Grid Manager as being open to new tenants. Farms are closed when they are full,
or because there is another reason that Ops has decided to not place new tenants in that farm such as
an impending upgrade.

Placing tenants is complicated by the variety of sizes of tenancies, from a few users to 20,000, and by
the way in which new tenants appear. When a new tenant appears, there is little or no information on
their eventual size. Therefore, placement decisions are made before the service understands how large
the tenancy will get. This leads to conservative decisions about when to close farms.



5.1.5 Security and Secrets Manager
The Security Manager stores secrets such as usernames/encrypted passwords and certificate information in the GridSecret database; each secret must be associated with a scope such as Global or per-farm. All secrets need to be able to support rotation at a moment's notice. All clients of these secrets must support being able to use the current and the new secret at any given time.

Certificates are the preferred way to communicate between servers since they are more reliable and
easily revocable if the secret is compromised.

For jobs, job authors must notify the system at job creation time which secret scopes are needed. The secrets for jobs are specially encoded and have only a five-minute lifespan for retrieval during job execution. A job asking for the secrets after five minutes will result in an error.

5.1.6 Change Tracking


Change tracking is used to help determine what potential actions occurred to help investigate an issue in
production. Events such as upgrading, security patching, user introduced actions, and so forth are
tracked. This data lives in the GridTracking database and can be retrieved using Get-GridChange and
Get-GridObjectChange.




5.1.7 Traffic Load Balancing and Multi-GM Farms Infrastructure

The diagram above briefly shows our current Grid Manager traffic workflow. In each Grid Manager farm we have an independent pair of physical load balancers, which route the incoming traffic to GFEs in round-robin order. Our local DNS server stores all farm load balancers’ IP addresses, so that when a client calls Grid Manager via a CNAME (gme2.yloppe.msoppe.msft.net for example), it will reach one of the farms. If for some reason the farm is not available, the client is still able to use the service by re-querying the DNS server to call a separate Grid Manager farm.

A load balancer monitor is in place to check each GFE’s health. Every 5 seconds the load balancer sends a simple HTTP GET request to each GFE. If a GFE does not respond, the load balancer automatically takes that GFE out of traffic so that clients do not reach any bad GFEs. Whenever the GFE is healthy again it will start taking traffic automatically. If the monitor detects too many unhealthy GFEs it will trigger an alert so that our on-call engineer can start investigating the issue.



5.1.8 Data Caching
In the Grid Manager service there are several caching techniques used to increase reliability.

At the SQL layer, GridManager uses a third read-only replica in addition to the primary/failover.

The GFEs use a pluggable data cache interface that supports any serializable object type and multiple caching options. The current implementation uses an in-memory cache, Azure Redis Cache, and local disk. The Redis cache layer allows the GFEs to store cached data across several machines while avoiding potentially high latency for cross-farm/region SQL queries. If for any reason the Azure Redis cache is down, the GFE falls back to a local in-memory cache temporarily and switches back to the Redis cache whenever it is available. Objects like secrets have the option to only use in-memory caching with encryption.

Currently, clients such as jobs use an in-memory local cache and can notify the server which cache the call prefers, such as the GFE cache or the SQL replica.
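The fallback behavior can be pictured as a chain of cache layers tried in order, where a failing layer is treated as a miss. The sketch below captures that idea with invented interfaces; the actual GFE cache plug-in model and Redis wiring are internal.

using System;
using System.Collections.Concurrent;

// Illustrative layered-cache sketch (invented interfaces; not the real GFE pluggable cache).
public interface ICacheLayer
{
    bool TryGet(string key, out string value);
    void Set(string key, string value);
}

public sealed class InMemoryCache : ICacheLayer
{
    private readonly ConcurrentDictionary<string, string> _map = new();
    public bool TryGet(string key, out string value) => _map.TryGetValue(key, out value);
    public void Set(string key, string value) => _map[key] = value;
}

public sealed class LayeredCache
{
    private readonly ICacheLayer[] _layers;
    public LayeredCache(params ICacheLayer[] layers) => _layers = layers;

    // Each layer is tried in order; if the distributed (Redis) layer throws or misses,
    // the local in-memory layer can still serve the value, and vice versa.
    public string GetOrAdd(string key, Func<string> loadFromSql)
    {
        foreach (var layer in _layers)
        {
            try { if (layer.TryGet(key, out var hit)) return hit; }
            catch { /* treat a failing layer (e.g. a Redis outage) as a miss and keep going */ }
        }

        var value = loadFromSql();
        foreach (var layer in _layers)
        {
            try { layer.Set(key, value); } catch { /* best-effort back-fill */ }
        }
        return value;
    }
}

public static class CacheDemo
{
    public static void Main()
    {
        var cache = new LayeredCache(new InMemoryCache(), new InMemoryCache() /* stand-in for Redis */);
        Console.WriteLine(cache.GetOrAdd("farm:82037", () => "loaded-from-sql"));
        Console.WriteLine(cache.GetOrAdd("farm:82037", () => "should-not-be-called"));
    }
}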

5.2 Deployments
5.2.1 Upgrades

5.2.2 General Upgrade


The primary upgrade mechanism by which SPO is upgraded is the weekly GU (originally ‘Gemini Update’,
now called ‘General Update’). The GU starts with a code repository fork on Tuesday evenings at 7PM
Pacific Time. This build is then rolled out to the WW fleet over the next 14 to 17 days, in accordance
with Safe Rollout Policies. The first waves are EDog and SPDF, then 50% of MSIT. After a proving period, the rest of MSIT is deployed. Following another proving period, the build is progressively rolled out to the rest of the WW fleet, hitting 95% saturation in 14-17 days, and 100% saturation in about 28 days.



Upgrade phases:
In each farm that is undergoing upgrade, the rollout consists of several phases, described below:

• Preloading – all of the files and other artifacts needed for upgrade are copied onto each VM in the farm. This is done without taking the VM out of service rotation, as it has minimal impact on the operation of the VM.
• InPlaceUpgrade (IPU) – also known as patching. In this phase, the VM is taken out of service, new binaries are copied into place, and the VM is restarted and placed back into service. This is done on small batches of VMs on a rotating basis, and during off-peak hours to minimize operational impact.
• Full-Uber – during this phase, all of the SharePoint upgraders and other maintenance processes are run. This is where SharePoint objects and configs get updated.
• DOU (Database Only Upgrade) – This is the final phase of upgrade, where the SQL databases for the farm are upgraded. In this phase, schema changes, SPROC updates and all other SQL-specific changes are rolled out.

5.2.3 Hotfix deployments


Hotfix deployments are targeted fix releases. Hotfixes are checked into dedicated branches for the
release they are fixing and deployed via a process that uses just the IPU mechanism that is also part of
GU. As such, they are binary-only. They are delivered on an accelerated schedule that hits 95%
saturation in 48 hours, and 100% saturation in 96 hours.

5.2.4 SecurityPatch and UpdateMonitors


The Security Patch and Update Monitor orchestrators are special purpose systems to deliver Security
and Monitoring updates to the entire fleet, including Physical Machines.

The Security Patch orchestrator runs once each month, starting on the 2nd Tuesday of the month (“Patch Tuesday”), and hits 95% saturation in around 14 days, with full fleet coverage in 28 days.

The Update Monitors orchestrator runs every 2 weeks and delivers monitoring and other operability upgrades to the fleet.

These orchestrators run on a Zone basis, as they update both Physical and Virtual Machines.

5.2.5 Loop Builds


“Loop” builds are interim builds that are produced from the main branch of the monolith repo. These builds are produced every 3 hours (unless there are no changes in the repo, which is rare). These builds are deployed to EDog and SPDF immediately via a patching orchestrator (i.e. binaries only, no DOU). This allows developers to exercise their changes in EDog and SPDF almost immediately.



Chapter 6 Traffic Routing
Every customer’s data is hosted in one or more content farms. When a customer’s tenancy is provisioned, they will be given a domain name such as Contoso.sharepoint.com. Every file or list belonging to the customer is referenced by a URL which starts with that domain name. Customers can use such URLs to access data in many ways, including but not limited to:

• Enter a URL in the browser


• Enter a URL into an office rich client
• Sync client

In all cases, a customer’s request is turned into one or many HTTP requests by the app being used. What happens next is:

1. The app does a DNS query on the customer’s domain name and gets an IP address which points either to Microsoft Azure Front Door (AFD) or to the endpoint of an SPO content farm
2. The HTTP requests are sent via an SSL connection to that endpoint
3. If the endpoint is AFD, the request is forwarded to the correct SPO content farm via a warm connection. AFD acts as a reverse proxy and relays the response back to the customer
4. The content farm load balancer receives the request and picks one USR role to serve it
5. The USR machine terminates the SSL connection and calls the Config/Sitemap service in the farm, which maps the URL to a database
6. The USR machine does a series of queries to the database and, if necessary, also calls the Azure Blob service to load data. The USR machine assembles the response and sends it back to the customer via Direct Server Return.

The next few sections will examine these steps in more detail.

6.1 SPO DNS structure


Let’s say that a user is trying to access this file in the browser: https://microsoft-my.sharepoint.com/personal/ziyiw_microsoft_com/Documents/UnderstandingSharePointOnline.docx. The browser will send an HTTP GET request for this document. But before an HTTP request can go anywhere, the browser needs the public Domain Name System (DNS) servers to translate the Fully Qualified Domain Name (Microsoft-my.sharepoint.com). The translation result is far from simple1. One can use NSLOOKUP or DIGWEBINTERFACE to resolve Microsoft-my.sharepoint.com, and will see a result similar to the table below.

1 Customers in general prefer a simpler DNS structure. The complex CNAME chain of SPO DNS records is something we are looking to simplify in the future.



No  Source                                                                     TTL   Type   Target
1   microsoft-my.sharepoint.com                                                3600  CNAME  microsoft.sharepoint.com
2   microsoft.sharepoint.com                                                   3600  CNAME  121-ipv4v6e.clump.msit.aa-rt.sharepoint.com
3   121-ipv4v6e.clump.msit.aa-rt.sharepoint.com                                60    CNAME  82037-ipv4v6e.farm.msit.aa-rt.sharepoint.com
4   82037-ipv4v6e.farm.msit.aa-rt.sharepoint.com                               60    CNAME  82037-ipv4v6e.farm.msit.sharepointonline.com.akadns.net
5   82037-ipv4v6e.farm.msit.sharepointonline.com.akadns.net                    300   CNAME  82037-ipv4v6.farm.msit.aa-rt.sharepoint.com.dual-spo-0003.spo-msedge.net
6   82037-ipv4v6.farm.msit.aa-rt.sharepoint.com.dual-spo-0003.spo-msedge.net   300   CNAME  dual-spo-0003.spo-msedge.net
7   dual-spov-0005.spov-msedge.net.                                            300   A      13.107.136.8, 13.107.138.8

Each row in this table represents a step in resolving the domain name to its destination. A CNAME record maps one domain to another. An A record maps a domain name to an IP address. The TTL column stands for “Time To Live”. It specifies how long client computers can assume a given record will remain constant. Typical TTLs range from a few seconds to a few hours. Usually, SPO sets a relatively high TTL value if the record is not expected to change often, while a low TTL value indicates the record is expected to change frequently. For example, records no. 4 and 5 in the above table show the domain names which will change value during a “failover”, which SPO uses to recover from any farm, database or network outage. Once this happens, the customer needs to be aware of the new destination ASAP. Hence such a record should be kept for no more than one minute in the local DNS cache.

There is a purpose for each of the redirections in the DNS resolution:

Record 1-2: These records consolidate the various SPO services, such as ODB, Tenant Admin, and Team sites, into one canonical domain name. These records are created at tenant provisioning time and are hosted in the Azure DNS service.

Record 3-4: These records first redirect an individual tenant record to a clump of databases, and then further redirect to the farm hosting that database clump2. These records help facilitate database- or farm-level failovers. These records are also hosted in the Azure DNS service.

2 The database clump concept is explained in more detail in the “Storage” chapter.



Record 5-7: The first row is a GTM record hosted in the Akamai service. Its purpose is to determine whether the request should go through Azure Front Door3 or be sent directly to the farm. In the former case, records 6 and 7 are hosted in the Microsoft Azure Front Door service, from which the AFD service can find out which SPO farm it should forward the request to. In the latter case, the last 2 rows of the DNS resolution result will look like the following table, which brings the request directly to an SPO farm.

No  Source                                                     TTL   Type   Target
6   82037-ipv4v6e.farm.msit.sharepointonline.com.akadns.net    300   CNAME  82137-ipv4v6.farm.msit.aa-rt.sharepoint.com
7   82137-ipv4v6.farm.msit.aa-rt.sharepoint.com                3600  A      40.108.223.53

The overall SPO DNS hierarchy and how it integrates with AFD (Edge) is illustrated in the figure below.

Figure 6-1 SPO DNS hierarchy

6.2 Inbound Routing


Some client computers are connected directly to the internet. Others live inside private networks, which may have Network Address Translation (NAT) enabled, custom DNS servers, and proxy servers. Proxy servers allow controlled access from within a private network to the internet. Traffic intended for the internet is routed to the proxy server by private DNS servers within the private network. The proxy server accepts the connection and then creates a new outbound connection to the internet as a proxy for the original request. Therefore, traffic from many client computers behind a proxy server will appear to have the same original IP address (or small set of IP addresses).

3 Sometimes it is preferable that the traffic does not go through AFD. One such example is a microservice calling SPO from a location which is closer to the farm itself than the Edge box.

Tenant-level DNS records dictate which endpoint a customer connects to – AFD or Direct. By default, most customers are routed to AFD endpoints for better performance (TCP proxy).

Every SharePoint FQDN will be resolved by DNS to an A or A/AAAA record, which represents the IP address of the load balancer devices. Each content farm utilizes a hardware load balancer device (NLB) to accept incoming requests. Most of the NLBs used in SPO farms today are manufactured by F5 Networks in the BIGIP product family. The NLB maintains a list of available content front ends (USRs) for each farm, called a pool. The NLB performs load balancing at L4 (TCP) to USRs via a basic round-robin algorithm (non-persistent). When an inbound request arrives, the NLB selects a USR from the pool and routes the request to it. The NLBs are not programmed to enforce any user, tenant, or session affinity; thus, requests from a single user session could potentially be routed to different USRs within the content farm. TCP and SSL sessions are terminated on the USR VMs; there is no TCP tuning, auth, or certs on the NLB. The NLB probes VM health via Circuit Breaker monitors. The return traffic bypasses the NLB, aka Direct Server Return. More information about SPO DNS routing can be found here.

6.3 Outbound Routing


When network traffic leaves an SPO virtual machine, it will take one of the following 3 paths:

1. The traffic goes through the network load balancer device, and the source IP will be replaced with a public SNAT address. This path is used when the VM is trying to reach Internet-routable IPs. This route is shown by the red arrows in the graph.

2. The traffic goes to the GMR/L3AGG router directly, and it will not use any SNAT address. As a
result, such traffic can only be routed within the Microsoft network. The traffic which chooses
this path typically reaches one of the following destinations. This route is shown by the blue
arrows in the graph
a. Azure Storage/SQL/Redis for Content farms
b. SPO Backend blocks/IPs (server-to-server)
c. SPO MGMT blocks/IPs (DPROD MGMT)
d. Secure Workload Environment (SWE)
e. Time targets (NTP)

3. The return traffic for end-user requests goes to the router directly. It uses the Direct Server Return configuration on the web server so that the return packets do not go through the network load balancer. The kernel driver on the web server has logic to replace the destination MAC to forward traffic to the L3AGG switch instead of the network load balancer.

One important reason there is more than one outbound routing path is that the network load balancer has limited bandwidth. It would quickly become a bottleneck if all the outgoing traffic were to go through it. The L3AGG router has a much higher bandwidth than the network load balancer.

6.4 Azure Front Door


Azure Front Door (AFD) is a modern cloud content delivery network (CDN) service that delivers high
performance, scalability, and secure user experiences for SPO requests. The main design goal is to
connect the end users to the closest web servers to reduce latency.

AFD uses anycast to route end users’ traffic to the closest web server farms. Many SPO customers have a global presence, and their users are located across many countries. The end users connect to the closest AFD environment.

When the AFD web server farms receive end user requests, AFD forwards the requests to the SPO content farms.



For a new connection, AFD reduces the end user’s connect time and SSL time. AFD keeps the connection
pools warm to SPO.

(Diagrams: request without AFD vs. request with AFD)

AFD allows the network connection to recover faster if there are packet drops in the end user connection, and it improves the overall throughput of any file operation.

Most requests going through AFD are forwarded to SPO without being cached. Some content requests can be cached for a limited time, and AFD returns the cached content without forwarding the request to SPO. SPO uses signed parameters in the URL to control whether content is cached and for how long. SPO signs the URL parameters using a private key, and AFD uses the public key to validate that the parameters were signed by SPO.
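A simplified version of such a sign-then-verify exchange over query parameters is sketched below using .NET's RSA APIs; the parameter names and scheme here are illustrative only, not the actual SPO/AFD contract.

using System;
using System.Security.Cryptography;
using System.Text;

// Illustrative sign/verify of cache-control URL parameters (not the actual SPO/AFD signing scheme).
public static class SignedUrlSketch
{
    public static void Main()
    {
        using var spoPrivate = RSA.Create(2048);            // the service side holds the private key
        using var afdPublic = RSA.Create();                 // the edge side holds only the public key
        afdPublic.ImportRSAPublicKey(spoPrivate.ExportRSAPublicKey(), out _);

        // Parameters that control caching behavior, e.g. how long the edge may cache the content.
        string cacheParams = "docid=abc123&maxage=3600";
        byte[] signature = spoPrivate.SignData(Encoding.UTF8.GetBytes(cacheParams),
                                               HashAlgorithmName.SHA256, RSASignaturePadding.Pkcs1);

        string url = $"https://contoso.sharepoint.com/file?{cacheParams}&sig={Convert.ToBase64String(signature)}";

        // The edge honors the caching hints only if the parameters were genuinely produced by the service.
        bool valid = afdPublic.VerifyData(Encoding.UTF8.GetBytes(cacheParams), signature,
                                          HashAlgorithmName.SHA256, RSASignaturePadding.Pkcs1);
        Console.WriteLine($"{url}\nsignature valid: {valid}");
    }
}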

6.5 Global Traffic Management


Global Traffic Manager (GTM) provides control of distributing incoming application traffic across the distributed set of SPO endpoints. There are multiple endpoints for end users and applications to connect to SPO, for the purpose of redundancy. SPO distributes this end user traffic to the optimal endpoints.

SPO uses Akamai as the GTM for the following purposes:



1. Regional failover
2. AFD bypass for server to server

Based on the current configuration, the Akamai GTM entry will conditionally return a DNS entry that routes the traffic either directly to the farm or to AFD.

This table shows the 2 options for the GTM DNS entry, which conditionally returns a destination of either the farm-direct or the AFD endpoint.

No  Source                                                     TTL   Type   Target
5a  82037-ipv4v6e.farm.msit.sharepointonline.com.akadns.net    300   CNAME  82037-ipv4v6.farm.msit.aa-rt.sharepoint.com  (Farm direct)
or
5b  82037-ipv4v6e.farm.msit.sharepointonline.com.akadns.net    300   CNAME  82037-ipv4v6.farm.msit.aa-rt.sharepoint.com.dual-spo-0003.spo-msedge.net  (AFD)

For regional failover, SPO defines which countries return farm direct and which return AFD. For example, if SPO knows the United Kingdom’s AFD nodes cannot route traffic, SPO can do a regional failover of the United Kingdom to let United Kingdom users connect directly to the farm while the rest of the world continues to connect through AFD.

For AFD bypass, SPO has defined an AFD bypass Classless Inter-Domain Routing (CIDR) map. Based on the IP subnet of the Azure network, the Azure DNS servers return a farm-direct endpoint for server-to-server traffic.

6.6 Microservice Traffic Routing


SPO uses microservices to enhance new scenarios and new processing capabilities for existing data
stored in SPO. Each microservice is a unit of feature extension for each piece of data stored in SPO.

For example, the meta microservice is used to process different encodings of video files. The push microservice is used to send notifications to the end client.

Each microservice has its own DNS domain and server capacity in Azure. The end user authenticates against each microservice endpoint, and that authentication then carries over to SPO to access the user’s data.



Chapter 7 Request Processing
As SharePoint Online has evolved into an internet-scale service, the service has been simplified to
support this larger scale. That simplification means a reduction in the number of web server roles used
in the service to support customer requests. The role that serves all customer requests in SPO is called
the USR role.

7.1 Request Process Pipeline


The USR role contains a set of windows services needed to process end user requests. This includes
some supporting services like the timer service, but the most fundamental is the content web service,
which is an IIS-based web server running ASP.NET version 4.0.

The easiest way to understand the USR role and the flow of a request is to walk through what happens
during a request. Some of these areas will be mentioned here and covered in more detail later in the
document.

Requests are processed in a series of event callbacks between the layers of the architecture. After requests are dequeued from the ASP.NET queue, a series of SharePoint and platform modules are called to handle these events, after which execution returns to the underlying platform to handle core functions such as serving file contents. A small portion of the file contents are served from static files that live on the USR role, such as those under the _layouts folder. But the vast majority of files are served as virtual files, where the content is read from SQL and the Blob service and assembled on the fly.

Figure 7-1 shows a high-level relationship of the components running on a USR machine, where the
system components and SPO components are depicted in different colors. Figure 7-2 shows more details
of the important steps happening in the Core SharePoint module (SPRequestModule)

Figure 7-1 SPO request processing arch



Some key events in one of the core modules (Request Processing) are summarized below to illustrate
this flow. These are a great place to start stepping through any stage in the request to better understand
the program flow.

• Begin Request – Initializes logging and metrics; rewrites some URL paths; initial throttling.
• Authenticate Request – See Chapter 8 Identity and Compliance below.
• Post Authenticate Request – User/App-quota based throttling; additional checks requiring a known user.
• Post Authorize Request – Rewrites URL paths for routing to the correct ASP.NET handler. With some exceptions (static files, SOAP requests, etc.), this will usually be the native code handler that serves content out of the SharePoint Content Store.
• Pre Send Headers – Adds SharePoint-specific HTTP headers to be sent to the client.
• Error – Intercepts unhandled exceptions and writes an error page, if necessary.
• End Request – Tallies and logs resources used; finalizes logging.

Figure 7-2 SPO request processing modules
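For orientation, this is roughly what a module hooking those pipeline events looks like in classic ASP.NET; the module below is a bare illustration registered like any other IHttpModule, not SPRequestModule itself.

using System;
using System.Web;

// Bare-bones ASP.NET module hooking the pipeline events described above (not SPRequestModule).
public sealed class RequestLoggingModule : IHttpModule
{
    public void Init(HttpApplication app)
    {
        app.BeginRequest += (s, e) =>
        {
            // Begin Request: initialize per-request logging/metrics; URL rewrites and initial throttling happen here.
            app.Context.Items["RequestStartUtc"] = DateTime.UtcNow;
        };

        app.PostAuthenticateRequest += (s, e) =>
        {
            // Post Authenticate Request: quota-based throttling and checks that require a known identity.
        };

        app.PreSendRequestHeaders += (s, e) =>
        {
            // Pre Send Headers: add service-specific response headers.
            app.Response.AddHeader("X-Sample-RequestId", Guid.NewGuid().ToString("N"));
        };

        app.EndRequest += (s, e) =>
        {
            // End Request: tally resources used and finalize logging.
            var started = (DateTime)app.Context.Items["RequestStartUtc"];
            app.Context.Trace.Write($"Request took {(DateTime.UtcNow - started).TotalMilliseconds:F0} ms");
        };
    }

    public void Dispose() { }
}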

7.2 Request Attribution


Each request contains a set of information, including request headers, authentication tokens, and other information that allows us to identify a request, its caller, and the caller’s intent. This identification is used for observability (telemetry, monitoring and alerting), business intelligence reporting, and COGS tracking/reporting. These attributes are also propagated to our telemetry systems to allow for consistency in how we manage and view request information.

Partners can help identify themselves and their intent for making requests by using the request attribution guidelines found at RequestType Dev Design.docx (sharepoint-df.com)



For example: OneDrive sync client attributes requests made while opening a cloud file on a
user’s device with 3 request headers: Application: OneDriveSync, Scenario: ActiveHydration,
ScenarioType: AUO. Telemetry generated using these request attributes is here.
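A caller decorating its SPO requests with those attribution headers might look like the snippet below; the header names follow the example above, while the URL and the rest of the call are hypothetical, and the full guideline lives in the linked design doc.

using System;
using System.Net.Http;
using System.Threading.Tasks;

// Sketch of a client attributing its requests (header names follow the example above).
public static class AttributedClient
{
    public static async Task Main()
    {
        using var http = new HttpClient();
        using var request = new HttpRequestMessage(HttpMethod.Get,
            "https://contoso-my.sharepoint.com/personal/user/Documents/report.docx");

        // Identify the application, the scenario, and the scenario type for this call.
        request.Headers.Add("Application", "OneDriveSync");
        request.Headers.Add("Scenario", "ActiveHydration");
        request.Headers.Add("ScenarioType", "AUO");

        using var response = await http.SendAsync(request);
        Console.WriteLine((int)response.StatusCode);
    }
}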

7.3 Service Health Protection Throttling


The Service Health Protection engine throttles incoming request traffic using health scores of various components of the SPO stack. A health score is a measure of the (un)healthiness of a component such as a USR, Content DB, Sitemap DB, or the network. Health throttling uses a priority system to minimize the impact on end user experiences. It starts throttling less time-sensitive workloads first, such as background operations, and gradually escalates towards end user traffic if the service health continues to degrade. Various request attributes such as the URL, headers, identities from auth tokens, and caller-provided request attribution (refer to the Request Attribution section for more info) are used for classifying incoming requests.

At a very high level, workloads are classified as follows for the purpose of health protection
throttling:
Health Throttling Priority | Classification | Workloads
Minor | Background activity that is least time sensitive | Backup application requests, Data Loss Prevention (DLP) application requests, internal system jobs processing async work units, specific scenarios from 1P apps such as sync verification or profile photo refresh, etc.
Major | Relatively less time-sensitive end-user traffic | Migration application requests, OneDrive Sync client uploads and downloads, Camera Roll backup from OneDrive Mobile clients, etc. All end-user traffic is classified in this category by default.
Critical | Any load (including time-sensitive, end-user-experience-impacting traffic) | Any and all API calls; for example, time-sensitive user-interactive flows such as opening a file in Office Online, or on-demand download of a placeholder file by the OneDrive sync client when the user double-clicks a placeholder file stub.
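To illustrate how these priorities interact with a worsening health score (1 = healthiest, 10 = worst; see the Circuit Breaker section below), here is a deliberately simplified sketch. The thresholds and function are hypothetical and are not the production throttling engine:

    // Hypothetical thresholds for illustration; the real engine uses many more signals.
    enum HealthThrottlingPriority { Minor, Major, Critical }

    static class HealthProtection
    {
        // healthScore: 1 = healthiest, 10 = worst.
        public static bool ShouldThrottle(int healthScore, HealthThrottlingPriority priority)
        {
            if (healthScore >= 9) return true;                                          // even Critical traffic
            if (healthScore >= 7) return priority != HealthThrottlingPriority.Critical; // Minor and Major
            if (healthScore >= 5) return priority == HealthThrottlingPriority.Minor;    // background work only
            return false;                                                               // healthy: nothing throttled
        }
    }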

7.4 Quota Rate Limits


SharePoint applies rate limiting per entity, such as users, apps, and tenants, to ensure fair and
reasonable use of shared service resources. Rate limits can be applied over different time periods,
such as minutes, hours, or days, and to different types of resources, measured either by API request
rate, bytes transferred, or internal service resources (such as CPU and SQL resources). These
resource limits are also referred to as quotas.



Currently, there are 3 general scopes of rate limits:

Per User quota: when a request carries a User auth context, it is subject to User quota throttling.
Each authenticated user is given limited resources in a time interval. The quota is shared among all
the apps that a user runs at the same time, and users usually only run one app within a short time
window. This quota applies to all user requests, including user-only, 1st party, and 3rd party.

Per App per Tenant quota: when a request contains an App auth context, it is subject to Tenant+App
quota throttling. This includes both App-only traffic and App+User traffic. Each app has its own quota
per tenant, which scales with the licenses the tenant has purchased. One tenant using an app does not
affect other tenants using the same app. Within a tenant, for App+User traffic, users can affect each
other because the app-level quota is shared between users within the tenant; the per-user quota
mentioned above helps mitigate this.

Per Tenant for all Apps quota: this quota currently applies only to 3rd party app-only traffic. It
limits the total 3rd party app-only usage per tenant and also scales with the licenses the tenant has
purchased. 3rd party app-only apps require the tenant admin's consent before they can run on behalf of
the tenant. SharePoint allows a tenant to run 3-4 high-throughput 3rd party services at the same time,
and these apps are contained by their own quotas (the per App per Tenant quota). However, if a tenant
runs too many high-throughput 3rd party services at the same time, the aggregated load from those
services may trigger this quota.

More information about rate limiting is available at Throttling - Substrate Dev Center
(microsoft.net)
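When a caller exceeds any of these quotas, the request is rejected with a throttling response (typically HTTP 429) that carries a Retry-After header. The sketch below is an illustrative client-side pattern for honoring it, not official guidance or the service implementation:

    using System;
    using System.Net;
    using System.Net.Http;
    using System.Threading.Tasks;

    static class ThrottleAwareCaller
    {
        // Retry a request when the service signals throttling, honoring Retry-After.
        // A request factory is used because an HttpRequestMessage cannot be sent twice.
        public static async Task<HttpResponseMessage> SendWithRetryAsync(
            HttpClient client, Func<HttpRequestMessage> makeRequest, int maxAttempts = 3)
        {
            for (int attempt = 1; ; attempt++)
            {
                HttpResponseMessage response = await client.SendAsync(makeRequest());
                bool throttled = (int)response.StatusCode == 429 ||
                                 response.StatusCode == HttpStatusCode.ServiceUnavailable;
                if (!throttled || attempt == maxAttempts)
                    return response;

                // Honor the server-provided delay; fall back to a conservative default.
                TimeSpan delay = response.Headers.RetryAfter?.Delta ?? TimeSpan.FromSeconds(30);
                await Task.Delay(delay);
            }
        }
    }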

7.5 Circuit Breaker


Circuit Breaker (CB) is an essential component in effectively handling user traffic in a farm. Circuit
Breaker is a Windows service called HealthMonitor.exe that is deployed on every VM (e.g., USR, BOT) in
the system. It works by assessing the health of each machine in a farm and assigning one of three
states based on the findings: Healthy, Sick, or Dead. Healthy machines stay in rotation, sick machines
are taken out of rotation to be healed, and dead machines are taken out of rotation to be harvested
(i.e., the VM is replaced with a brand new VM). However, CB always ensures that enough machines are
left in rotation, so we don't take so many machines out of rotation that we destabilize the farm
further.

Main components in Circuit Breaker:


1. Health Checks: The HealthMonitor.exe service runs every 5 seconds on every VM in the system and
validates the system by running several checks. These checks run as part of "Probes"; the list of
probes/checks appears later in this section. At the end of the health checks, machines are tagged
with one of the following states:
a. Healthy (good to stay in rotation)
b. Sick (needs to be taken out of rotation)
c. Dead (needs to be harvested)



d. Sleep (a special state used for True sizing. It keeps the VM out of rotation, but health
checks are still performed against it to keep it ready to go back into rotation.) Only a
small subset of farms use this feature and state.
2. Healing Actions: CB takes the following healing actions on machines: restart IIS, restart the VM,
or replace the VM (i.e., harvesting).
3. Rotation Manager:
a. Rotation Manager communicates with the F5 devices connected to the farm via
HealthStatus.aspx. This page communicates whether to keep a machine in rotation or not;
F5 pings this page every 15 seconds.
b. All gestures that require a VM to be taken out of rotation (e.g., deployment, harvesting)
need to go through Circuit Breaker. SetVMStateAtCB/Set-GridVMStateAtNLB is called to
change the state of the machine before taking any action on it. Failure to get a response
from that API means the machine cannot be taken out of rotation.
4. SafetyNet:
a. Circuit Breaker always makes sure enough machines are in rotation. To ensure there is
always a sufficient number of machines left in rotation, a feature named SafetyNet defines
a threshold for how many machines can be taken out of rotation in a farm at a time.
Because of this, a few non-healthy machines may still remain in rotation.
b. SafetyNet has peak-time and off-peak values, since more machines can be taken out during
off-peak hours. The current peak-time SafetyNet value is 93% (meaning 7% of machines can be
taken out of rotation); the current off-peak value is 65% (meaning 35% of machines can be
taken out of rotation).
5. HealthScore: Based on the results of the health checks, CB calculates a score from 1 to 10. A
score of 1 indicates a machine in its healthiest state; higher numbers indicate worsening health,
and 10 is the worst. As the HealthScore increases, attempts to manage farm health progress from
CB running healing actions to SharePoint throttling end-user requests.
6. Priority Queue: Sick machines are taken out of rotation based on how bad their health is. This is
done using a priority queue maintained in a Redis cache: weighted health scores are pushed into
the Redis priority queue, and the machines with the worst health are taken out of rotation first.
7. Peer VM monitoring:
a. Sometimes a VM gets into such a bad state that even the Circuit Breaker/HealthMonitor.exe
service running on it cannot take any healing action. These machines end up staying in
rotation for a long time and affect the health of the system.
b. To handle these stuck machines, peer VM monitoring creates a list of leader VMs (i.e., a
few select machines in their best health) which keep watch on their neighbors and take
healing actions if a neighbor stops reporting its health status for an extended period.
8. True sizing:
a. SPO farms do not always use 100% of usable capacity during peak time, and since we never
run any SPO farm at its maximum capacity, there is no direct way to know the real capacity
of SPO farms.
b. The True sizing component uses Circuit Breaker functionality to take machines out of
rotation and increase RPS on the remaining machines to bring their CPU consumption to a
target value.
c. A special state, "sleep", is used to take machines out of rotation for true sizing.



Basic Circuit Breaker flow:

1. Run health validation checks/probes against the VM every 5 seconds.
2. Calculate a weighted score based on the individual health probe results.
3. Push this data into the Redis-based priority queue.
4. Take healing actions on the worst-health machine (if allowed by the SafetyNet threshold).
5. Run warm-up probes before bringing the machine back into rotation.
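A simplified sketch of the scoring and classification step is shown below; the probe names mirror the table that follows, but the weights and cutoffs are hypothetical rather than the production values:

    using System.Collections.Generic;
    using System.Linq;

    enum VmState { Healthy, Sick, Dead }

    static class HealthScorer
    {
        // Probe names mirror the table below; the weights and thresholds are hypothetical.
        private static readonly Dictionary<string, double> Weights = new Dictionary<string, double>
        {
            ["SP ping"] = 3.0, ["STS AppPool"] = 3.0, ["HomePage Probe"] = 2.0,
            ["CPU Usage"] = 1.0, ["Memory Usage"] = 1.0, ["Garbage Collection"] = 1.0
        };

        // probeSeverity: 0 = passing, 1 = failing outright.
        public static VmState Classify(IReadOnlyDictionary<string, double> probeSeverity)
        {
            double weighted = probeSeverity.Sum(p => Weights.TryGetValue(p.Key, out double w) ? w * p.Value : 0.0);

            if (weighted >= 6.0) return VmState.Dead;    // take out of rotation and harvest
            if (weighted >= 2.0) return VmState.Sick;    // take out of rotation and heal
            return VmState.Healthy;                      // stay in rotation
        }
    }

The weighted scores are what get pushed into the Redis priority queue, so the machines in the worst shape are pulled out of rotation first, subject to the SafetyNet threshold.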

List of TOP checks performed by Circuit Breaker:

Probe Name | Comment
SP ping | Tests a custom HTTP method that validates select system-level state from within the Content app pool
STS AppPool | Checks whether important app pools are running
CPU Usage | Checks CPU consumption on the VM
Garbage Collection | Checks the percentage of time spent doing garbage collection
Memory Usage | Checks memory consumption on the VM
HomePage Probe | Basic SharePoint page validation that ensures the VM can serve basic SharePoint page content



Chapter 8 Identity and Compliance
The Identity team in SharePoint Online is the gatekeeper for every incoming request in terms of
authentication. It owns not just end-user requests but also all Service-to-Service (S2S) calls from other
services into SPO. We are also known as the AuthZen team within the SCORE organization
([email protected]). For more information about Auth, you can reach out to
[email protected]. You can find our latest updates on our wiki here: AuthZen - Overview
(visualstudio.com)

SharePoint Online, like all other M365 Services, uses AAD (Azure Active Directory) as the identity
provider. What this means is that every M365 service doesn’t need to provide a redundant sign-in
experience and can instead rely on AAD’s identity platform and sign-in experience. All tenants and their
associated users as well as licenses are centrally stored in AAD. Other M365 Services can then redirect
all sign-in requests to AAD. And with the incoming token, they can validate the identity in the context of
the tenant. This also implies that M365 services sync these objects from AAD to their own service farms.
In SPO’s case, we sync these objects into our local directory (SPODS as well as SQL SPODS) for faster
access and for reduced calls (less COGS) to AAD’s directory.

There are two major authentication flows – User sign-in, and Service-to-Service calls.

8.1 User sign in


This can be further categorized as:

a) Browser (through OpenID Connect protocol)


b) Desktop native clients like Word (through OAuth protocol) & Mobile app like OneDrive on
iOS. (Through OAuth protocol)

8.1.1 Browser
For the browser sign-in, when SPO detects that there is no FedAuth cookie in the incoming request, it
redirects the user to AAD. At this point, AAD may prompt the user for credentials if the user hasn’t
signed in recently to any of the M365 Services. In some cases, depending on the tenant configuration,
AAD itself may not prompt for a sign-in but instead redirect the user to the tenant's own on-premises
Active Directory Federation Services (ADFS) for authentication.

As mentioned above, SPO uses the OpenID Connect protocol (Final: OpenID Connect Core 1.0
incorporating errata set 1) and requests AAD for an id_token as well as a code. When the user has
completed authentication at AAD, they are redirected by AAD back to SPO with the artifacts (id_token,
code). SPO then verifies the identity of the user in the incoming id_token by checking this identity in its
local directory (SPODS) and confirming its existence.

If this is successfully confirmed, SPO completes the rest of the sign-in by generating an internal token
(for our own book-keeping), storing this token in the local memory cache as well as distributed cache,
and most importantly, generating a cookie called FedAuth which is a signed representation of some
critical claims. This cookie is issued for the tenant’s domain, say, contoso.sharepoint.com. Henceforth,
as the user interacts with the site, this FedAuth cookie is sent by the browser; SPO validates the
cookie's signature, finds the internally stored token in the cache mapped to this cookie, sets the
user's identity on the thread, and execution moves on to the page or resource requested.



SPO FedAuth cookies have a 5-day lifetime. However, the id_token from which the cookie originated only
has a 24-hour lifetime, so SPO refreshes the cookie periodically during user requests and will keep
extending the lifetime as long as the user is active every day. Additionally, SPO performs an hourly
policy check with AAD to affirm that the incoming request with the FedAuth cookie is still allowed.

8.1.2 Desktop Office Clients & Mobile Apps


Native Office Clients like Word use an OAuth flow with AAD and SPO (RFC 6749: The OAuth 2.0
Authorization Framework (rfc-editor.org)). When the client connects to SPO with no artifacts and
receives a 401 authentication challenge from SPO, it parses the endpoint in the SPO challenge and goes
to AAD’s endpoint to sign-in the user. A successful sign-in at AAD results in an access token being
received by the client. This access token can have a lifetime of anywhere from 1 hour to 24 hours. The
client then sends this access token in a request to SPO’s SP.Auth.NativeClient/authenticate endpoint.
The access token is validated by SPO in a way similar to what was described above in the browser flow.
A successful completion of this request in SPO results in a persistent cookie being sent to the client
(SPOIDCRL cookie). This cookie is similar to the FedAuth cookie described above for the browser flow.
Henceforth, the native client will use the SPOIDCRL cookie for requests to SPO. Since it is persistent, it is
also used by the Office Clients between reboots of the client. However, periodically, the clients will
throw away the cookie and go to AAD to get a fresh access token. Mobile apps work in a way similar to
desktop office clients. The key difference is that they do not use cookies, and only use access tokens.
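The sketch below illustrates the client side of this exchange, assuming MSAL is used to obtain the AAD access token. The authority, scope, and exact authenticate URL path are assumptions made for illustration; this is not the Office clients' actual implementation:

    using System.Net.Http;
    using System.Net.Http.Headers;
    using System.Threading.Tasks;
    using Microsoft.Identity.Client;

    static class NativeClientSignIn
    {
        // Illustrative only: obtain an AAD access token via MSAL and present it to SPO's
        // SP.Auth.NativeClient/authenticate endpoint. Authority, scope, and URL path are assumed.
        public static async Task<HttpResponseMessage> AuthenticateAsync(HttpClient http, string clientId)
        {
            IPublicClientApplication app = PublicClientApplicationBuilder.Create(clientId)
                .WithAuthority("https://login.microsoftonline.com/contoso.onmicrosoft.com")
                .WithDefaultRedirectUri()
                .Build();

            AuthenticationResult auth = await app
                .AcquireTokenInteractive(new[] { "https://contoso.sharepoint.com/.default" })
                .ExecuteAsync();

            var request = new HttpRequestMessage(HttpMethod.Post,
                "https://contoso.sharepoint.com/_api/SP.Auth.NativeClient/authenticate"); // assumed path
            request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", auth.AccessToken);

            // On success SPO sets the persistent SPOIDCRL cookie on the handler's CookieContainer,
            // which the client then presents on subsequent requests instead of the access token.
            return await http.SendAsync(request);
        }
    }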

8.2 Service-to-Service Calls


Many other M365 services make calls to SPO directly. These are known as service-to-service calls (S2S).
We handle tens of billions of S2S calls a day from other services. The reason that SPO is one of the most
frequently called services is that all other services typically want to access a user’s content stored in
SPO. All S2S calls are done using the OAuth protocol and Microsoft extensions to OAuth.

Say you are in the Outlook Web Client (OWA) connected to the Exchange service (purely as an example;
this is not exactly the way OWA works, but other clients work this way). You create a new message to a
colleague and wish to attach a file from your OneDrive. The Exchange service would make an S2S call to
SPO for the user's OneDrive contents and then show these files to the user by returning them to the
OWA client.

S2S calls happen with two possible identities: a) an app+user identity containing both the app
identity and the user identity, or b) an app-only identity, in which case the user identity is not
present in the call at all. The latter is a highly privileged call and is not recommended due to the
security concerns with this pattern. However, if this pattern is unavoidable, you will have to get the
appropriate permissions (say Sites.Read.All) from AAD, and SPO will additionally grant granular
permissions (aka logical permissions) to specific APIs within SPO (instead of all APIs) to further
secure this model. We have an onboarding and maintenance process for the 45+ M365 partner services
that we serve.

S2S calls can also be viewed through the incoming/outgoing pivot, i.e., whether the S2S call is
incoming into SPO from another service or outgoing from SPO to another service. ODSP Auth Patterns
(sharepoint-df.com)

8.2.1 Incoming Into SPO


If you are an M365 Service making calls into SPO, the standard pattern to use is PFT (Protected
Forwarded Tokens). PFTs are tokens issued by AAD for a user using your own service and you will just

Understanding SPO v2023 52 Microsoft Confidential


be transforming these PFTs and forwarding them to SPO. This pattern uses the Authorization:
MSAuth1.0 HTTP request header instead of the usual Authorization: Bearer HTTP request header. The
preference is that Proof Of Possession (POP) is used for these incoming calls to secure the call. This
specifically corresponds to the following header:

Authorization: MSAuth1.0 AT_POP <token>

where <token> is a JWT token containing the pft claim (which contains the actual user token issued by
AAD) and the at claim (which contains the actor token issued by AAD for your app). The JWT token is
signed by your own certificate (hence POP, i.e., Proof Of Possession of this certificate).
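A rough sketch of assembling such a header is shown below, assuming the System.IdentityModel.Tokens.Jwt library. The claim names follow the description above, while the lifetime and overall shape are illustrative rather than the exact production format:

    using System;
    using System.IdentityModel.Tokens.Jwt;
    using System.Security.Claims;
    using System.Security.Cryptography.X509Certificates;
    using Microsoft.IdentityModel.Tokens;

    static class PopHeaderBuilder
    {
        // Wrap the AAD-issued user token (pft) and actor token (at) in a JWT signed with the calling
        // service's own certificate, then send it as "Authorization: MSAuth1.0 AT_POP <token>".
        public static string Build(string pftFromAad, string actorTokenFromAad, X509Certificate2 serviceCertificate)
        {
            var jwt = new JwtSecurityToken(
                claims: new[]
                {
                    new Claim("pft", pftFromAad),
                    new Claim("at", actorTokenFromAad)
                },
                expires: DateTime.UtcNow.AddMinutes(10), // hypothetical lifetime
                signingCredentials: new X509SigningCredentials(serviceCertificate));

            return "MSAuth1.0 AT_POP " + new JwtSecurityTokenHandler().WriteToken(jwt);
        }
    }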

The old pattern without POP is not recommended or supported for new scenarios: Authorization:
MSAuth1.0 PFAT

Also not supported for new scenarios are the ACS legacy protocol, as well as AppAssertedUser or
ServiceAssertedApp in the EVOSTS protocol.

8.2.2 Outgoing From SPO


We currently make outgoing calls to many other services like EXO and Substrate. The recommended
pattern is with EvoSTS and PFT tokens (just as for incoming calls mentioned above). POP support for
outgoing calls is currently in development.

8.3 Resilience To AAD Outage


As can be seen above, we are entirely dependent on AAD, and any outage in AAD can have an outsize
impact on SPO if proper mitigations are not in place. To mitigate any such disastrous impact, we
developed a two-pronged solution to be more resilient to AAD outages.

Resilience Prong 1: Avoid going to AAD

The 1st prong is to avoid going to AAD as much as possible. We do this in 3 ways:

a. For Actor tokens that are needed for outgoing calls from SPO to other services, we not only
cache these tokens (24-hour lifetime) but also proactively re-fetch them when they have less than
12 hours of lifetime left (a sketch of this policy follows this list). This has allowed us to
achieve a 95%+ cache hit rate and to survive AAD outages of up to 4 hours (AAD outages are usually
resolved in much less than 4 hours).
b. For access tokens used by native/mobile clients, our support for Continuous Access
Enforcement has meant that these clients receive tokens with a 24h-28h lifetime, as compared
to tokens that used to have a 1 hour lifetime. Our innovation with CAE helps greatly during an
outage because of the longer lived tokens being used by the clients.
c. For browser sessions, we take a more lenient stance during an outage, and provide forgiveness
for expired sessions if they expired in the last 48 hours. This forgiveness allows us to serve the
browser sessions without redirecting the user to AAD during an AAD outage. We forgive about
1 million expired sessions for every hour of outage in AAD.
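A minimal sketch of the proactive-refresh policy from item (a) follows; the types and method names are hypothetical, and the real cache is distributed and considerably more involved:

    using System;
    using System.Collections.Concurrent;
    using System.Threading.Tasks;

    static class ActorTokenCache
    {
        public sealed record CachedToken(string Token, DateTimeOffset ExpiresOn);

        private static readonly ConcurrentDictionary<string, CachedToken> Cache = new();

        // Return a cached actor token for the target resource. Tokens are issued with a ~24h lifetime;
        // once less than 12 hours remain we proactively re-fetch, so a multi-hour AAD outage can
        // usually be ridden out entirely from cache.
        public static async Task<string> GetAsync(string resource, Func<string, Task<CachedToken>> fetchFromAad)
        {
            if (Cache.TryGetValue(resource, out var cached) &&
                cached.ExpiresOn - DateTimeOffset.UtcNow > TimeSpan.FromHours(12))
            {
                return cached.Token;
            }

            try
            {
                CachedToken fresh = await fetchFromAad(resource);
                Cache[resource] = fresh;
                return fresh.Token;
            }
            catch when (cached != null && cached.ExpiresOn > DateTimeOffset.UtcNow)
            {
                // AAD is unavailable but the cached token is still valid: keep serving from cache.
                return cached.Token;
            }
        }
    }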

Resilience Prong 2: Use an AAD backup during outage

The 2nd prong is to have a backup for AAD itself. This is the Credentials Caching Service (CCS) which is
built on the Substrate platform. This solution has been built by AAD itself as a backup solution.



The way it works is that whenever a user signs in during healthy AAD periods, AAD posts a token to CCS
to be used during a rainy day. During an outage, if SPO redirects the user to AAD, the AAD gateway
forwards the request to the backup CCS, and the token is then returned by the gateway to the user and
SPO. Currently, this AAD backup solution has a return rate of 40% for browser sign-ins and up to 70%
for mobile and native clients, greatly ameliorating the user experience during an outage. Also note
that the backup solution is limited in the number of tokens it can return, because certain security
events invalidate the cached token; these include password changes by the user since the token was
cached at the backup, and session revocations due to various signals, including policy changes by the
admin.

8.4 Continuous Access Enforcement (CAE)


SPO needs to respond instantaneously to security posture changes at a user or tenant level. A user
session can get revoked at AAD, which means that SPO should instantly revoke the session for that user
at SPO. Continuous Access Enforcement is an innovation from AAD that SPO has implemented in concert
with AAD to meet this requirement. Prior to CAE, SPO handled user session revocation events (like
password change or user disabled) by rebuilding the user's token every hour using the local directory
and reissuing the cookie if authentication succeeded against the local directory. However,
administrators required faster enforcement, and CAE makes it effectively instantaneous. While we still
rebuild user tokens every hour, we now also react immediately to user revocation events.

CAE is implemented as follows: SPO has its own event hub in Azure, and AAD publishes user revocation
events and policy change events into this hub. SPO also built a microservice (SPAuthEvent) that
listens to the events from the event hub and forwards them to SPO at a REST endpoint (we use MS Graph
to determine the actual portal URL for the tenant, so the event gets routed to the proper Content
Farm).
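A hedged sketch of what such a listener can look like using the Azure.Messaging.EventHubs.Processor library is shown below; the hub name, checkpoint container, and forwarding URL are hypothetical:

    using System.Net.Http;
    using System.Text;
    using System.Threading.Tasks;
    using Azure.Messaging.EventHubs;
    using Azure.Messaging.EventHubs.Consumer;
    using Azure.Messaging.EventHubs.Processor;
    using Azure.Storage.Blobs;

    static class RevocationEventListener
    {
        private static readonly HttpClient Http = new HttpClient();

        public static async Task RunAsync(string eventHubConnectionString, string storageConnectionString)
        {
            // Hypothetical hub and container names; checkpoints are stored in blob storage.
            var checkpointStore = new BlobContainerClient(storageConnectionString, "cae-checkpoints");
            var processor = new EventProcessorClient(
                checkpointStore,
                EventHubConsumerClient.DefaultConsumerGroupName,
                eventHubConnectionString,
                "user-revocation-events");

            processor.ProcessEventAsync += async args =>
            {
                string payload = args.Data.EventBody.ToString();
                // Forward the event to the content farm's REST endpoint (URL is illustrative); the real
                // service resolves the tenant's portal URL via MS Graph before forwarding.
                await Http.PostAsync("https://contoso.sharepoint.com/_api/security/revocationevents",
                    new StringContent(payload, Encoding.UTF8, "application/json"));
                await args.UpdateCheckpointAsync();
            };
            processor.ProcessErrorAsync += args => Task.CompletedTask; // a real service would log here

            await processor.StartProcessingAsync();
        }
    }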

The user revocation events supported are: PasswordChange, AccountDisable, UserDeletion,
UserMFAEnable, UserHardDeletion, PasswordReset, UserDeletionGuest, UserAccountRisk. We handle tens of
millions of these events per day, within a few minutes of the time of the event.

The Policy events supported are: AddCAPolicy, CAPolicyChange, IpRangeChange

Device compliance events like DeviceNotCompliant are slated for support in 2023.



8.5 Compliance & Policies
The Identity team also supports numerous Premium policies that advanced tenants require, like the IP
Policy (allowing only IP addresses within a known range) and the Unmanaged Device Policy (allowing
managed devices to do everything, while unmanaged devices can only view documents but not download
them). There are several other policies, such as Tenant Restrictions V2 and the Block Download policy
by group. We also support policies more granularly at a site level through a feature called
AuthContext, which can be configured, for example, to require Multi-Factor Auth (MFA) on confidential
sites. We have a whole team within Identity that works only on Policies. Labels at a site level can
also be associated with policies, and the Identity team owns this Container Labels feature.

For more information, see:

UnManaged Device Policy: SharePoint and OneDrive unmanaged device access controls for
administrators - SharePoint in Microsoft 365 | Microsoft Learn

IP Policy: Network location-based access to SharePoint and OneDrive - SharePoint in Microsoft 365 |
Microsoft Learn

Site level granular policy with Labels & AuthContext: Manage site access based on sensitivity label -
SharePoint in Microsoft 365 | Microsoft Learn

Information Barriers: Use information barriers with SharePoint - SharePoint in Microsoft 365 | Microsoft
Learn

Block Guest access to Sensitive files: Prevent guest access to files while DLP rules are applied -
SharePoint in Microsoft 365 | Microsoft Learn

8.6 OneDrive Consumer Stack (Within SPO)


To support the OneDrive Consumer user, we provide Authentication support for
*.microsoftpersonalcontent.com (i.e., the Consumer tenant) and lists.microsoft.com. The protocols that
we support here are quite different from the protocols for the Enterprise user.

• Identity Provider is LiveID (not AAD as in Enterprise case)


• Login URL is login.live.com (not login.microsoftonline.com)
• Protocols are
o RPS for browser (not OpenID Connect)
o WLID for Office Client (a variant of OAuth flow)

As can be seen from the above, SPO runs two authentication stacks to support the enterprise and
consumer OneDrive within SPO farms. These stacks use AAD v1 and RPS, respectively (for the browser).
RPS is old and no longer maintained, it costs a lot of engineering effort to maintain both stacks
along with their protocols, and some new clients will not support RPS at all. So, the goal is to
converge consumer account authentication and business account authentication on the AAD v2 stack:
one stack, one protocol. This also fits into the overall organizational goal of ODB/ODC Convergence
at all levels.



We have started the work for convergence by deploying a new Service Application to handle the AADv2
protocol that supports both Enterprise accounts and MSA Consumer accounts. Initially, this will
support traffic for lists.microsoft.com and *.mpc.com. Eventually, we will migrate a few
sharepoint.com scenarios to this application as well for early insights. Finally, we will upgrade the
existing SharePoint Service application to V2 at some point in 2023.

Currently, we block all third-party apps from direct access to the SharePoint consumer tenant and
allow only a designated list of 1st party apps or calls proxied through Graph.

8.7 OneDrive Consumer Stack (Within Consumer Farms)


As part of convergence, the AuthZen team also officially owns the technology running OneDrive
Consumer Farms (separate from SPO farms). This is currently owned by the IDC team and is outside the
scope of this document, which focuses purely on SharePoint Online.

8.8 SharePoint Online Directory Service (SPODS)


Every SharePoint Online network includes a directory service called SharePoint Online Directory Service,
or SPODS. This directory service contains the directory information, such as licenses and subscriptions,
users, groups, membership information, etc., for all tenants present in this network. The source of truth
for this information is in AAD, specifically MSODS (Microsoft Online Directory Service).

Within MSODS, there are partitions known as service instances (SIs). When a company signs up for an
M365 service, their directory information is placed into some SI for that service, for example a
SharePoint Online license might result in a company being placed into an SI called “SharePoint/APAC-
0007”, whereas an Exchange Online (EXO) license might also result in the company being placed in an
EXO SI like “Exchange/apcprd03-009-01”.

The relationship of SharePoint networks to SharePoint SIs is 1:N: each SPO network is associated with
one (or, in a few cases, multiple) MSODS service instances (SPO networks only associate with
"SharePoint/*" SIs, never with, for example, Exchange SIs). Each MSODS service instance that belongs
to SharePoint is associated with only one SharePoint network.

Due to the volume of requests that must be made for directory information by SPO (for, among many
other things, authorization (authz)), it is untenable to use MSODS directly to service such requests.
Therefore, it is necessary to maintain a local copy of the data from a given SI in the SPO network that
that SI is associated with. That local copy is SPODS.

More information on how SPODS is structured and how SPODS is kept in sync with MSODS can be found
in Tenant Life Cycle



Chapter 9 Storage
The SPO service contains 3 major storage subsystems. Unstructured data (customer files) is stored in
Azure Blobs, structured data (file metadata) is stored in Azure SQL databases, and transient data is
stored in Azure Redis nodes (covered in Chapter 15).

9.1 Blobs
9.1.1 Why a Blob Store?
• Storing file data in SQL creates many challenges:
o SQL requires frequent (weekly or more) full backups, resulting in many times the DB size (today 10X) in bytes in the backup system
o SQL maintenance is driven by DB seeding time; larger DBs take longer to seed, increasing incident count
o SQL has high overhead for large varbinary columns (off-row storage), up to 9k per entry
o SQL has highly variable performance for large payloads
o Large file support bumps into physical transaction limits (2GB)
o Large tlogs create replication challenges (latency), putting the DR SLA at risk
• As SPO grew, the database backup system failed to scale physically and financially. With less than 1 year until collapse, ABS was started and delivered on time to avert the crisis

• ABS is typical of successful large-scale service solutions:

o ABS replaced a simpler but scale-challenged solution with a more complex but scalable solution
o Engineering the more complex, scalable system is a more solvable problem, and once solved it pays off repeatedly, especially during rapid growth
9.1.2 ABS: An Abstracted Partitioned Blob Store
• ABS is not a thin wrapper over Azure Storage Block Blob; ABS is designed to be hosted on any compatible backend store.
• Existing hosts are the in-memory store (for testing) and the Azure Storage Block Blob system (production).
• ABS has a defined 'host services' contract with the backing store that allows additional hosts to be created if/when needed.
• ABS does not expose the host store to clients. The client API surface is identical regardless of the host (but some hosts don't support all features).
• Partitioning is client controlled using a supplied PartitionID (string). All blobs are identified by their PartitionID/ABSId. The PartitionID was added to support clients' 'Delete Partition' type requests.
• Immutability is provided even on mutable stores (e.g., Azure Storage Block Blob).
9.1.3 ABS solution to the SQL Backup Problem
• ABS replaces backup with 'Deferred Deletion'.
• Being an immutable, append-only system allows simple retention to support metadata rollback at very low cost.
• All blobs that are to be deleted are 'aged' in a SQL table until all possible SQL backups that could restore a reference to the blob are gone.
• This provides a logical 'full backup' for the cost of only 1.8% additional storage per week, versus 500% via SQL backup.
• The delete aging table entry (DeletedABS) is added in the same transaction that removes the reference (see the sketch after this list). On DB rollback, blobs are 'auto' undeleted because the rollback that restores a reference also removes the deletion table entry.
• Jobs that process the DeletedABS entries make further checks to handle aliasing and other restore workflows. Once a blob passes the checks, it is deep deleted.
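The following sketch illustrates that same-transaction pattern. The reference table and column names are hypothetical (only the DeletedABS aging table is named above), and the real implementation lives in stored procedures rather than ad-hoc SQL:

    using System;
    using System.Data.SqlClient;

    static class BlobDeferredDelete
    {
        // Illustrative sketch: remove the blob reference and add the deferred-delete (aging) entry in
        // the same transaction, so a metadata rollback that restores the reference also removes the
        // aging entry and the blob is "auto" undeleted.
        public static void RemoveReference(SqlConnection openConnection, string partitionId, Guid absId)
        {
            using (SqlTransaction tx = openConnection.BeginTransaction())
            {
                using (var removeRef = new SqlCommand(
                    "DELETE FROM dbo.BlobReferences WHERE PartitionId = @p AND AbsId = @id",
                    openConnection, tx))
                {
                    removeRef.Parameters.AddWithValue("@p", partitionId);
                    removeRef.Parameters.AddWithValue("@id", absId);
                    removeRef.ExecuteNonQuery();
                }

                using (var age = new SqlCommand(
                    "INSERT INTO dbo.DeletedABS (PartitionId, AbsId, DeletedUtc) VALUES (@p, @id, SYSUTCDATETIME())",
                    openConnection, tx))
                {
                    age.Parameters.AddWithValue("@p", partitionId);
                    age.Parameters.AddWithValue("@id", absId);
                    age.ExecuteNonQuery();
                }

                tx.Commit();
            }
        }
    }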
9.1.4 ABS v1.0 (Blob Drain)
• v1.0 attacked database backup size by reducing DB bytes.
• Transaction size was not reduced, as all blobs were still written to the DB first.
• Background jobs moved in-SQL file blobs to ABS.
• The background job was effective for data written before the ABS rollout and 'ok' for net new writes.
• An effective end-to-end hashing system ensures no corruption in transit (from SQL to Azure Storage Block Blob).



Results:

Original SQL Size | Post Blob Drain and DB Shrink | Blob size total | Overhead 'gone'
100GB | 3GB | 95GB | 2GB

• Site size limits were immediately raised from 100GB to 1TB.

• v2.0 would address transaction limits and reduce tlog sizes for better/cheaper metadata replication.

9.1.5 ABS v2.0 (Direct Write)


• Once the DB backup capacity crisis was averted, the team moved on to address the other limitations, namely:
o Large file support limited by SQL transaction size limits
o Tlog size impacting replication speed and system health
• The Direct Write approach attempts to write all blobs to ABS before committing the SQL update to the file with references to the just-created blobs.
• The option to write to the DB was maintained and is still used today.
• Writing to the DB is not ideal but helps with availability and reliability.
• When v2.0 was written, SPO SQL was on large spinning disks with plenty of extra disk space for temporary growth. With the move to SSD, falling back to the DB for blob writes has less buffer.
• Today 99.9% of file blob writes go to ABS. A small amount go to the DB based on the timeout and fallback system.
• Large file support is enabled, as SQL transactions for large file updates are now metadata sized.

9.1.6 ABS v3.0 (Global system)


• All farms have Read creds to all Blob systems
• Metadata team can move compute/metadata as they need
• Cross region read supported for immediate site R/W after metadata move
• Automatic discovery of moved metadata with async online blob move to the new location
• Automated Global credential distribution system deploys to all (several hundred) farms
• Auto-rebalancing of storage accounts for live traffic
• Data integrity scanner running across all Dbs / partitions
• Pending-deletes pipeline support for all moves
• Blob platform (Queue / Container system built on Azure Storage)



9.1.7 How ABS differs from hosts like Azure Storage Block Blob
Item | Azure Storage Block Blob | ABS
Availability | Cluster or region issues impact availability for Read/Write | Resilient to cluster down and region down for Read and Write. Hardened against DNS failures.
Byte/TPS limit | Account 5PB max / 25k TPS. Single cluster limited. | 300PB per ABS system, 1.5m TPS. Blob reads distributed across 60 Azure Storage accounts on multiple physical clusters.
Blob Id | Client controlled string. Write failures due to conflicts possible. | ABS-generated crypto-random GUID string. No conflicts.
Encryption | Service or client library controlled. Many blobs to one key model. | ABS-implemented encryption/decryption, unique key per blob (UKPB).
Blob Read/Write | Single operation, single blob | Batch operations supported on multiple blobs (async)
Immutability | Not supported. | Fully supported; all writes to existing blobs fail.
Replication to DR | Asynchronous, no SLA | Synchronous, RPO=0 SLA
Partition Limit | Container count limited to thousands if container policy used. | Unlimited partitions. SPO has 1 billion partitions (sites) today.
Read latency | Driven by single request | Dual read reduces 99th percentile latency by reading the DR copy in parallel
Data needed to read blob | Blob name, cred to container | PartitionID, ABSId, ABSInfo (opaque byte array containing encryption keys etc.), ABSLocatorId (address)
BYOK support | Account level support only. Single key vault dependency. | Partition level support. High availability using 2 key vaults in different geos with compliant caching to handle DNS/AD issues.

9.1.8 ABS System Capabilities


• High Scale: 300PB and 1.5 million TPS per system
• High Availability: 99.999% DNS hardened, Dual Read, Write auto shaping
• High Durability: 99.999999999% durability
• High Reliability: 99.999% success
• Synchronous Replication: compatible with async metadata, RPO=0 RTO=0
• Active-Active system with zero gesture failover
• Compliant Encryption at rest and in transit
• Security
o Least-privileged access
o Intrusion Detection
o Immutability
o Crypto-Random Ids
• Data Integrity: at scale metadata/blob system verification
• Tenant move, multi-geo move and go local move at scale worldwide
• Tenant scoped Bring-Your-Own-Key



9.1.9 ABS Scale
• ABS uses pooled Azure Storage LRS block blob accounts, 60 in each region for 120 per system.
• These pooled accounts provide:
o 300PB of storage
o million TPS per region
o Fully active-active, RPO=0 RTO=0. SPO farms in both regions are R/W against a single system
• ABS automatically balances writes across the pool to prevent 'hot' accounts.
• Multiple SPO scale units (farms) are supported by each account pool:
o Prevents resource islanding
o Provides a smoother traffic ramp (useful for Azure Storage dynamic compute scale out)
o Simpler management with 10X fewer pools than SPO farms

The figure below shows the Primary and DR pool approach for ABS, providing 300PB of storage and 1.5 million TPS.

Figure 9-1 SPO blob storage system



9.2 Databases
As of 2022, SPO hosts around 200,000 databases in SQL Azure services, including the GeoDR copies. SPO
code communicates with databases through a common layer of libraries and utilities which are shared
by many processes and components. This section will go over the key concepts related to SPO
databases.

9.2.1 Database roles


As shown in the following figure, every SPO content farm has one Config database, which stores farm
configuration information such as server names and timer job schedules and history. Each farm also has
one SiteMap database, which stores the mapping from site URL to content database. The customer files
are stored in blobs, with reference IDs stored in the content databases. Content databases contain the
other metadata of the customer files, such as last modified time and author's name.

Figure 9-2 SPO database architecture

There are two types of content databases: shared content databases and dedicated content databases. A
typical tenant has all its sites in a single shared content database, which is shared with other
tenants. When a tenant grows big, it is isolated into its own dedicated content database, which is not
shared with other tenants. An isolated tenant can occupy multiple dedicated content databases.

When a database grows too big, as determined by thresholds such as site collection count or document
count, it is automatically split into two smaller databases.



Every SPO farm keeps a different set of keys to access the databases. These keys are rotated every 90
days.

9.2.2 SQL Azure SKUs


A typical SPO farm has several hundred content databases. To use SQL Azure resources more efficiently,
we put content databases into elastic pools; each pool can host multiple databases. Content databases
with high load are hosted in standalone mode instead of an elastic pool, so that they don't use up all
the resources and impact neighboring databases.

Config and SiteMap databases are considered "single point of failure" databases for the farm, so they
are also hosted in standalone mode, so they are not interfered with by content databases.

SPO uses the SQL Azure vCore purchase model, which allows the service to scale CPU and storage
independently. A typical SPO elastic pool has 10 vCores and can go up to 80 vCores. SPO hosts the vast
majority of databases on the Business Critical service tier, which keeps 4 local nodes and provides
high IO bandwidth, a hot standby node, and a second node dedicated to read-only access. The RO node is
used extensively in the SPO service to offload load from the primary node. A few databases with low IO
demand are hosted in the General Purpose service tier, which has the advantage of being more cost
effective.

Figure 9-3 How SPO databases connect to SQL Azure



9.2.3 Database capacity modeling, resource rebalancing and auto-heal
Since SPO runs hundreds of thousands of databases, it is critical to have a process that monitors and
rebalances these resources for optimal consumption. These management gestures handle two types of
traffic increase: organic growth over several weeks, and sudden spikes that need immediate attention
within minutes.

To handle organic growth, we collect CPU, worker thread percentage, and other metrics from every
standalone database and elastic pool. We average these metrics over 15-minute spans and compute a
"score" based on the top 15-minute spans from the past seven days. An elastic pool with a score of 60%
means usage in the peak 15-minute span for this pool is at about 60%.

During off-peak hours, we rebalance resources, trying to keep the pool score within an optimal range.
The range currently used is (40%, 75%), but that is subject to change. We split an elastic pool if its
score is too high and it contains too many databases; upgrade the elastic pool if its score is high
and it doesn't contain too many databases; eliminate the elastic pool by distributing its databases to
other elastic pools; and downgrade the elastic pool if its score is low and it cannot be packed away.

During daily rebalancing, we try to keep the elastic pool vCore count between 10 and 20. A vCore count
that is too low is not efficient and may not have the power to handle peak-hour demand. Too big a pool
also has its problems: upgrades and downgrades take longer, because more data needs to be copied from
one node to another, which causes problems during an incident because we cannot scale fast enough.
Furthermore, too many databases in a pool may also introduce session limit problems.
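A simplified sketch of the scoring and rebalancing decision described above follows. The (40%, 75%) range and the 10-20 vCore band come from the text, while the database-count cutoffs and action names are invented for illustration:

    using System.Linq;

    static class PoolRebalancer
    {
        // Score = the highest 15-minute average utilization observed over the past seven days.
        public static double ComputeScore(double[] fifteenMinuteAverages) => fifteenMinuteAverages.Max();

        // Keep the pool score inside the (40%, 75%) optimal range and the pool at roughly 10-20 vCores.
        public static string ChooseAction(double score, int databaseCount, int vCores)
        {
            if (score > 0.75)
                return databaseCount > 50 ? "split pool" : "upgrade pool (add vCores)";
            if (score < 0.40)
                return databaseCount < 5 ? "drain pool into other pools" : "downgrade pool (remove vCores)";
            return (vCores >= 10 && vCores <= 20) ? "no action" : "adjust vCores toward the 10-20 band";
        }
    }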

To respond to sudden spikes quickly, we installed multiple monitors. They fall into three categories:
active monitors, which actively send requests to each database to make sure it is alive; Azure alerts,
which monitor database metrics such as CPU% and worker thread% from the Azure side; and passive
monitors, which track QoS data based on end-user traffic results, such as key stored procedure latency
and errors. These monitors generate alerts, which are hooked up to a response system called Diagnostic
Scripts that allows custom code to run and automatically heal the database. Depending on the
situation, auto-healing actions may increase storage, perform a SQL node failover, upgrade or split
the elastic pool, or fail the database over to its geo-secondary.



9.2.4 Database telemetry and query attribution
The following figure shows how SPO collects SQL telemetry information and does query attribution.

Figure 9-4 SPO Database Telemetry processing

The vast majority of SQL query execution in SPO goes through a class called SqlSession, shown as the
green box in the top left corner. Among other things, this class manages connection pools to the
ReadWrite and ReadOnly instances of the same SQL database. It also tracks connection speed and can
decide to throttle requests when the SQL connection becomes too slow. This class also logs query
information to be uploaded to the COSMOS store later for further aggregation and analysis. The SQL log
includes the ID of the caller app and much other useful information which can be used for cost
attribution. Once this information reaches the COSMOS store, a daily job generates SQL telemetry and
attribution reports, which can be used to analyze SQL usage in depth. This aggregation process is
represented by the blue boxes in the figure above.

The orange box in the figure above lists several important timer jobs which enhance and utilize the
SQL telemetry. The Sproc attribution job and SqlQuery attribution job are needed because the SQL log
alone does not give the complete picture of SQL usage, as it contains only the stats seen by the SQL
client. These two jobs enhance the data by querying the database DMV tables to obtain SQL server
statistics, such as milliseconds spent on the CPU core, and associate them with the SQL query
executed. The Sql Metrics Collector job collects metrics such as CPU and worker percent every 10
seconds and saves the data to the Redis cache and Geneva MDM. This data is shared by all front-end
machines and can be used to make database throttling decisions. The SqlDiagnostic job uses the
collected telemetry to periodically kill queries that block too many other queries, to keep databases
running smoothly.

More information related to SPO database usage attribution and database performance stats can be
found at http://spo-rt/SqlCPU and http://spo-rt/SqlPerf. The following is a sample chart showing the
attribution of SQL usage from different applications on a given day.

Figure 9-5 SPO Database usage attribution sample



Chapter 10 Internal workloads
SharePoint Online requires a mechanism to run tasks necessary to provide its services. These tasks
include (but are not limited to) the following categories:

• Deferred functionality - that can run asynchronously as a follow up to a user / admin action
• Tenant admin provisioning / update actions
• Security and Compliance - Information Protection, Antivirus Scanning, …
• Periodic computation of user visible quotas etc.
• Periodic optimization and maintenance - on object in the content database / blob storage
• Integration / sync - user profile data, webhooks, content types, …
• Migration
• Monitoring and alerting

To run these tasks, SharePoint Online has its own scheduled tasks management service, manifested as
Timer Service instances installed on the BOT and USR VM roles in content farms. Most of these tasks are
carried out on the BOT role, but some execution happens on the USR role as well. If the Timer Service or
any of its instances begins to malfunction, it will not take long for problems to begin appearing across
the farm.

Another critical workload that needs separate discussion is Search Crawl - which responds to create,
update, delete operations on user content and pushes changes to the M365 search index. This workload
runs in its own processes (independent of the Timer Service) on the BOT role within all content farms.

10.1 Need for a separate VM role


Of the various workloads discussed above, Search Crawl traffic and resource consumption (CPU, memory,
SQL, network) patterns correlate positively with API traffic on the respective farm. Search workloads
are CPU intensive.

Timer jobs are units of functionality encapsulated within .NET classes and authored by various
engineering teams within (and outside) SharePoint. Today, we have more than 300 active timer jobs.
They vary in terms of schedule, the resources they operate on, and their CPU and memory consumption.
Unlike Search, timer jobs usually take up to 10% of CPU on a loaded BOT machine but are memory
intensive. They also contribute about 10% of the downstream COGS on Azure SQL.

To avoid internal workloads' resource consumption inadvertently impacting user-facing web and API
traffic, these workloads are largely restricted to running on the BOT VM role. The USR to BOT VM ratio
in a content farm is 2.5:1. Additionally, BOT VMs are allocated half the CPU, memory, and disk of USR
VMs, resulting in a 5:1 resource allocation ratio between the two roles.

This separation allows BOTs to run at 100% utilization, while USRs need to be kept at 65% utilization
to allow for spiky traffic.



10.2 Timer Jobs
As explained above, timer jobs are defined as .NET classes in the SharePoint Online codebase which
describe the actions they will take. All timer jobs share a common superclass ancestor,
SPJobDefinition, and implement their functionality by overriding the Execute() method.
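Based on the public SharePoint server object model, a minimal job definition looks roughly like the sketch below; the job name and body are illustrative rather than an actual SPO job:

    using System;
    using Microsoft.SharePoint.Administration;

    // Illustrative job that runs once per content database (see Lock Types below); not an actual SPO job.
    public class SampleCleanupJob : SPJobDefinition
    {
        public SampleCleanupJob() : base() { }

        public SampleCleanupJob(string name, SPWebApplication webApplication)
            : base(name, webApplication, null, SPJobLockType.ContentDatabase) { }

        public override void Execute(Guid targetInstanceId)
        {
            // targetInstanceId identifies the content database this invocation is locked to.
            // Real jobs do their scoped work here, respecting their schedule and resource budgets.
        }
    }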

10.2.1 Lock Types


All timer jobs have a scope of action - an object type upon which they are intended to work. This scope
may be the farm, an individual VM, or a content database. That is, for each iteration of a job, it is
expected to execute once per farm, once per target VM, or once per content database.

These are declared in the lock type of the timer job: Job, None, and ContentDatabase.

Lock Type | Description
Job | The job runs only once per farm per iteration.
ContentDatabase | The job runs once for each content database. This lock type is ideal for most scenarios that need to process an entire content farm.
None | The job runs once per target VM (USR/BOT) instance in a farm. Each of these instances will handle every Content DB, with no locking. This is ideal for work-item based scenarios where time to pick up needs to be near instant, as the multiple threads hitting all Content DBs should ensure something picks the work item up almost immediately.
The reason None is an appropriate description for a VM-specific lock is that there is exactly one Timer
Service Instance per VM. When each timer service instance runs a job without checking or taking any
locks, the result is that the job is run once per VM. So "no lock" or None is the logical equivalent of a per-
server lock.

From 2022, there is another option available. Assignments are a feature designed to replace locks.
Rather than being limited to one of three thread models (1 per set of Content DBs, 1 per Timer service, 1
per farm), assignments allow job owners to customize their thread model. Assignment definitions
specify the number of threads to run per target resource, the number of Timer service instances those
threads should be spread across, and the maximum number of threads a Timer service instance should
run at a time. The target resource is usually, but not always, a content database. Assignment definitions
can also specify filters for resources and service instances. For example, an assignment might only run
against DBs that are not in read-only mode and might only run on BOT machines.

10.2.2 Schedule
The schedule determines how often a job will run. Using a schedule allows jobs to run at a desired
period as opposed to running continuously and checking whether work needs to be done. Jobs share
resources like threads, network bandwidth, and SQL cost with other jobs, so authors are responsible
for ensuring that their job does not run any more frequently than necessary.

The shortest schedule allowed is every 1 minute. The timer service introduces jitter into job execution
by randomizing execution within the limits of the specified schedule. This prevents jobs of a certain type
from starting at the same time and overloading downstream resources like the content DB.



10.2.3 Work Item Type Jobs
Jobs that derive from SPWorkItemJobDefinition support processing work items queued into the Content
DB. The job definition allows for specifying a unique ID that can be used by producers to queue a work
item. The job can then fetch queued work items by invoking a stored procedure with its unique ID.

From 2021, the timer service can prefetch work items for opted in jobs based on their schedule, thus
reducing multiple jobs hitting the content database. It also optimizes job invocation by skipping those
work item type jobs that have nothing to process.

Work item type job owners are responsible for enqueueing at a reasonable rate. What is reasonable
depends on the scenario and/or SLA, as well as the rate of dequeuing/processing. If the enqueuing rate
exceeds the dequeuing/processing rate, the work item queue builds up and causes service degradation on
the farm. Since 2021, work item jobs are throttled at the entry point, and producers need to be able
to handle the corresponding failures. The queue size threshold is set per work item type; the current
default is 30M.

10.2.4 Pausable Jobs


Inheriting from SPPausableJobDefinition (or derived classes) allows jobs to support a graceful pause
when the timer service restarts either on schedule or due to a deployment. Pausable jobs can choose
their strategy to handle pause requests either by abandoning its work early or saving its current state to
be picked up on resume.

The timer service passes a state object to pausable jobs in the Execute() method, persists the state upon
pause and restores it upon resume.

10.2.5 Documentation
If you are interested in learning more about the Timer Service, how to author your own job as well as
best practices for authoring, the Timer Service Wiki is the best place to start.

10.3 Content Push Service / Search Crawl


Search Crawl (also known as the Content Push Service) is the other significant workload running on BOT
machines. SPO content service and the search service are loosely coupled. Both can scale independently
and perform gestures with a high level of isolation.

Content Push Service (CPS) was introduced in 2019 and replaces the old Search Crawler from the Search
Farm. CPS runs on the BOT roles in content farms, and the crawl state is stored in the Content DBs.
The CPS service is tightly integrated with SharePoint and introduced the concept of scenario-based
priority queues so that highly visible user changes can be pushed in seconds. CPS submits updates to
an Azure service owned by the FAST team called SCS (Search Content Service).

Search Content Service (SCS) launched in 2014. This was a key project and directional shift that
disconnected the crawler from the search farm: the crawler submits content to SCS, and the search
farms pull content from SCS. This enabled warm standby for search farms and dual indexing on PR & DR.
SCS also became key for content routing for ingestion into Substrate. SCS is now one of the largest
Azure services.

The completion of Project Greenland means that Search Content Service (SCS), currently an integral
part of distributing search data to search farms, becomes more akin to a persistent Substrate queue,
with much similarity and overlap with the data structures of the ODSP Content Push Service (CPS).



Project Vinland is a multi-year MAP project to rationalize these capabilities to lower COGS and simplify
maintenance. At the completion of Project Vinland, we will retire the SCS service and CPS will submit
content updates directly to Substrate.

For more details about Content Push Service see our wiki here: Content Push Service (CPS) - Overview
(visualstudio.com)



Chapter 11 Tenant Life Cycle
A tenant’s lifecycle in SPO begins when its information is initially “synced” into the farm, followed by
“provisioning” of core SharePoint sites based on that information.

11.1 Forward Sync from MSODS


11.1.1 What is forward sync?
As introduced in Chapter 8, every SharePoint Online network is associated with at least one MSODS
service instance (SI); see the Chapter 8 section "SharePoint Online Directory Service (SPODS)" for a
more thorough introduction. To keep the local copy of that service instance's data in sync and up to
date with the source of truth in MSODS (Microsoft Online Directory Service), MSODS supplies a set of
APIs that can be called to obtain a constant stream of changes that have occurred on that SI. Anyone
interested in in-depth details can see the MSODS Sync Service wiki here, or their spec document here.
here.

There is a timer job that runs in the SPO timer service (owstimer.exe) called SyncToAdTimerJob_Sync.
This job is a singleton job, meaning it only runs on one BOT machine at a time in each content farm. At a
high level, the job is responsible for calling the APIs supplied by MSODS, transforming the results into a
form that can be stored in SPODS, and writing those changes to SPODS. The main API used to get the
next set of changes is called GetChanges. This returns a set of changes that must be applied to SPODS, a
flag indicating if there are more changes available in MSODS (more=true or more=false), and a cookie
that must be used on the next call to tell MSODS where we are in the stream.

11.1.2 How does forward sync work?


The cookie returned by GetChanges is a serialized object from the MSODS side; from SPO's point of view
it is an opaque byte array. It is the job of forward sync to ensure that all changes are correctly
persisted to SPODS, and to do so there is particular logic around when the most recent cookie (which
is also stored in SPODS) can be saved. Namely, we can only save a cookie in SPODS once we have
successfully written all of the changes from the batch associated with that cookie, and all of the
changes from the batches before it.
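A simplified sketch of that loop follows; the client and writer interfaces are hypothetical shapes for illustration, and the real job also handles batching, retries, and the recovery queue described below:

    using System.Collections.Generic;
    using System.Threading.Tasks;

    public record DirectoryChange(string ObjectId, string ObjectType);
    public record ChangeBatch(IReadOnlyList<DirectoryChange> Changes, bool More, byte[] Cookie);

    public interface IMsodsClient
    {
        Task<ChangeBatch> GetChangesAsync(byte[] cookie);       // wraps the MSODS GetChanges API
    }

    public interface ISpodsWriter
    {
        Task<bool> TryApplyAsync(DirectoryChange change);
        Task EnqueueRecoveryAsync(DirectoryChange change);      // retried later via GetDirectoryObjects
        Task SaveCookieAsync(byte[] cookie);
    }

    public static class ForwardSyncLoop
    {
        // Drain the change stream, persisting the cookie only once every change in the batch has
        // either been written to SPODS or parked in the recovery queue.
        public static async Task RunOnceAsync(IMsodsClient msods, ISpodsWriter spods, byte[] startCookie)
        {
            byte[] cookie = startCookie;
            bool more = true;

            while (more)
            {
                ChangeBatch batch = await msods.GetChangesAsync(cookie);

                foreach (DirectoryChange change in batch.Changes)
                {
                    if (!await spods.TryApplyAsync(change))
                        await spods.EnqueueRecoveryAsync(change);
                }

                cookie = batch.Cookie;                          // opaque byte[] owned by MSODS
                await spods.SaveCookieAsync(cookie);
                more = batch.More;
            }
        }
    }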

Errors happen, and sometimes changes cannot be applied immediately to SPODS. However, it is important
to continue making progress in the sync stream, so the SI does not build a backlog on the MSODS side
or create a delay where customers notice that changes to their tenant/users/groups/etc. are not
propagated to SharePoint Online. Therefore, forward sync has a concept of a recovery queue, which is
persisted alongside the cookie in SPODS. Items in the recovery queue can be processed using an API
called GetDirectoryObjects, which returns the entire, full state of an object.

For more details (which are out of scope for this document) on how forward sync optimizes processing
of the sync stream, handles the recovery queue, queues full tenant sync requests, or handles tenant
moves, feel free to reach out to [email protected].

11.1.3 SharePoint Online Directory Service (SPODS)


SharePoint Online Directory Service’s goal is to maintain a local copy of the data from AAD. Maintaining
that local copy enables SPO to be resilient to outages in AAD and enables us to meet various
performance targets. As a service, our goal is to ensure the physical storage is abstracted away behind
strong contracts. This contract currently manifests as a client-side component that has inherent



knowledge of how to talk directly to the storage system. The storage system is in a state of transition –
moving from an Active Directory based solution towards an Azure SQL based solution.

11.1.4 Active Directory SPODS


Active Directory (AD) is a core service within the Windows ecosystem in many organizations. AD provides
an organization with capabilities like managing user identities and group memberships, and support for
security authorization checks. Given that MSODS maintains an AD-like structure internally and that
SharePoint was already integrated with AD, the logical choice was to utilize AD as the storage system for
SPODS, which we refer to as Active Directory SPODS (AD SPODS).

As an Active Directory deployment, AD SPODS utilizes the Lightweight Directory Access Protocol (LDAP)
which is an open, vendor-neutral, standard application protocol that was designed for interacting with
directory stores. Within the LDAP protocol, each object contains a set of attributes. Every object also
maintains a unique identifier called a Distinguished Name (DN) which enables one to find the object
similar to a full file path. Objects can be nested within other objects thus allowing one to build a tree-like
structure. Every tenant maps to a top-level object called an OrganizationalUnit (OU) object. Within each
OU object, we store all of the directory objects that belong to that tenant – things like user identities,
group objects, device information, etc. – grouped within appropriate subtrees.

AD SPODS requires special hardware to run; specifically, there is a DS farm in every network with a set of
virtual machines called SPDs. See section 4.3 for a description of the various VM roles. The SPD machine
role is one of several machine roles in SPO. One of these SPD VMs is chosen to be the “primary SPD” and
is responsible for accepting writes from major workloads such as fwdsync and provisioning, and those
writes are replicated out to the other SPDs using AD technology. Reads for flows like AuthZ can be
performed against non-primary SPDs.

11.1.5 SQL SPODS


Running an Active Directory deployment comes with a cost. There are inherent complexities with
managing special hardware. With growth, controlling sizes and various limits is challenging, and we
require engineers with specific AD-related knowledge. Separately, with SharePoint Online's move to
Virtual Machine Scale Set (VMSS) technology, it was determined that running an Active Directory
deployment within that environment was not possible. As a result, we have started to transition the
fleet towards a SQL based storage solution which is referred to as SQL SPODS.

SQL SPODS is implemented as a SQL database (Directory DB) with several tables for each type of entity
that is stored in SPODS, such as Tenant, User, Group, etc., with support for group membership expansion.

For more details or questions about SQL SPODS, you can reach out to [email protected].

11.2 Provisioning Tenants and Users


Provisioning is the act of taking customer information, creating a tenancy for the customer, and creating
core SharePoint sites for the tenant that allow the customer to create and collaborate on information.
First, let’s define a tenant in SharePoint Online:

A Tenant is a representation of a customer in SPO and consists of 3 main parts – Identity, Metadata and
Content.



Identity:
Every Tenant has a unique identifier, a globally unique identifier (GUID), called the CompanyId. The
SiteSubscriptionId has the same value as CompanyId for newer tenants. The CompanyId may also be
called ContextId in some AAD sync components, and TenantId in partner systems like Commerce. The
tenant is a container for the users, groups and membership identities in AAD. Every user and group
have their own unique identities:

• User:
o UserPrincipalName (or UPN), similar to an email address, used for login, may change
over the lifetime of the user.
o Passport User ID, PUID, unique immutable ID (long/hex), never changes for the lifetime
of a user. Enterprise generally keys off of UPN or ObjectID, but you may come across it
in the code. Consumer auth tokens often surface the PUID and CID for identification.
o ObjectId, a GUID that uniquely identifies this object. This is assigned by AAD.
o Other metadata like email, phone, etc.
• Group
o Alias, a name for the group.
o ObjectId, a GUID that uniquely identifies this object. This is assigned by AAD.
o MemberOf and Member metadata, indicating direct group membership.
• Membership
o This is basically a "mapping" of a group to its members, who can be users or other
groups themselves (this creates a Directed Acyclic Graph of dependencies; a minimal
traversal sketch follows this list). Membership is part of the group object.
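
Because membership forms a directed acyclic graph (groups can contain other groups), expanding a user's transitive membership is a graph traversal. A minimal sketch in Python, assuming a simple in-memory groups mapping (the real expansion lives inside SPODS):

def expand_membership(object_id, groups):
    """Return all groups that transitively contain object_id.

    groups: dict mapping group ObjectId -> set of member ObjectIds (users or nested
            groups). Hypothetical in-memory shape, for illustration only.
    """
    # Invert the membership edges: member -> groups it belongs to directly.
    member_of = {}
    for group_id, members in groups.items():
        for member in members:
            member_of.setdefault(member, set()).add(group_id)

    expanded, frontier = set(), [object_id]
    while frontier:
        current = frontier.pop()
        for parent in member_of.get(current, ()):
            if parent not in expanded:          # a DAG has no cycles, but guard anyway
                expanded.add(parent)
                frontier.append(parent)
    return expanded

# Example: user u1 is in g1, and g1 is nested in g2 -> u1 transitively belongs to g1 and g2.
groups = {"g1": {"u1"}, "g2": {"g1"}}
assert expand_membership("u1", groups) == {"g1", "g2"}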

Metadata:

• AAD owned metadata: The tenant has a bunch of attributes that determine how it should be
processed in SPO. It can have the list of Assigned Plans (Standard, Enterprise),
VerifiedDomainName (contoso or fabrikam, etc.), Purchased license counts, and so on. This part
of the metadata is mastered in AAD (Azure Active Directory).
• SPO owned metadata: The location of the tenant – SPO Farm(s), SPO Database(s), DNS and
Routing information, Site subscriptionId that identifies all sites that belong to the tenant (this is
usually the same as the companyId), State of provisioning, Workflows that are currently being
processed on the tenant, and so on.

Content:
These are the actual sites (and all their content) whose siteSubscriptionId == that of the tenant. This can
be spread across multiple databases and even multiple content farms.

This tenant representation is spread across multiple locations in SPO:

• TenantStore – this is the master of all SPO information of a Tenant. It is stored in the first
“default” site created to hold the Tenant’s representation, also called the “FunSite” (short for
Fundamental Site).
• Grid – GridManager (GM) is a component that manages SPO topology, and is the master of the
location information of the tenant – the Farm and Database mappings, clumps, and so on. These
must be in sync with the Content database where the tenant’s funsite exists. Grid also has



request router / DNS related information to route traffic correctly to the farm/DB where the
tenant exists.
• SPODS or SharePoint Online Directory Service – this is a copy of the AAD owned metadata of
this tenant, sync’d from AAD via Forward Sync. The main purpose of maintaining this is to help
AAD scale well by routing traffic to SPODS when possible.
• ContentDB – Has the sites of the tenant, including funsite. A content DB may be shared by
multiple tenants, but we make sure a tenant never sees another tenant’s data since we use the
SiteSubscriptionId to logically partition out a space for the tenant (aka Shard).
• SiteMap (future, Global Lookup Service or GLS): This contains a mapping of the URL or ID of a
site to the actual location (contentDB, SiteId) of that site. In future, this lookup will move to a
new component called GLS.

[This is evolving and information here may be out of date. Talk to specific area owners for details.]

Now, let’s understand how the tenant is actually provisioned in SPO. Many components are involved in
this process, as described in the next 3 sections:

11.2.1 Tenant subscription to M365

• The end customer subscribes through one of our commerce options (portals where the customer
purchases licenses) and, depending on the customer's Geo location, is assigned to one of the
M365 Data Centers
• The global tenant registry for M365 is MSODS (Microsoft Online Directory Service)
• The part of MSODS associated with a specific Geo location/Data center is called an "M365 Service
Instance" (or just "Service Instance"/SI)
• The SI is the primary data source for M365 components: EXO (Exchange), Yammer, SPO, etc.

11.2.2 SPO Traditional Provisioning Flow


At this time, all of the data for a given tenant in a given SI is located within a single content farm, which
is part of a Grid Network. The network today also has a DS (Directory Services) farm. Going forward the
DS Farms are moving to SQL Storage, but that’s independent of the discussion here.



The DS farm is essentially an internal SPO copy of MsoDs. In the picture, we show multiple DS Farms per
network, but we have since moved towards 1 DS Farm per network as our standard topology. There’s
always 1 content farm in a network.

• "SNC" (aka "Forward Sync") is the component responsible for communication between the SI and SPO.
It's represented by a farm-level SP timer job. It:
• #1: Pulls recent changes from the SI
• #2: Communicates with GM to determine which DS farm each change should go to (for new
entries – fulfills a load-balancing role)
• #3: Stores the incoming change in the appropriate DS farm
• #9: Pulls data from SPODS that is ready for publishing back to the SI
• Prepares publications and publishes them to the SI
• By definition a network (NW) can only have a single Snc. In an Active/Active farm configuration,
exactly one of the active farms can run Snc.
• "Prv" is a timer job that services provisioning requests from a single DS farm. It:
• #4: Pulls entries from a DS farm that require a provisioning action, such as onboarding, change,
lockout, or deprovisioning.
• #5: Communicates with Grid Manager (GM) to detect where the affected tenant resides.
• Schedules an async work item to be executed by the Tenant Workflow Engine.



• "WFE" – Tenant Provisioning Workflow Engine. It does all the internal SharePoint work to reflect a
provisioning request for a tenant.
• Workflows that are executed through this pipeline:
o ProvisionTenant – onboarding request for any tenant.
o Restamp – change request (e.g., when the tenant buys new licenses, has a name
change, etc.)
o DeleteDNS – lockout request (when the tenant stops paying for SPO and is suspended
for 30 days and we are now in the last stage before eventual deletion).
o DeleteTenant – deprovisioning request
• #7: If WFE execution requires publishing back to the SI, it is done from WFE (see #7).
• Prv is also responsible for user provisioning and does it synchronously upon receiving a request.
Once it completes, it updates the user entry in SPODS, marking it as ready for publishing back. As
discussed above, Snc fulfills such publishing via #8 and #9.
• MultiGeo tenants have 1 instance of the tenant per GEO (EMEA, US, FRA, etc.), and each instance
is associated with 1 SI and lives in 1 Content Farm. These instances can live independently but are
managed together by AAD, since they all share the same Tenant CompanyID but have a different
Tenant InstanceID per GEO. Details about this are beyond the scope of this document.

11.2.3 SPO Instant-On Provisioning Flow


This is a fast provisioning path that bypasses forward sync delays, prv pickup wait times and the
asynchronous workflow engine execution delays, to help new tenants signing up for SPO get a great first
run experience. Instant-On has 3 main parts:

• In the sign up flow, Commerce directly enqueues the tenant provisioning package (that includes
all the “properties” of the tenant that Prv/WFE would need to provision the tenant) into Azure
Service Bus (ASB).
• Then we have a timer job that runs in some USR and BOT machines in the content farm that
directly pulls those ASB packages and provisions the corresponding tenants synchronously in-
process with no queuing, and publishes information back to MSODS.
• In addition, we also make calls (with 3-minute timeouts) to invalidate any tenant status data
so that Office UX caches get the latest information from AAD for a tenant that's been
provisioned through Instant-On.

During this process, we make sure that if the tenant happens to land via forward sync faster than
Instant-On (it can happen, though rare), they don’t stomp on each other and we avoid races.

Instant-On is best effort for new tenant provisioning. It does not handle other lifecycle events like
changes or deprovisioning: they still flow through forward sync. We ship a NuGet package that our partners can
download and use to communicate through ASB and send us provisioning requests. This is used by
Commerce today and we may offer this to other partners if there’s a business need (e.g., GoDaddy,
other resellers, etc.).
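
As a rough illustration of the Commerce side of this flow, the sketch below enqueues a hypothetical provisioning package into an Azure Service Bus queue using the azure-servicebus Python SDK. The queue name, payload shape, and connection handling are assumptions; the real partner integration goes through the NuGet package mentioned above.

import json
from azure.servicebus import ServiceBusClient, ServiceBusMessage

# Hypothetical provisioning package - the real schema is defined by the SPO partner package.
package = {
    "companyId": "<tenant-company-id>",
    "verifiedDomainName": "contoso",
    "assignedPlans": ["Enterprise"],
}

def enqueue_provisioning_request(connection_string, queue_name):
    """Send one tenant provisioning package to ASB (illustrative only)."""
    with ServiceBusClient.from_connection_string(connection_string) as client:
        with client.get_queue_sender(queue_name) as sender:
            sender.send_messages(ServiceBusMessage(json.dumps(package)))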



11.3 Multi-Geo
Multi-Geo is a SharePoint environment in which a single tenant has multiple geo instances. It provides
customers with the ability to expand their presence to multiple geo locations within a single existing
Tenant. Multi-Geo enables customers to manage their data locations at a granular level for SharePoint
sites, users and groups. It allows customers to store their data in multiple geographies to satisfy their
data residency requirements and move it as those requirements change over time.

Multi-Geo is needed for following reasons:

• Data residency: many countries require data to reside in their geo location to avoid data
crossing their borders.
• Request turnaround time is lower, improving performance and user experience.

The following are key terms related to multi-Geo:

• Tenant – Customer's representation in SharePoint Online.


• Geo location – Multiple regions or instances associated with a multi-Geo tenant.
• Preferred Data Location (PDL) – A property set by the Azure AD administrator for the user or
group object that SPO uses to provision corresponding data-at-rest resources like OneDrive,
group sites etc.



11.3.1 Architecture
Instances in different geo locations will have their own Tenant Instance Id mapped to the geo location.

Tenant instances in a multi-Geo environment are divided into two categories based on the geo location:

1. Default Instance – Tenant instance in the geo location where tenant subscription was originally
provisioned.
2. Satellite Instances – One or more tenant instances in the geo locations configured by Tenant
Administrator to satisfy their data residency requirements.

The tenant instance Id with geo location mapping is stored in the tenant store.

Multi-Geo is currently offered in the regions mentioned here.

11.3.2 User Provisioning


Each user has a Preferred Data Location (PDL) which denotes the geo location where the user’s personal
data (Exchange mailbox and OneDrive) along with any Microsoft 365 Groups or SharePoint sites that
they can create are stored to meet data residency requirements.

Each user's OneDrive can be provisioned or moved by an administrator to a satellite location according
to the user's PDL. Personal files are kept in that geo location, though they can be shared with users in
other geo locations.

Users are created by an administrator from the Microsoft 365 Admin Center. A user's OneDrive is created
automatically at the time of their first login. By default, users are provisioned in the default tenant location,
but they can be provisioned in a satellite location as well by setting the PDL before their first login.

11.3.3 Move a SharePoint site to a different geo location


SharePoint site geo move allows customers to move SharePoint sites to other geo locations within their
multi-geo environment.

The following types of sites can be moved between geo locations:

• Users' OneDrive for Business sites.



• Microsoft 365 group-connected sites, including sites associated with Microsoft Teams.
• Modern sites without a Microsoft 365 group association.
• Classic SharePoint sites.
• Communication sites.

SharePoint site geo move is an operation initiated by Tenant Administrators by connecting to the
SharePoint Admin URL.

The Site Move workflow for a SharePoint site involves 5 phases:

1. Initialize – When the move operation is initiated by the Tenant Administrator, it is added to the
Pending Move SPList (PDLChangedList). The OdbMoveSchedulerJob timer job runs every 5 minutes,
picks up the entries in PDLChangedList, and queues them to execute the remaining phases of the
workflow.
2. Backup – After picking up an item from PDLChangedList, OdbMoveSchedulerJob processes the move
job by starting the backup phase. In the backup phase the site state is changed to ReadOnly to
avoid any updates, its metadata is backed up into Azure Blob, and then a cross-farm API call is
made to start the restore operation in the target farm. An entry is added to UserMoveWorkList
to initiate the move job.
3. Restore – In the restore phase the site is created on the target farm with the same properties as
in the source farm. After the restore workflow finishes, the target farm makes a cross-farm API
call to the source farm to notify it of the completion of the restore phase.
4. Cleanup – In the cleanup phase the original site is deleted in the source farm and a new redirect
site is created to redirect any requests to the old URL to the new site in the target farm.
5. Finalize – In the final phase of the Site Move workflow the move job entry is added to
UserMoveCompleteList and removed from both PDLChangedList and UserMoveWorkList to
denote the completion of the move operation. It also makes a cross-farm API call to move the
item to UserMoveCompleteList in the target farm. Entries in UserMoveCompleteList are
automatically cleaned up after 180 days.



Each of these Site Move workflows creates several intermediate work items to perform various actions.
These work items are stored in the Content Database; Tenant Workflow timer jobs pick them up and
process them on a Tenant Workflow thread.
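
The sketch below strings the five phases together in Python to show the sequencing and bookkeeping; the Farm stand-in and its lists mirror the names used above, while everything else (method behavior, printing) is a placeholder rather than the real workflow engine.

from dataclasses import dataclass, field

@dataclass
class Farm:
    """Tiny stand-in for a content farm; only what the sketch needs."""
    name: str
    pdl_changed_list: list = field(default_factory=list)
    user_move_work_list: list = field(default_factory=list)
    user_move_complete_list: list = field(default_factory=list)

def run_site_move(site_url, source, target):
    """Drive one site geo move through its five phases (illustrative only)."""
    # 1. Initialize: the admin request lands in PDLChangedList for the scheduler job.
    source.pdl_changed_list.append(site_url)
    # 2. Backup: site goes ReadOnly, metadata is backed up to Azure Blob, work item recorded.
    source.user_move_work_list.append(site_url)
    print(f"backing up {site_url} from {source.name} (site set ReadOnly)")
    # 3. Restore: the target farm recreates the site with the same properties.
    print(f"restoring {site_url} on {target.name}")
    # 4. Cleanup: the source deletes the original site and leaves a redirect behind.
    print(f"cleanup on {source.name}: delete original, create redirect site")
    # 5. Finalize: record completion on both sides (auto-cleaned after 180 days).
    for farm in (source, target):
        farm.user_move_complete_list.append(site_url)
    source.pdl_changed_list.remove(site_url)
    source.user_move_work_list.remove(site_url)

run_site_move("https://contoso.sharepoint.com/sites/hr", Farm("EUR_1"), Farm("CAN_2"))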

11.3.4 Multi-Geo Tenants Metadata Management


Properties of a tenant in SharePoint Online are multi-Geo aware, i.e., metadata that is defined for the
default geo location of a multi-Geo tenant automatically replicates to the tenant's satellite locations.

• Cross Geo Tenant Store Replication – The tenant store is used to store tenant-related settings. For
multi-Geo specific properties the Cross Geo Tenant Store is used. The
SPCrossGeoTenantStoreReplicationJobDefinition timer job replicates cross-geo tenant
store properties across the different tenant instances.

Normal tenant store entries are name-value pairs, for example:

Name - random2
Value - 98659480-c632-4ebc-8d0d-77f4c0c49ae5

Cross Geo Tenant Store properties follow a specific naming pattern,
xgeo:[key]:[geo]:0e6c74c1-9920-47ab-81bb-80a268fcabda (see the small helper sketch after this
list). A name-value pair looks like:

Name - xgeo:georegularsitecount:can:0e6c74c1-9920-47ab-81bb-80a268fcabda

Value - {"Value":"41939","LastModifiedTimeInUtc":"2022-12-24T11:15:38.0941725Z","IsDeleted":false}

• Taxonomy Replication – A TermStore contains zero or more Group objects, which are used to
organize Terms within TermSets. The MultiGeoTaxonomyReplicationJobDefinition timer job
replicates term store settings from the default tenant instance to all the satellite instances.
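
To illustrate the naming pattern above, here is a small Python helper that composes and parses cross-geo property names; the field layout follows the example, while the helper itself is hypothetical.

def build_xgeo_key(key, geo, instance_id):
    """Compose a cross-geo tenant store property name: xgeo:[key]:[geo]:[tenant instance id]."""
    return f"xgeo:{key}:{geo}:{instance_id}"

def parse_xgeo_key(name):
    """Split a cross-geo property name back into its parts."""
    prefix, key, geo, instance_id = name.split(":", 3)
    assert prefix == "xgeo", "not a cross-geo tenant store property"
    return {"key": key, "geo": geo, "instanceId": instance_id}

name = build_xgeo_key("georegularsitecount", "can", "0e6c74c1-9920-47ab-81bb-80a268fcabda")
print(name)                  # xgeo:georegularsitecount:can:0e6c74c1-9920-47ab-81bb-80a268fcabda
print(parse_xgeo_key(name))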

11.3.5 Features using multi-Geo Tech


• Consumer Data Residency – Consumer Data Residency uses multi-Geo tech to move consumers'
personal data based on the access pattern in the last 6 months.
• Cross Tenant OneDrive Site Move – With Cross Tenant OneDrive Site Move, customers can
move users' OneDrive sites to a different tenant within Microsoft Office 365.



Chapter 12 Capacity management and Disaster Recovery
12.1 Compute capacity management
SPO capacity management includes areas of Compute, SQL, Blob, Redis and network capacity which
must all be aligned to support SPO services. This section focuses on compute capacity management.

12.1.1 Compute capacity life cycle


SPO compute capacity (bare metal) goes through the following stages in its life cycle.

Forecasting and ordering new hardware

SPO compute capacity forecasting is based on the following data:

1) Projected demand: DAU Forecasting of user growth for each geo, measured by DAU (Daily
Active Users)
2) Engineering efficiency: COREs/KDAU (cores per 1000 Daily Active Users) for each geo and each
hardware SKU
3) Usable capacity per zone pair (hardware order unit) for each hardware SKU

As explained in 3.1 Physical Topology, a zone consists of a fixed number of servers on a fixed number of
racks for each hardware SKU, connected via an Agg Router. To support SPO DR (Disaster Recovery), we
always order zones in pairs with each zone pair located in two different data centers physically apart
from each other.

Based on 1) and 2), we can calculate the number of cores needed each month for the next 12+ months.

Based on core forecasting and 3), the number of zone pairs is calculated for each geo to determine the
hardware order for the next fiscal year. The Capacity PM (Product Manager) will order new hardware
including a 6-month buy-ahead buffer (reflected in the Incoming Hardware dashboard).

Forecasting and hardware ordering also include the capacity needed to replace the old hardware to be
decommissioned.

New capacity going live



After hardware ordering, new capacity goes through the following phases to be available for SPO
services:

Hardware dock → RTEG → RTGM → Capacity allocation → Farm deployment

• New hardware (zone pair) dock: hardware lands in Microsoft data centers
• RTEG (Release to Engineering): Azure team will finish the networking and other basic setup of
the zone pair and hand over to SPO
• RTGM (Release to Grid Manager): SPO Fleet Management team will finish configuration of the
zone pair and get it ready for new farm deployment
• Capacity allocation: SPO capacity team decides what the new zone pair capacity will be used for
o New farm deployment
o Standby farm deployment as network move target
o GridManager and infrastructure farms migration
• Farm deployment: SPO deployment team will deploy the farms based on capacity allocation
requirements and release the new farms as Sev1 enabled farms

Farm open/close

After new farms are handed off from the farm deployment team as Sev1 enabled farms, the capacity team's
farm open/close automation will open the farm for new tenant provisioning and use the farm as a tenant
move target as well. When the farm utilization reaches a predefined threshold, farm open/close
automation will close the farm to block new tenant provisioning to this farm. We can also manually
open/close farms for new tenant provisioning and exclude particular farms as tenant move targets to
meet special business needs.

Capacity balancing and GoLocal moves

To effectively utilize compute resources and meet GoLocal business needs, we move tenants across
farm labels (stamps) within a geo for load balancing and across geos for GoLocal requests. These are
discussed in depth in 12.1.4 Capacity balancing and 12.1.5 GoLocal moves.

At this stage, we also have an Auto Capacity mechanism to help maintain active VMs to meet the
expected farm goals, discussed in 12.1.3 Content farm size definition and auto capacity.

Decommission and warranty management

SPO bare metal hardware lifetime is 5.5 years by default. The end-of-life (EOL) date of a zone is based on
zone In-Service-Time (start time in service) plus 5.5 years. For a zone pair, the In-Service-Times of the two
zones are usually very close. We use the earlier one to determine the EOL date of the zone pair.

Here are the key tasks for decommission:

• Zone pair EOL management and hardware warranty management


• Capacity planning and allocation for decommission (Compute, SQL, Blob)
• Moving tenants from old zone pairs to new zone pairs using tenant move or network move
• Old farm teardown



• Grid Manager and Infrastructure farms migration (if hosted in the zone pairs to be
decommissioned)
• Zone pair logical decom and physical decom

In special situations when we have to keep old zone pairs for an extended period of time, there is a
process to extend the hardware warranty and update the zone pair EOL date to a later time. This is part
of EOL and warranty management. Normally we do not extend zone EOL to more than 6 years.

12.1.2 SPO compute capacity utilization


SPO compute capacity is mainly allocated for:

• Content farms: USR and BOT VMs


• SPODS farms: DS VMs
• Grid Manager / Regional Manager farms: GFE VMs
• Infrastructure and InfraCore farms: infra roles such as DSE listed in 3.3 Virtual Machine Topology
• Search farms (being migrated to Substrate)

The overall capacity utilization can be visualized in the following diagram (from Rebecca's 2022-09
Capacity Monthly Review.pptx).

Sellable capacity is the compute power measured by USR VM cores that we can use to support user
load.

“Buffer” is the 6-month buy ahead buffer as part of capacity ordering process to deal with uncertainty of
hardware arrival delay. It is also used to handle potential perf regressions. The “Actual” part refers to
the USR cores used for the current user load. “Waste” is the capacity not being used yet excluding the
buy ahead capacity. “Buffer” + “Waste” is the capacity available for future user load growth. Overall,
“Buffer” + “Waste” + “Actual” is the sellable capacity that we can use for customer traffic.

Engineering Reserves is the remaining capacity allocated for SPO internal infrastructure (BOT, SPODS,
GM farms, Infra roles) and maintaining service reliability (CB, HW buffer, DR, etc.).

While USR VMs serve user traffic, BOT VMs are used for internal jobs (see 3.2 Service Topology). Each
USR VM uses 16 cores, and each BOT VM uses 8 cores. For each content farm, USR VM count : BOT VM
count ratio is 2.5 : 1. From a capacity point of view, the USR cores : BOT cores ratio is 5 : 1.

SPODS is the SPO Directory Service designed for SPO services and connected with AAD (Azure Active
Directory), see Active Directory SPODS. We allocate 5 SPODS VMs for each content farm (moving
towards 4 VMs per farm). Each SPODS VM uses a full Physical Machine regardless of SKU.

Grid Manager farms, Infrastructure farms and InfraCore farms each have specific capacity requirements, but
they only need a small percentage of compute capacity compared to the overall SPO capacity.

A significant amount of capacity is used for SPO service reliability. For all sellable cores, we keep the
same amount of capacity to support Disaster Recovery (DR).



To manage unhealthy USRs, SPO uses Circuit Breaker (CB) to take unhealthy VMs out of rotation
automatically. After the VMs are in healthy state again, CB will put the VMs back to rotation (see more
in Handling USR Health – Circuit Breaker). For capacity management, we allocate a pre-defined
percentage of USR capacity per farm to support CB.

For each USR VM, we can use up to x% CPU as full capacity (x is 65% as of Sept. 2022, moving towards
70%); this is what the "30% USR" category in the Engineering Reserves refers to.

SPO has on-going investments to reduce the capacity for Engineering Reserves, which will increase
utilization rate of compute capacity.

12.1.3 Content farm size definition and auto capacity


For each hardware SKU, there are multiple types of zone pairs:

• Full size or half size content zone pairs without search farms
• Full size or half size search zone pairs with search farms only
• Full size or half size mixed zone pairs with both content and search farms

Content zone pairs may contain Grid Manager farms and Infrastructure farms for infra role VMs.

Half size zone pairs are normally used for small GoLocal geos with low usage. The majority of compute
capacity uses full size zone pairs. After search farms are migrated to Substrate, search zone pairs and
mixed zone pairs will go away. We’ll use full size content zone pairs to discuss content farm size
definition. Other cases are variants of it with minor changes.

We use a VM slot as a unit for compute capacity. Each VM slot represents 8 cores. Here’s the capacity
used by each SPO VM role (reference 3.3 Virtual Machine Topology).

Roles (VM) Capacity


USR 2 VM slots (16 cores)
BOT 1 VM slot (8 cores)
SPODS 1 full PMachine, 5 VMs per farm (moving to 4 VMs per farm)
DFR 6 VM slots, 2 VMs per zone
DSE 3 VM slots, 2 VMs per zone
WSU 2 VM slots, 2 VMs per zone
WDS 2 VM slots, 1 VM per data center
MOR 1 VM slot, 1 VM per regional manager
TMT 2 VM slots, 1 VM per regional manager
FFT 6 VM slots, 2 VMs per regional manager
Note: SQL farms use Azure SQL DBs, not SPO compute capacity. Similarly, blob farms use Azure storage capacity, not
SPO compute capacity either.

To calculate content farm size, i.e., the number of USRs and BOTs per farm, we use the following formula:
Capacity for a content farm (VM slots) = (zone VM slots – hardware failure buffer – non-content farm capacity cost) / #
of content farms per zone

BOTs = Capacity for a content farm (VM slots) / 6 (where 1/6 capacity for BOTs and 5/6 capacity for USRs)

USRs = BOTs * 2.5

hardware failure buffer: 5% of zone capacity



non-content farm capacity cost: infra roles + SPODS

# of content farms in a full size zone: default is 6

Ideally, we would like to have a standard farm size across SKUs. We used to define standard farm size as
160 USRs and 65 BOTs based on WCS Gen5. Since different SKUs have different numbers of cores per
zone, we now define farm size specific to each SKU to fully utilize zone pair capacity.
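
As a worked example of the sizing formula, the sketch below computes USR/BOT counts for one farm. The buffer percentage and ratios come from the text; the non-content-farm slot cost is a placeholder, since the real value varies by SKU and topology.

def farm_size(zone_vm_slots, non_content_cost_slots, farms_per_zone=6, hw_failure_buffer_pct=0.05):
    """Compute USR/BOT counts per content farm from zone capacity (illustrative only)."""
    usable = zone_vm_slots * (1 - hw_failure_buffer_pct) - non_content_cost_slots
    per_farm_slots = usable / farms_per_zone
    bots = per_farm_slots / 6          # 1/6 of the slots go to BOTs (1 slot each)
    usrs = bots * 2.5                  # USR:BOT count ratio is 2.5:1 (USRs take 2 slots each)
    return round(usrs), round(bots)

# Placeholder inputs loosely modeled on a 2880-slot full-size content zone.
print(farm_size(zone_vm_slots=2880, non_content_cost_slots=400))   # roughly (162, 65)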

Current farm size definition for each SKU


Sku ZoneType PmCount VmSlotPerPm VmSlotsPerZone FarmCount USR BOT
Gen9 content-full 480 6 2880 6 172 69
WCS5 content-full 288 10 2880 6 162 65
WCS6-103 content-full 256 13 3328 6 185 74
WCS6-104 content-full 240 13 3120 6 172 69
WCS7-103 content-full 256 13 3328 6 185 74
WCS7-104 content-full 224 13 2912 6 157 63

Farm size is subject to change depending on engineering efficiency improvements, such as hardware
failure buffer reduction and SPODS VM reduction.

Auto capacity job: the Auto Capacity job has two major functionalities.

1. Manage farm goals (USR and BOT goals): auto capacity monitors farm goals and fixes incorrect
ones if the two farms of a farm pair have different farm goals. When the farm size definition changes
(e.g., from SPODS VM reduction), auto capacity will update the farm goals of all production farms.
2. Keep the number of USR and BOT VMs matching the farm goals. USRs and BOTs might die for
different reasons. The physical servers on which these VMs are created could also turn
unhealthy. When the USR and BOT VM counts are below the farm goals, the auto capacity job will
automatically deploy new VMs to meet the farm goals.

12.1.4 Capacity balancing


Multiple reasons may cause different farms to have different user loads:

• Tenants grow at different speeds at different times


• Some tenants have load changes during a particular season, especially EDU tenants which have
a usage spike during the back-to-school (BTS) season
• Some tenants may have high usage in special situations, e.g., launching a new portal, Olympic
games, and COVID-19 which caused many companies and schools to go online with a lot more
usage on SPO services
• New farms are created empty with user load growing gradually
• For old farms in decommissioning phase, user load decreases gradually till fully empty when
using tenant move

Over time, some farms may have user load going closer to the farm capacity (limited by the number of
USRs). Capacity balancing will move some tenants from these farms (“hot farms”) to “colder farms” with
lower usage so that we can keep user load below farm capacity.



The following diagram shows the basic idea of how auto capacity balancer works. When farm capacity
utilization goes over a threshold (70% in this case), the capacity balancer will automatically pick the
hottest farm and move the database with maximum user load to the coldest farm. After this move
(Iteration 1), the farm user load may still be above the threshold. Capacity balancer will pick the next
hottest farm for balancing (Iteration 2).

Capacity balancer is capable of handling multiple criteria using linear optimization, e.g., considering SQL
cost for the moves besides reducing USR load in hot farms. The execution for balancing is not always
sequential. Capacity balancer generates move plans for multiple batches based on the above logic, while
tenant move batches normally run in parallel.
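
A greatly simplified sketch of the greedy idea described above; the real planner also weighs SQL cost via linear optimization and emits batched tenant-move plans rather than moving databases directly.

def plan_balancing_moves(farms, threshold=0.70, max_moves=10):
    """Greedy capacity-balancing plan (illustrative only).

    farms: {farm_name: {"capacity": cores, "dbs": {db_name: load_cores}}}
    Returns a list of (db_name, source_farm, target_farm) moves.
    """
    def utilization(name):
        farm = farms[name]
        return sum(farm["dbs"].values()) / farm["capacity"]

    moves = []
    for _ in range(max_moves):
        hot = max(farms, key=utilization)
        cold = min(farms, key=utilization)
        if utilization(hot) <= threshold or hot == cold:
            break
        # Move the heaviest database from the hottest farm to the coldest farm.
        db = max(farms[hot]["dbs"], key=farms[hot]["dbs"].get)
        farms[cold]["dbs"][db] = farms[hot]["dbs"].pop(db)
        moves.append((db, hot, cold))
    return moves

farms = {
    "US_201": {"capacity": 100, "dbs": {"Content_1": 50, "Content_2": 30}},
    "US_202": {"capacity": 100, "dbs": {"Content_3": 20}},
}
print(plan_balancing_moves(farms))   # [('Content_1', 'US_201', 'US_202')]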

While tenant move is the major solution for capacity balancing, we also leverage farm open/close
mechanism to reduce tenant moves for balancing. When farm utilization reaches a predefined level,
farm open/close automation will automatically close those farms so that new tenants are provisioned in
low load farms, especially the newly deployed farms.

Capacity balancing for BTS season

Based on historical data, we expect load increases for EDU tenants during the back-to-school (BTS)
season. The capacity balancer calculates the expected load for the BTS season and starts proactive balancing a
few months ahead of time. During the BTS season, we are in reactive mode to balance unexpected load
spikes.

Capacity protection and emergency moves

Like SQL capacity protection, we have compute capacity protection. We can define protection at either
tenant level or database level. Based on the expected user load growth (spike), if there is a risk that the
farm load will go over farm capacity, we will either move the protected tenant or DB to a low load farm
or move other tenants (DBs) out of the farm. There are multiple criteria to make the decision of which
tenants and DBs to move. Normally we avoid moving large tenants with a lot of users due to the time it
takes to finish the moves.

Compute protection may require emergency moves when we get information too close to the date of
expected user load spike. There are other situations which require emergency moves, such as capacity



shortage in a particular geo, or when partner teams (Azure Storage or Azure SQL) hit capacity issues.
The move criteria could be blob storage capacity or SQL capacity shortage instead of compute capacity
shortage. We may suspend or roll back some tenant move batches in flight to speed up execution of
emergency moves.

12.1.5 GoLocal moves


Tenants in Macro-Geos like EMEA, US and APAC can opt in for moves to GoLocal geos like Great Britain,
France, Japan, Canada and more based on their business locations. These move requests are managed
at the O365 level. All Microsoft workloads receive a signal for such GoLocal moves.

For a GoLocal tenant in a shared DB, we must split the opt-in tenant out of the original shared content DB
to a new content DB so that we can move it to the GoLocal geo separately without impacting the other
tenants in the shared DB. After the DB split, we rely on tenant move to complete the move from the
source farm in the Macro-Geos to a farm in the GoLocal geo.

For a dedicated GoLocal tenant, DB split is not needed. We move all dedicated DBs of the tenant to the
GoLocal geo in a single tenant move batch.

Current SLA for GoLocal move is 24 months after the opt-in window is closed. After opt-in window is
closed for a GoLocal geo, there can be ad hoc special requests for GoLocal moves. We handle those
requests on demand.

GoLocal Playbook

GoLocal geo capacity management is challenging due to a lot more uncertainty of user load and limited
resources (SPO compute capacity, Azure storage and Azure SQL capacity). In case a GoLocal geo is
running out of capacity, we’ll follow the process of M365 Go-Locals Playbook (PM owner: Rebecca
Gee) to mitigate the capacity shortage. The playbook covers:

1. Capacity evaluation criteria


2. Capacity mitigation steps
3. Legal commitment (CELA)
4. Customer priority, impact, notification, and geo-mapping
5. Executive communications
6. Post-constraint recovery: when we need to move committed customers back, what's the priority, SLA and
communication.

To support the process, we keep track of tenants that we can potentially move out of each GoLocal geo
in case of emergency, including tenants' compute, blob and SQL usages. With this data at hand, we can
make decisions quickly about which tenants we should move out under specific capacity constraints.

12.1.6 Tenant move and network move


Tenant move (TM) is a gesture to move tenants from one content farm to another content farm.
The tenant move unit is a batch, which is either a shared database with multiple tenants or a dedicated
tenant with one or more dedicated databases.



Tenant move workflow includes the following key phases:

TM Plan Generation → PreStage → Move → Cleanup (with Rollback as the fallback path)

TM plan generation is the starting point of tenant move automation for capacity balancing, GoLocal
moves, and ad hoc TM requests. After the TM plan is created, the TM execution goes through the
following phases.

• PreStage: prepares the move which mainly focuses on 1) dual syncing tenant’s AAD data into
both the source farm and the target farm 2) continuously copying tenant’s SPO content from the
source farm to the target farm
• Move: flips the tenants from the source farm to the target farm. The flip usually happens during
the off-peak hours of the tenant's Geo
• Cleanup: removes the tenants and corresponding content DBs from the source farm
• Rollback: moves the tenants back to the original source farm before cleanup starts, in case something
goes wrong in the middle of the PreStage or Move phase

Cross-RM TM: to support moving tenants across Regional Managers (RM), TM needs to replicate grid
metadata for the tenants from the source regional manager to the target regional manager. It requires
additional logic to communicate between farms in two different Regional Managers. This is accomplished
by building cross-RM technology on top of the service bus mechanism provided by the Grid Manager team.

TM ReadOnly time SLA: under 5 minutes for 99th percentile

Tenant move ReadOnly time happens in the Move phase, which includes the following key tasks:

• Failover Azure SQL DB from source farm to target farm: < 2 minutes
• Restamp tenants: < 2 minutes
• GMUpdate: < 1 minute

Failover: changes primary DB from source farm to target farm, makes the content DB read-write in
the target farm and read-only in the source farm

Restamp: updates the tenant's DNS to point to the clump in the target farm

GMUpdate: updates grid metadata to switch the tenants from the source farm to the target farm
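
A schematic of the read-only window in the Move phase, using the per-step budgets above; the step callables are placeholders for the real failover, restamp, and GM update operations.

import time

STEP_BUDGET_SECONDS = {"failover": 120, "restamp": 120, "gm_update": 60}
READONLY_SLA_SECONDS = 300   # under 5 minutes at the 99th percentile

def flip_tenant(steps):
    """Run the Move-phase steps in order and return the read-only time (illustrative only).

    steps: {"failover": fn, "restamp": fn, "gm_update": fn} - placeholder callables.
    """
    start = time.monotonic()
    for name in ("failover", "restamp", "gm_update"):
        step_start = time.monotonic()
        steps[name]()
        if time.monotonic() - step_start > STEP_BUDGET_SECONDS[name]:
            print(f"warning: {name} exceeded its budget")
    readonly_time = time.monotonic() - start
    if readonly_time >= READONLY_SLA_SECONDS:
        print("read-only SLA exceeded")
    return readonly_time

print(flip_tenant({"failover": lambda: None, "restamp": lambda: None, "gm_update": lambda: None}))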

Network move (NM) is a gesture to move all tenants in a content farm from one zone to another
zone, typically to move a farm label from one zone pair to another zone pair.

Here is a high-level view of how network move works:



Network move starts from standby farm deployment. Standby farms are built based on source farm
configurations including farm label, farm goals, SQL servers, pools, GeoDR DBs (continuous copy from
primary DBs in the source farm), as well as the setup for Redis cache and blob. During attach phase, the
standby farm will replace the old farm to become a new active farm in the Active-Active farm pair. Farm
failover will move user traffic from the old farm to the new farm. Old farms will be torn down in the end.

There are two scenarios for network move, one is to move farms within Azure region and the other is
cross region move.

• Same-region network move dismounts the SQL servers and DBs from the old farm and remounts
them to the new standby farm, which avoids setting up GeoDR continuous copy and saves SQL
COGS during the move.
• Cross region move must create GeoDR DBs in the other region. The databases in the old farms
will be dismounted and removed after the move.

12.2 Active-Active
What is Active-Active: In Active-Active architecture, two independent systems actively run the same
service simultaneously. User traffic goes to both systems. In case one system is down, user traffic is
redirected to the other system. After the unhealthy system is recovered, user traffic is balanced
between the two Active-Active systems again.

12.2.1 Active-Passive model vs. Active-Active model


SPO adopted Active-Passive (Primary-DR) model in the past as illustrated at the left side of the diagram
below. Active-Passive model has several problems:

• passive system (recovery farm) is not fully validated (no RW traffic)


• 50% capacity is not being utilized
• failover is at the full farm level, which can cause customer downtime even for customers in
healthy DBs in the same farm as the problem DBs



The Active-Active model, shown at the right side of the diagram, addresses the Active-Passive model's
problems.

Figure 12-1 Active-Active Farm design

Benefits of SPO Active-Active architecture


Reliability: In Active-Active model, both systems (SPO farms) are actively serving user traffic with full
monitoring and validations. Each farm serves about half user traffic with capacity allocated to support
full user load in case of one farm failure. As a result, there is more capacity buffer in each farm to handle
user load spikes.

Performance: By utilizing the capacity allocated to both farms, the user traffic is split between the two
farms, which results in lower CPU usage on the web servers, which in turn leads to better performance.

Failover granularity: in SPO Active-Active model, databases are organized into database clumps. In case
one DB or a limited number of DBs have health issues, we can failover individual clumps to the other
farm instead of doing full farm failover. This will avoid customer impact on healthy clumps.

12.2.2 Active-Active design


In Active-Active model, SPO farm pairs still exist. Content databases in the farm pairs are still in Primary-
Recovery mode. Some content databases are active in Farm A and some are active in farm B. Each farm
serves content from its primary content DBs.

Databases are organized into database clumps. A DB clump is the unit of failover and routing. The
diagram below shows the relationship between Azure SQL databases, SPO logical DBs and DB clumps.



A logical DB can be thought of as the global storage point for data. For example, Content_123 might
have a primary and a mirror and a log replay and a standby log replay, but there is only 1 Content_123
logical DB which all those point to. Logical DBs only exist for Content and DedicatedContent databases.

By default, a database clump will have up to 5 logical databases. However, no tenant can span multiple
DB clumps which means that some clumps have more databases because some tenants have more than
5 dedicated databases. For example, the MSIT tenant in the ProdBubble MSIT_US_1_Content farm has
over 1000 DBs in a clump and the Accenture tenant in the US_201_Content farm has over 500 DBs in a
clump.

12.2.3 Clump balancing


To keep user load evenly distributed between the two farms of an Active-Active farm pair, we have an auto
clump balancer running for each farm label that balances the workload during off-peak hours.

Clump balancer evaluates user load of each clump based on the DB level user load from compute
capacity management. The balancer has multiple balancing strategies and tries to apply the best
balancing strategy for clump failover. Since the user load is not evenly distributed among content DBs
and clumps, we cannot reach a perfect 50:50 distribution ratio. Clump balancer’s job is to find the
optimal solution to minimize the user load difference between the two farms of each farm pair.

Balancing configuration: Due to live site incidents or special business needs, we may have to disable
auto clump balancing. Here are the basic rules:

• If auto failover is disabled, auto clump balancing is automatically disabled as well


• After a farm level failover, auto clump balancing will not start until 48 hours later
• Auto clump balancing can be manually disabled / enabled at environment level or farm level
• To avoid massive clump failovers for clump balancing after a large-scale failover, we set a daily
balancing limit of 30 farms per day, which is configurable

Special balancing: besides the default 50:50 balancing rule, we support special balancing requirements:

• 100:0 and any ratio balancing


• Preferred location (data center) for a given tenant
• Keeping specific test tenants on the opposite side from the main traffic (in SPDF)

Balancing history: To help investigate live site issues, clump balancing history is stored in the
SPOReports database and surfaced through a FarmBalanceReport dashboard.

12.2.4 Clump health


To keep Active-Active in healthy state, we have a comprehensive clump health monitoring and alerting
mechanism. Here is a list of clump health problem areas:
• Tenant DBs Spanning Clumps
• Mismatch between DB and Clump Farm Pair
• Tenant Wrong ClumpId
• Tenant Not In Clump
• Empty Clump Not Cleaned
• Unable to Calculate DB Farm Pair
• Single-sided DB
• Multiple Open LDB Histories for LogicalDb
• Failed to Update Tenant Clump DNS
• Farm Label has less than 2 clumps with XamHb tenants
• Farm Label has less than 5 XamHb Tenants
• Farm Rebalancing Blocked
• Tenant in Non-MTE Clump
• LogicalDb in Moving State
• MoveLogicalDb failed to Update Tenant Clump DNS
• DB not on LogicalDb
• DB Farm Pair mismatch with Clump Farm pair

For cases with impact on failover, alerts are integrated into DR readiness. The clump health monitor also
has auto-heal ability to fix a set of known problems.



12.3 Disaster Recovery
Disaster recovery (DR), also known as Failover, is a process to move workload from a failed or unhealthy
set of resources to a healthy set of resources to help keep the SPO service available. The scope of a
failover can be as little as a few databases, or as large as an entire region which can sometimes consist
of hundreds of farms. While large-scale failovers happen only a couple of times per year, the smaller ones
happen daily. On average, nearly 10,000 databases per month are failed over in the SPO service for
various reasons. Most of the time a failover will only cause a couple of minutes of read-only time for the
customer.

12.3.1 Key Metrics for DR


The success of the SPO DR system is measured by two important metrics:

• RTO (recovery time objective) is the time it takes to restore the availability of SPO starting from
when availability was lost until the customer regains full access to their content. In most
months, SPO can restore availability using failover in less than 30 minutes at P95 of the
databases.

• RPO (recovery point objective) is the amount of data loss sustained if there were a failover. The
data loss is caused by the data replication lag between geo locations. SPO keeps this number as
small as possible. If the RPO of a database becomes larger than 30 minutes, an escalation will be
fired. If the RPO of any database is more than 60 minutes, failover will not be allowed to
proceed.
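
As a sketch of how these metrics gate a failover decision (the 30- and 60-minute thresholds are from the text; everything else is hypothetical):

from datetime import datetime, timedelta, timezone

RPO_ALERT = timedelta(minutes=30)     # an escalation fires above this lag
RPO_BLOCK = timedelta(minutes=60)     # failover is not allowed above this lag

def rpo(primary_last_commit, secondary_last_replayed):
    """Replication lag between the geo-primary and geo-secondary."""
    return primary_last_commit - secondary_last_replayed

def can_failover(db_rpos):
    """Allow failover only if every database's RPO is under the blocking threshold."""
    allowed = True
    for db, lag in db_rpos.items():
        if lag >= RPO_BLOCK:
            print(f"{db}: RPO {lag} blocks failover")
            allowed = False
        elif lag >= RPO_ALERT:
            print(f"{db}: RPO {lag} - escalation fired")
    return allowed

now = datetime.now(timezone.utc)
print(can_failover({"Content_123": rpo(now, now - timedelta(minutes=5))}))   # True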

12.3.2 Capacity buildout


DR is a critical component to support high availability in SPO. Almost daily there are small incidents that
leverage DR failover to help mitigate them, and every year there are a few large-scale incidents as
well. For example, in early 2021 a severe winter storm hit the Texas region and the San Antonio
region data centers came very close to a blackout. The DR team failed over more than 200 farms,
containing 15,000+ SQL Azure databases, to other safer regions and averted potential customer risk.

This capability does not come for free. For each farm label, SPO builds two farms which are at least 250
miles apart. Each farm contains its own compute machines (USR and BOT). Each content database also
sets up a continuous copy database in the other farm. Continuous copy is a SQL Azure feature which
replicates all changes in a database to a geographically remote copy. The source database is called the
primary database, which will allow read-write access. The remote copy is called the geo-secondary
database which is read-only. Figure 12-2 Primary and Recovery Farms and Databases shows the
relationship between the farm pairs and database pairs. SQL Azure is responsible for replicating every
change that happens to the primary database to the recovery database. There is usually a small delay
before the change appears on the recovery database. This delta is the source of data loss when a
failover happens before some change could be replicated. As we know, this time delta is also called the
RPO. SPO monitors database RPO closely and will fire alerts if the RPO is over certain thresholds. More
information about SQL GeoDR and replication can be found here: https://docs.microsoft.com/en-us/azure/azure-sql/database/active-geo-replication-overview.



Note that the primary farm can contain some primary databases and some secondary databases. Thanks
to the active/active configuration which was discussed in section 12.2 Active-Active, a clump of
databases can receive traffic on either side of the farm pair.

Figure 12-2 Primary and Recovery Farms and Databases

12.3.3 DR Dashboard
DR Dashboard, accessible to all ODSP engineers at drdashboard.azurewebsites.net, is a website which
provides information on all aspects of disaster recovery in SPO. DR engineers and SREs often use it to find
information about ongoing failovers or farm DR readiness. Engineers who are doing incident
postmortems also go there to study failovers in the recent past. The website keeps detailed
information about each failover, such as the time each step was completed, error information from the
processing jobs, etc. The following picture is a screenshot of the world map view on this dashboard.
In this picture things look good overall, with only two data centers having some farms that are not
ready for failover.



12.3.4 Failover
The failover process can be executed in several different ways:

A scheduled failover is a planned failover that is normally scheduled in advance and can be postponed if
something unexpected happens. It requires OCEs/SREs to create scheduled failover activities before
failover. Those failover activities will tell other gestures to yield. Scheduled failover uses friendly failover,
which allows SQL Azure to finish all replication to guarantee zero data loss; however, if the failover gets
stuck it will elevate to forced failover and allow data loss.

A proactive failover is a response to emerging risk which will likely become a sev 2 or more severe
incident within a short amount of time. Like scheduled failover, proactive failover will attempt to
failover with zero data loss before elevating to forced failover.

The most aggressive type is called Unscheduled failover, which is usually a response to sev-2, sev-1, or
sev-0 incidents. It likely will incur a small amount of data loss during failover as it will use forced failover
at the SQL level.

A failover execution usually consists of 3 stages:

(1) Pre-Failover. Check traffic lights to see if the failover can proceed (more on this later).
(2) Failover. In this stage, the process promotes the recovery database in all the database
pairs to become the geo-primary (the original primary becomes the new geo-secondary).
Also in this step, the failover process updates the DNS entry so that traffic to the farm
goes to the corresponding recovery side. In some cases, this step also launches the
failover of other service components, such as the SPODS forest and the directory
database.



(3) Post Failover. In this step, the process updates the grid database-tenant mapping. In
some cases, it will also update the topology information used by the SPO active
monitoring system to inform it about the new primaries.

The following figure shows how this process works.

Figure 12-3 Failover Process

12.3.5 Traffic Light


Traffic Light checks the DR Readiness by constantly monitoring the health of farms and geo-secondary
DBs to make sure when an incident happens, the recovery side is healthy and ready to be failed over
into. The following figure illustrates how the DR dashboard shows the traffic light results for farms in the
SPO service.

Figure 12-4

Traffic Light has four colors:

1) Green is healthy to failover


2) Yellow is ready for failover, but there might be some minor issues or something to fix after
failover.
3) Red will block failover unless it’s overridden by OCE.
4) Grey means unknown, usually because a monitor isn’t working, or the data is too old.

Traffic Light looks at a few different dimensions to gauge DR’s health.

DR Home Page availability – this is a signal from active monitoring which is making requests to a
SharePoint homepage on a test tenant. This tests the overall health of the whole stack including



authentication, web server, and database. The DR Home page probe is IP based, not relying on
DNS.

Capacity – looks at the amount of USR & BOT VMs to ensure there is enough available capacity
in the other farm to handle all the traffic after a failover.

DB Health – First, it checks the database probe which connects to the recovery database and
does a simple query to check if the DB is available. Second, it checks that the database is
mounted to SharePoint.

RPO 60 Min – This ensures that the recovery database has the data replicated to within 60
minutes of the primary database.

Alerts – integrates with other alerts which help tell whether the DR side is healthy.

12.3.6 Auto failover


Auto failover automates failover decisions for incidents, so failover happens with no engineer
involvement. This is important because human decision-making has been one of the slower components
in the process of mitigating an incident. Automating the failover decision is critical for us to reduce MTTM
(mean time to mitigation).

A farm will be failed over automatically if one of the following criteria is met:

• Farm level active monitoring success rate < 85% for 5 minutes
• Farm level active monitoring success rate < 90% for 10 minutes
• Farm level active monitoring success rate < 95% for 30 minutes

There is also a trigger at the database level if the primary database is down and existing DB auto-heals
cannot bring it back online.
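
A sketch of the farm-level trigger evaluation; the per-minute success-rate series and window handling are simplified placeholders for the active-monitoring signal.

# (threshold, window_minutes): availability below the threshold for the whole window triggers failover.
FARM_TRIGGERS = [(0.85, 5), (0.90, 10), (0.95, 30)]

def should_auto_failover(success_rate_by_minute):
    """success_rate_by_minute: per-minute farm availability, most recent minute last."""
    for threshold, minutes in FARM_TRIGGERS:
        window = success_rate_by_minute[-minutes:]
        if len(window) == minutes and all(rate < threshold for rate in window):
            return True
    return False

# Ten minutes at 88% availability trips the "< 90% for 10 minutes" trigger.
print(should_auto_failover([0.99] * 20 + [0.88] * 10))   # True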

The following table shows the percentage of failovers automated in the first half of 2022.

Month Jan Feb Mar Apr May Jun


Total databases in failover 11243 17424 9051 8092 6278 12138
Auto failover percentage 87.00% 51.42% 98.39% 94.29% 98.62% 69.96%



Chapter 13 Distributed Cache
In SPO there is no affinity between a front-end (FE) machine, aka USR, and an incoming request. As a
result, two requests from the same user can land on two different FEs. If a resource is static in
nature or doesn't change too frequently – for example, a user's logon token, which is typically good for a
couple of hours – it can be fetched once from the store and then put in a local cache on the FE to serve
subsequent requests faster, without going to the store. Since we lack FE affinity, the same process
needs to be done on every FE where the user's request lands for the first time, which is not only slow
but also resource intensive on the store and results in duplication of cached data across machines. In
SPO, we have lots of such scenarios, which led us to move towards distributed caches. As of today, we
support two types of distributed caching technologies: Azure Redis and Distributed SSD Cache.
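
The motivating pattern here is a classic cache-aside lookup. A minimal sketch with the redis-py client is shown below; the endpoint, key naming, TTL, and token fetch are hypothetical, and SPO's actual clients are .NET rather than Python.

import redis

# Hypothetical Azure Cache for Redis endpoint for one content farm.
r = redis.Redis(host="contoso-farm-cache.redis.cache.windows.net", port=6380, ssl=True,
                password="<access-key>")

def fetch_token_from_store(user_id):
    """Placeholder for the expensive call to the backing store / token issuance path."""
    return f"token-for-{user_id}".encode()

def get_logon_token(user_id):
    """Cache-aside: try the distributed cache first, fall back to the store on a miss."""
    key = f"logontoken:{user_id}"                 # hypothetical key naming
    token = r.get(key)
    if token is None:
        token = fetch_token_from_store(user_id)
        r.setex(key, 2 * 60 * 60, token)          # cache for a couple of hours
    return token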
13.1 Azure Redis Cache
Azure Redis Cache is a hosted solution provided by Microsoft Azure, where the cache items live in RAM
on the VMs hosted by Azure Redis and are accessible only from the clients within a content farm, e.g.,
USRs/BOTs.

13.1.1 Why Azure Redis?


The first distributed cache solution was built in 2010 on a technology developed by Microsoft on Windows
called AppFabric Cache, aka Velocity, where we had a per-farm cluster consisting of a dedicated VM role
called DCH (Distributed Cache Host). We were in charge of installing, deploying, configuring, upgrading,
and maintaining the DCHs. The solution worked well for the first few years, but as SPO continued to grow,
so did our caching needs, and we started experiencing a number of issues in our caching infrastructure.
Our typical cluster size grew from a few nodes to 40-50 nodes; however, Velocity was not designed for this
scale, and large clusters were less reliable and had a direct adverse impact on farm availability. Scale-out
issues further curtailed our ability to onboard new scenarios and increased time to repair. Around 2015,
Velocity was no longer officially supported, so a lot of duct taping and reverse engineering/hacks
were needed on a continuous basis to keep the service running. All in all, a maintenance nightmare!
Additionally, SPO's caching infrastructure had its own shortcomings, mainly its tight coupling with the host
(SPO) and the underlying caching technology.

To address these challenges, in 2019 we started our journey to build the next version of the caching
platform. The first task was to find a replacement for Velocity that would be highly available and reliable,
scalable, high-throughput, secure and compliant, well proven, ready to use and officially supported, and,
most importantly, apt for our caching scenarios in terms of both speed and access patterns.

In SPO, at a high level, our caching scenarios can be pivoted by payload size: small (<=8KB), large
(>=512KB), and medium in between; or by access pattern, such as single-value puts/gets, multi-value
gets, optimistic concurrency, etc. The majority of our caching scenarios fall under small payloads with
single-value puts/gets.

With these prerequisites in mind, we did not have many choices of caching technology unless we wanted
to build one ourselves; however, we wanted something ready to use, and building a distributed cache
from scratch is not trivial. What about open-source caches like Memcached and Redis? Unfortunately,
these technologies are geared more towards Linux. They did have, or at least had at that time, a Windows
version, but the support story was not coherent, breaking our requirement of being officially supported.



Fortunately, there were two ready-to-use caching technologies available: Azure Redis as a Service and
Object Store. Based on our tests, Redis was fastest for small and medium payloads, while Object Store
excelled for large payloads. Additionally, Redis provides a rich set of features such as atomic counters,
sorted sets, hash tables, pub/sub, etc., while Object Store had the capability to act as a global store. It is
worth mentioning that, in order to talk to these remote caches, the traffic would normally need to be
SNAT'd, and our current SNAT devices do not have enough capacity to handle the load; as a result, in SPO
we bypass the SNAT devices, talk directly to the Azure endpoint, and teach both the SPO and Azure
endpoints to understand our clients' (USR/BOT) backend IPs. This is called Leak Routing in SPO lingo and
adds one more prerequisite: the caching technology must support Leak Routing, which Azure Redis can
support but which was not supported/recommended for Object Store.

Based on our needs we chose to replace Velocity with Azure-Redis!

13.1.2 Deployment and Topology


Azure Redis caches are deployed under SPO Azure subscriptions; we have one such subscription for
every region and one service principal per environment. The service principal is the identity used to
perform administrative operations such as creating or deleting Azure Redis clusters. Currently, we have one
Redis cluster per content farm ID. Most of our farms use the Redis P2 SKU with 3 shards, where a shard
consists of a machine pair for high availability. There are a few exceptions, for example MSIT, where we
use a higher SKU and more shards to handle the load.

These clusters are deployed as part of content farm deployment via Grid jobs and are co-located
in the same region as the content farm for better latency. Additionally, we configure cluster-level
alerts and IP filters, set up static routes to support leak routing, and enable DNS hardening.

13.1.3 Expiry and Eviction


We provide TTL-based expiry and LRU-based eviction.
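
In Redis terms this maps to a per-key TTL plus a server-side LRU maxmemory policy. A rough sketch using
the redis-py client (the endpoint, port, and password are placeholders, and the actual SPO client goes
through the cache client library described in 13.1.10):

    import redis

    r = redis.Redis(host="contoso-farm.redis.cache.windows.net",  # placeholder endpoint
                    port=6380, ssl=True, password="<connection-string-secret>")

    # TTL-based expiry: the item disappears automatically after one hour.
    r.set("user:123:logon-token", b"<token bytes>", ex=3600)

    # Eviction is a server-side setting; on a self-managed Redis it would be configured as:
    #   CONFIG SET maxmemory-policy allkeys-lru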

13.1.4 Consistency Model


As mentioned earlier, a shard is a pair of VMs – Primary & Follower. The keys are synchronously written
to the primary and then asynchronously replicated to the follower. The replication usually happens
within microseconds. However, there could be a corner case, where a key got updated on the primary
and the primary went down before the update could be replicated to the follower. In this case, there is
an infinitesimal chance that we may return stale data. Caches are by design allowed to be stale; most of
our scenarios can live with staleness, and if a scenario requires accuracy, we encourage it to do some
verification on the returned payload at its end. For all practical purposes, Azure-Redis provides a
strong consistency model.
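
For the rare scenario that cannot tolerate a stale read, verification usually means carrying something
checkable inside the cached payload. A hypothetical sketch (the wrapper format and function names are
illustrative, not part of the cache library):

    import json

    def wrap(payload: dict, source_version: str) -> bytes:
        """Store the authoritative version stamp alongside the cached value."""
        return json.dumps({"version": source_version, "data": payload}).encode()

    def unwrap_if_current(blob: bytes, current_version: str):
        """Use the cached data only if it still matches the authoritative version."""
        entry = json.loads(blob)
        return entry["data"] if entry["version"] == current_version else None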

13.1.5 Security
Client-to-server authentication happens via per-cluster connection strings, and communication uses SSL
over TCP. By default, all payloads are encrypted with a per-farm secret key before being transmitted over
the wire. Additionally, we provide key scope verification to avoid cache leakage.

Both the Redis connection strings and the secret keys are rotated periodically; furthermore, we configure
firewall rules on each cluster to keep the system secure and compliant.
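
The encrypt-before-transmit and key-scope ideas can be illustrated roughly as follows. This is not the
actual SPO implementation; the Fernet key stands in for the per-farm secret, and the scope prefix stands
in for key scope verification.

    from cryptography.fernet import Fernet

    farm_key = Fernet.generate_key()   # stand-in for the per-farm secret, rotated periodically
    cipher = Fernet(farm_key)

    def cache_put(redis_client, scope, key, payload: bytes):
        # The value is encrypted before it ever leaves the farm, and bound to a key scope.
        redis_client.set(f"{scope}:{key}", cipher.encrypt(payload))

    def cache_get(redis_client, scope, key):
        blob = redis_client.get(f"{scope}:{key}")
        if blob is None:
            return None
        # Decryption fails if a different farm's key (or a tampered value) is used.
        return cipher.decrypt(blob)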



13.1.6 DNS hardening
Because Azure Redis endpoint IPs are static in nature, we harden the hosts file on each VM (USR/BOT) to
protect ourselves from DNS outages. This is done via a per-VM timer job, which periodically resolves the
Redis endpoints and updates the hosts file with the new IP in case there is a change.
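
The mechanism is conceptually similar to the following sketch; the endpoint name is a placeholder and
the real logic runs as a per-VM timer job with appropriate error handling.

    import socket

    HOSTS_FILE = r"C:\Windows\System32\drivers\etc\hosts"
    REDIS_HOST = "contoso-farm.redis.cache.windows.net"   # placeholder endpoint

    def harden_hosts_entry():
        ip = socket.gethostbyname(REDIS_HOST)             # current DNS answer
        with open(HOSTS_FILE, "r", encoding="utf-8") as f:
            lines = [l for l in f.read().splitlines() if REDIS_HOST not in l]
        lines.append(f"{ip} {REDIS_HOST}")                # pin the last-known-good IP
        with open(HOSTS_FILE, "w", encoding="utf-8") as f:
            f.write("\n".join(lines) + "\n")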

13.1.7 Leak Routing


By default, all outbound traffic across the SPO logical network gets SNAT'd at our F5 devices. However,
due to the limited capacity of the SNATing devices, we often run into availability issues. To address this
problem, we deployed our Azure Redis caches in a special HyperNet-Lite environment, which enables
communication between SPO VM backend IPs and Azure Redis public IPs, thus eliminating the need for
SNATing. This is done by setting up static routes using a timer job.

13.1.8 Azure-Redis Only Failover


To help farm availability, we provide a manual Redis-only failover capability, wherein we can direct cache
traffic from the primary content farm to the Redis cluster of the DR farm. Note that we do not replicate
data between the PR and DR Redis clusters, so the caches are cold-started when we do a Redis-only
failover or a full farm/clump failover.

13.1.9 Scaling
By default, Azure-Redis does not provide elastic scaling; however, we have built mechanisms to manually
scale a Redis cluster in/out or up/down based on our needs.

13.1.10 Cache Client Library


As mentioned earlier, one of the shortcomings of the original caching infrastructure was its tight coupling
with the host (SharePoint) and the underlying caching technology (Velocity). This made it exceedingly
difficult to reuse the client code among various hosts and hindered our ability to quickly change the
underlying caching technology.

To solve this problem, we created a cache client library that is agnostic of both the host and the caching
technology. At a high level, there are two components:

1. Cache provider: a set of APIs/functionality specific to a particular caching technology; for Redis, it
is the set of APIs exposed by StackExchange.Redis 2.0.601.
2. Pattern-based caches: instead of programming against a provider's specific APIs, the library
exposes common caching pattern constructs such as IDistributedCache, which supports simple
puts/gets, IDistributedCacheSlidingWindowCounter, which supports a global atomic counter,
IDistributedCachePubSub, and so on. This gives us the ability to use multiple caching providers,
such as Redis and SSD, to better serve our caching needs while remaining completely transparent
to the callers.

Additionally, the library provides default serialization, compression, and encryption functionality, which
can be extended at the caching-scenario level if needed, along with many more performance and
reliability optimizations. For detailed information please visit the SPO Distributed Cache Wiki.
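
To make the split between providers and pattern-based caches concrete, here is a minimal Python
stand-in for the .NET interfaces named above (class and method names are illustrative; the real library
is .NET and exposes many more patterns and options):

    from abc import ABC, abstractmethod
    from typing import Optional

    class IDistributedCache(ABC):
        """Pattern-based contract callers program against; providers are swappable."""
        @abstractmethod
        def put(self, key: str, value: bytes, ttl_seconds: int) -> None: ...
        @abstractmethod
        def get(self, key: str) -> Optional[bytes]: ...

    class RedisCacheProvider(IDistributedCache):
        def __init__(self, client):                 # e.g. redis.Redis(...)
            self._client = client
        def put(self, key, value, ttl_seconds):
            self._client.set(key, value, ex=ttl_seconds)
        def get(self, key):
            return self._client.get(key)

    class InMemoryCacheProvider(IDistributedCache):
        """Stand-in for a second provider (e.g. SSD); same contract, different storage."""
        def __init__(self):
            self._store = {}
        def put(self, key, value, ttl_seconds):
            self._store[key] = value                # TTL handling omitted for brevity
        def get(self, key):
            return self._store.get(key)

Because callers depend only on the pattern interface, the underlying provider can be switched without
touching scenario code.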



13.1.11 Telemetry, Monitoring and Alerts
Cache infrastructure provides extensive end-to-end telemetry (Geneva/Cosmos/PowerBI) for all cache
scenarios. Currently, we have 30+ scenarios leveraging the caching infrastructure. While we collect
telemetry for all scenarios, we only monitor and alert on marquee scenarios. However, each scenario
can leverage the telemetry data to create its own alerts.

13.1.12 SLA
We provide 99.99% availability and 5 ms at P95 for small payloads of <=8KB.

13.2 Azure Redis Persisted Cache


Azure-Redis Persisted Cache is almost like the Azure Redis cache described in section 13.1, except that
the cache items survive shard outages and reboots. This is done by writing each operation asynchronously
to a log file stored in a per-cluster storage account. When a shard comes back online, it replays the log to
warm up the cache.

13.3 SSDCache
SSDCache is SPO's in-house, organically built key/value distributed cache. It leverages unused SSD disk
space on USR VMs to build a per-farm cache cluster; by persisting cache objects on the disks in the
cluster, it helps onboarded components save cost and improve performance.

13.3.1 Why SSDCache?


• SPO content farm capacity:
o In all SPO farms, USR VMs have a large amount of unused disk space most of the time.
For example, in the MSIT farms, which each have ~250 USR VMs, the total unused SSD
space (considering only C:\) per farm is ~34TB.

• Performance:
o SSDCache provides extremely low latency, especially for large payloads. All cache data
is persisted and transferred within the SPO farm, and we try to maximize connection
reuse (currently >99% of cache requests land on existing connections). For payloads
under 1MB, the P95 get latency is <10ms; for payloads under 4MB, the P95 get latency
is <30ms.
• COGS:
o An SSDCache cluster is built using existing USR VMs; no additional hardware is required
and there is no transactional cost for onboarded scenarios.
o All data is persisted and transferred within the farm only, with almost no network cost.



13.3.2 SSDCache Cluster
• SSDCache uses a ring-based consistent hashing algorithm to distribute cache items (based on
the cache key) across the nodes in the SSDCache cluster (see the sketch after this list).
• A Windows service, the Cache Service, is built and deployed on every USR VM. Once the service is
up and running, it performs self-diagnostics first; upon passing the diagnostics, it joins the
cluster by sending a heartbeat to the cluster cache store (in SPO farms, Azure Redis is used as the
cluster store). The Cache Service also subscribes to the Azure Redis pub/sub system, so that it
gets notified of cluster changes (nodes joining/leaving the cluster).
• In an SSD cluster, there is one and only one node acting as the "leader" node, which receives all
heartbeat information to detect whether the cluster has changed; if so, the cluster information is
updated in the cluster store and also published to the cluster.
• Each SSDCache client gets the cluster information from the Cache Service and uses the consistent
hashing algorithm to find the node that owns a given cache key.
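
As referenced above, the key-to-node mapping can be sketched with a standard consistent hashing ring;
the node naming and virtual-node count here are illustrative, not SPO's actual parameters.

    import bisect
    import hashlib

    class HashRing:
        """Minimal ring-based consistent hashing for mapping cache keys to cluster nodes."""

        def __init__(self, nodes, vnodes=64):
            self._ring = []                              # sorted list of (hash, node)
            for node in nodes:
                for i in range(vnodes):                  # virtual nodes smooth the distribution
                    self._ring.append((self._hash(f"{node}#{i}"), node))
            self._ring.sort()
            self._hashes = [h for h, _ in self._ring]

        @staticmethod
        def _hash(value: str) -> int:
            return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

        def owner(self, key: str) -> str:
            idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._ring)
            return self._ring[idx][1]

    ring = HashRing([f"USR-{i:03}" for i in range(250)])
    print(ring.owner("tenantA:logon-token"))             # node that owns this cache key

When a node joins or leaves, only the keys on the affected arc of the ring move, which is why a cluster
change re-shards only a fraction of the keys (see 13.3.7).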

13.3.3 SSDCache protocol (COP)


• The Cache Operation Protocol (COP) was created for cross-machine data transfer:
o TCP is used to transfer cache messages (request/response) and cache items;
o Each TCP connection is authenticated using certificate-based TLS;
o Both client and server certificates are required for connection authentication, to make
sure the data transfer can only happen within SPO farms;
o All cache messages are serialized/deserialized using protobuf to improve performance;
o All cache messages and cache data are framed by data size;
o We pre-create connections for critical scenarios, so that the majority of cache
requests land on existing connections.
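
COP itself is internal to SPO, but two of its building blocks, mutually authenticated TLS and size-framed
messages, can be sketched as follows (certificate paths are placeholders; the real protocol also carries
protobuf-serialized cache messages):

    import ssl
    import struct

    def make_client_context(ca_file, cert_file, key_file):
        # Mutual TLS: the client presents a certificate and verifies the server's,
        # so traffic can only flow between machines holding farm-issued certificates.
        ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
        return ctx

    def send_framed(sock, payload: bytes):
        # Each message is prefixed by its size so the receiver knows where it ends.
        sock.sendall(struct.pack("!I", len(payload)) + payload)

    def recv_framed(sock) -> bytes:
        (length,) = struct.unpack("!I", _recv_exact(sock, 4))
        return _recv_exact(sock, length)

    def _recv_exact(sock, n):
        buf = b""
        while len(buf) < n:
            chunk = sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("peer closed connection")
            buf += chunk
        return buf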

13.3.4 Deployment and Topology


• SSDCacheClient: integrated with the SPO cache client library; as part of the SPO content farm
deployment process, the module is available on all VMs in SPO content farms.
• CacheService: as part of the SPO content farm deployment process, the Cache Service (a Windows
service) is deployed on all VMs, but it is only enabled and running on USRs.

13.3.5 Expiry and Eviction


SSDCache provides TTL-based expiry and FIFO-based eviction. For each onboarded scenario, a disk quota
is provisioned on a per-VM basis. An SSDCache timer job runs hourly on each USR VM to clean up
expired cache items and to evict cache items when a scenario is close to exhausting its pre-configured
disk quota.
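
The hourly cleanup can be pictured roughly as below; the folder layout, quota value, and function name
are hypothetical stand-ins for the per-scenario configuration.

    import os
    import time

    def clean_scenario_quota(folder, quota_bytes, ttl_seconds):
        """Drop expired items, then evict oldest-first (FIFO) until back under quota."""
        now = time.time()
        entries = []
        for name in os.listdir(folder):
            path = os.path.join(folder, name)
            st = os.stat(path)
            if now - st.st_mtime > ttl_seconds:
                os.remove(path)                          # TTL-based expiry
            else:
                entries.append((st.st_mtime, st.st_size, path))
        entries.sort()                                   # oldest first
        used = sum(size for _, size, _ in entries)
        for _, size, path in entries:
            if used <= quota_bytes:
                break
            os.remove(path)                              # FIFO eviction under quota pressure
            used -= size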

13.3.6 On-boarding process


SSDCache has been integrated with the cache client library to make onboarding easy. In SPO, default
SSDCache usage settings have been created; each scenario can use those settings or extend them to meet
its requirements.

13.3.7 SSDCache Challenges


• In an SPO farm, each VM is designed to be stateless; at any time, new VMs can be added and
existing VMs removed. This brings two challenges to the SSD cluster:


o It is practically impossible to build a primary/secondary pair for each node to mitigate
the availability drop when a node goes down;
o When the cluster changes due to nodes joining or leaving, a certain percentage of cache
keys get re-sharded, and there is a time window in which the client and server have
different versions of the cluster;
o SSDCache has a couple of configurable cluster timing parameters that can be tuned to
minimize the impact of cluster changes, e.g., the time threshold for the leader node to
decide whether it should remove a node from the cluster due to missing heartbeat
information, and how frequently clients should pull cluster information from the Cache
Service.
• Currently, SSDCache only supports data transfer within an SPO farm. There is no pipeline to
replicate SSDCache items to the DR farm. When a failover happens, the DR farm has to go through
the cache warm-up process; the time it takes to rebuild the cache varies by scenario.



Chapter 14 Telemetry Ecosystem – Instrumentation, Monitoring and
Alerting

14.1 Primary Azure Tools Used for the SPO Instrumentation, Monitoring and Alerting
ecosystem
SPO leverages the Geneva Monitoring ecosystem from Azure, which provides the technology foundation
for our Instrumentation, Monitoring and Alerting capabilities. The same data also feed the
ODSP Business Metrics (Analytics) capability. Read more about the Geneva Monitoring
Infrastructure here.

• Instrumentation: data that is logged and then processed, in some cases aggregated, and then
stored. Logs and Metrics (defined below) are examples of instrumentation. Instrumentation
can come from our servers, our clients, or even synthetic monitoring systems.
• Monitoring: the process whereby the aforementioned Instrumentation data are used to assess
and then optionally act on service health issues.
• Alerting: the active and automated process whereby the aforementioned Instrumentation data
are conditionally assessed and if static or dynamic thresholds are met, generates a ticket
through the Microsoft Incident Management system known as ICM. These tickets, based on
rules, generate either paging (phone call) or non-paging (email) alerts.

Our IMA ecosystem generally categorizes these data into two broad types – Logs and Metrics:

• Logs: Logs are the raw schematized data that are emitted and then processed and stored by
Geneva, usually called MDS. Logs are typically queried by the dgrep interface provided by
Geneva. Getting access to the SPO logs is described in the Wiki here.
• Metrics: A Metric is an aggregated measurement of an event or series of events that have
occurred over a period of time, generally over a set of defined dimensions. A simple example is
the Total number of Successful (aka TotalSuccess) Executions (the Metric) for a given API (the
Dimension). These data are aggregated on each node in one-minute increments across a pre-defined
set of dimensions, which is called a Preaggregate (a small sketch of this rollup follows this list).
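
As referenced above, a Preaggregate is easiest to picture as a per-node, per-minute rollup over a fixed set
of dimensions. The event shape and names in this sketch are hypothetical; the real pipeline is handled by
the Geneva agent.

    from collections import defaultdict

    def preaggregate(events, dimension_keys=("Api",)):
        """One-minute rollup of successes per (minute, dimension values)."""
        buckets = defaultdict(int)
        for e in events:
            minute = int(e["ts"] // 60) * 60             # truncate to the minute boundary
            dims = tuple(e[k] for k in dimension_keys)
            if e["success"]:
                buckets[(minute, dims)] += 1             # the "TotalSuccess" metric
        return dict(buckets)

    sample = [{"ts": 1700000000 + i, "Api": "GetListItems", "success": True} for i in range(120)]
    print(preaggregate(sample))   # per-minute success counts keyed by minute and API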

A few other key terms and technologies that are used by this ecosystem include:

• CIL: Our Common Instrumentation Library, a NuGet-packaged wrapper around IFX, the SDK
used by Geneva to collect logs and metrics on nodes.
• MA: The Geneva Monitoring Agent, a nodal process that is responsible for uploading
logs and aggregated metrics to the Geneva Ingestion service, thereby enabling Monitoring and
Alerting.
• Geneva Dashboards and Widgets: Geneva Dashboards are collections of Widgets which
generally show temporal summaries of Metrics, and they are the primary tool used by SPO for
Monitoring the service. Dashboards are created and accessed from the Geneva site.
• DGrep: Geneva DGrep is the primary interface used to query our Logs. It is also accessed from
the Geneva site.



14.2 SPO-Specific technology and tools that in some cases augment Geneva
• ULS
• Radio
• Passive/Active concepts:
o Passive: QosService*
o Active: Geneva Synthetics and the various Runners in use
• XamHB tenants



Chapter 15 SPO Security
Our aim is to make SharePoint Online the safest cloud for the world’s most valuable data.

15.1 Security Fundamentals


Maintaining the security, privacy, and compliance of our customers’ data is the responsibility of every
engineer – just like accessibility, performance, and reliability.

The security of our product is always improving or degrading – there is no "status quo". As we build new
features and leverage new technologies, we introduce new risk. Likewise, as threat actors become more
capable, existing hidden risks become visible and exploitable.

To achieve our goal of being the safest cloud for the world’s most valuable data, we invest in Security
Fundamentals – a continuous process of risk identification, monitoring, and reduction:

• Security reviews protect the product as designed. Our goal is to empower engineers to achieve
business outcomes without increasing risk. We strive for an attitude of “yes and here’s how”
rather than a “culture of no”.
• Security monitoring protects the product as implemented. It allows us to uncover new risks
soon after they are introduced, and to detect abuse of existing risks that haven’t yet been
mitigated.
• Security investments are made by every team at every layer of the product to make bad
security outcomes less likely or less impactful.
• Security research keeps us grounded in reality. Pen-test exercises and bug bounty submissions
tell us the truth about our product and where we need to improve. Industry research helps us
identify trends or new classes of vulnerability that we need to stay ahead of.

We do this work with a “buck stops here” mentality – we leverage the work of others where possible
but never outsource the accountability for protecting our customers.

15.2 Security investments at every layer


Every layer of the SharePoint Online platform is designed in a way that minimizes the likelihood and
impact of a bad security outcome:

• Engineering systems
o Code changes require Yubikey proof-of-presence to ensure that a trusted engineer
signed off on the changes in the PR.
o Builds execute on trustworthy servers and build outputs are signed.
o Only signed code is allowed to execute on high-risk roles such as domain controllers.
• Service management
o The service is designed with breach boundaries such that an intrusion in a SharePoint
farm cannot spread to other farms, limiting the impact to <1% of customers.
o Grid Manager uses certificate authentication and Job Agent rather than domain
accounts or Remote PowerShell to prevent credentials from being stolen.
o The Windows firewall is configured to restrict network connectivity between
environments.



o SharePoint farm VMs are harvested (torn down and replaced) every 60 days to limit the
lifetime of malware on a given VM.
• Content storage
o Customer content is encrypted by SharePoint before being transmitted to Azure
storage. Farms use least-privilege SAS tokens that only allow new content to be added,
never deleted or overwritten. Storage account keys are rotated automatically.
o Access to SQL databases is controlled by a firewall policy that only permits access from
SharePoint-owned address ranges.
o BitLocker encrypts datacenter drives using a TPM-backed key.
• Web application
o IIS runs under a least-privileged virtual account which does not have admin privileges,
cannot write .aspx pages to product directories, and cannot access high-privileged
certificates
o DotNetLocker and Detours prevent unsafe ASP.NET controls or unsafe APIs such as
Process.Start from being executed by an attacker.
o Web.config secrets are encrypted and automatically rotated.
o High-risk functionality such as Sandboxed Solutions, 2010 Workflows, and the
WebPartPages web service are disabled.
• Authentication and authorization
o Logical permissions restrict the APIs that pre-authorized applications can access.
• Customer experiences
o CDN assets are backed by immutable storage and referenced using subresource
integrity.
o Content security policy is used to prevent script execution in modern list and library
experiences.

15.3 Security monitoring


We monitor every technology used by SharePoint Online to detect unauthorized access that could
impact the confidentiality, integrity, or availability of customer content.

Our goal is to detect unauthorized access within 10 minutes and evict the intruder within 3 hours. We use
a variety of agents and microservices to accomplish this:

• Windows compute is monitored using the HostIDS security agent which captures ETW telemetry
and sends it to a real-time service called Observer.
• Azure storage, SQL, and Key Vault resources are monitored using a service called Azure Log
Monitor which consumes audit events from every data plane interaction with these services.
• SharePoint emits ETW events for each incoming S2S request. These are captured by the HostIDS
security agent and analyzed in the Observer pipeline to detect activity by a stolen first-party
identity.
• We consume many other sources of data such as Azure DevOps activity, AzureRM control plane
activity and Azure Networking syslog activity to detect unauthorized activity.
• Detection results from every source are ingested into an in-memory graph database called
ClusterBot.
• SPO engineers review detection results daily to identify unauthorized activity.



These services operate on a massive scale:

• HostIDS agents process 300 million events per hour across 500K servers.
• These agents send 130K Windows security observations per hour to the Observer service.
• Azure Log Monitor processes 13 billion data plane events per hour
• Our systems flag 4000 suspicious events per hour out of these telemetry streams
• Of these, roughly 25 per hour are sent for human review.
• On average, high-risk activity results in a paging alert to the on-call engineer every 4 days.

We exercise detection and response in a yearly red team engagement to ensure that our tech, people,
and processes are all prepared to handle a real threat actor.



Chapter 16 SPO SRE
SRE is what you get when you treat operations as if it’s a software problem. The main goals are to create
scalable and highly reliable software systems.

As SREs, we flip between the minutiae of server-level disk write latencies and the macro view of how to
ensure the redundancy of critical ODSP services during unprecedented outages, at enormous scale.

16.1 Our Vision & Mission


Drive improved reliability in the service, innovation, and the adoption of best practices of Site Reliability
engineering across ODSP.

Obsess on service excellence by understanding real life customer experience based on extensive
availability and reliability signals. Through the provision of a highly available and reliable product, SRE
aims to empower every ODSP user to achieve more.

16.2 Team Structure


The SRE team is broken down into five distinct sub-teams:

• Livesite
o Spread across three regions (Redmond/Dublin/Suzhou) whose OCEs take responsibility
for Farm/Datacenter/Regional level outages.
• Incident Management
o Develop, implement, and support best in class incident management practices across
SPO.
• Insights
o Analyzing Petabytes of ODSP data and telemetry to surface insights that drive reliability,
performance, and efficiencies back into the service.
• Customer Response Team
o The last line of support for Microsoft’s largest SPO customers, leveraging SRE practices to
drive optimal user experience.
• Performance & Debug
o Deep diving into the most difficult of SPO outages and investigations ensuring durable
improvements make their way back into many areas of the product.



16.3 Incident Response – Follow the Sun
SPO incidents of severity 1 or 0 are run by the on-call SPO Incident Manager. The IM is supported on the
incident by one or more SRE OCEs. Coverage is spread across the globe in 6-8 hour shifts so that timely
service restoration is the responsibility of engineers who are awake and within their business hours.
For farm-level Sev2 alerts, the SRE Livesite OCE is engaged.

[Figure: follow-the-sun coverage showing Incident Manager coverage and Livesite OCE coverage across Redmond, Dublin, IDC, and Suzhou]

16.4 Responsibilities and Managing the Service


Essentially ODSP SRE is responsible for service reliability, incident prevention & response, and no repeat
impacts. This is achieved through multiple approaches:

16.4.1 SRE High Level Objectives


In every planning cycle, SRE reviews the business needs and adapts our approach, staying in line with the
principles of Site Reliability Engineering and focusing on the following objectives:
• Objective 1: Service operates according to expected Service Level Objectives - Reliability
• Objective 2: Service is scalable and maintainable, and we keep customers and a happy team in
the process - Sustainability
• Objective 3: Drive Product discovery and research to improve systems, processes, or the product
- Innovation



16.4.2 SLIs / SLOs
Service Level Indicators (SLIs) define the reliability of a service as numerical indicators that can
be accurately measured over time. They directly indicate the health, availability, and performance
of a service through metrics such as latency, throughput, and errors/failures per X number of
requests. Examples:
• Farm Availability % - Detailed list
• Farm Reliability (QOS)%

Service Level Objectives (SLOs) are typically defined as an SLI threshold or range: a range of values of a
service level indicator that needs to be maintained for the service to be acceptably reliable.
SLOs defined using these SLIs should be specified as SMART goals: Specific, Measurable, Achievable,
Relevant, and Time-bound, so that they are precise and can be clearly measured.

At a high level for SPO, we are targeting availability at different indicators depending on the scenario.
Farm availability is expected to be at 99.99% and latency below 1000ms. Incident Severity is determined
by factors such as failure rate and impact duration.
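
As a concrete illustration of how an SLI measurement is compared against such an SLO, the following
sketch computes availability and the remaining error budget over a window; the request counts in the
example are made up.

    def availability_pct(successes: int, total: int) -> float:
        return 100.0 * successes / total if total else 100.0

    def error_budget_remaining(successes: int, total: int, slo_pct: float = 99.99) -> float:
        """Fraction of the SLO error budget still unspent over the measurement window."""
        allowed_failures = total * (1 - slo_pct / 100.0)
        actual_failures = total - successes
        return max(0.0, 1 - actual_failures / allowed_failures) if allowed_failures else 0.0

    # Example: 10 million requests with 600 failures against the 99.99% farm availability SLO
    print(availability_pct(9_999_400, 10_000_000))        # 99.994
    print(error_budget_remaining(9_999_400, 10_000_000))  # 0.4 of the budget left for the window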

16.4.3 Observability
SRE farm level monitoring is scoped to 4 buckets:
• Availability and Latency using active monitoring.
• Outside-in availability based on a 3rd party service.
• Passive monitoring based on SPO usage data.
• Database down monitoring based on probes and performance indicators.

Availability and Latency – Active monitoring


• Utilizing Azure Runner tech to probe Synthetic/XamHB (fake) SPO tenants which live in every
content database worldwide.
• Success and speed of page load/render is recorded.
• Mimics the experience of real users residing in the same database.

Classify, Escalate, Notify (CEN) Matrix: https://Aka.ms/ODSPCEN

Uptrends – Active monitoring


• 3rd Party SaaS solution used for HTTP request monitoring. Independent of Microsoft
infrastructure.
• Probes run randomly from different locations worldwide, hitting XamHB tenants
located in different SPO farms. The platform alerts when checks fail 5 times.

QOS – Passive monitoring


• Monitors that track the results of real user requests in Geneva QOS metrics. Alerts at two
scopes:
o Farm Health



o Tenant Health

Database Down Probes


• A recurring process running from BOT servers probes each content database in a farm
regularly.
• Results are aggregated and will alert at Sev 3/2/1 based on duration of downtime and
ability of automation to mitigate.

16.4.4 Toil Automation


Reducing toil is essential in scaling support for hyperscale cloud services like ODSP. Particularly
relevant for our service are operability challenges in air-gapped and sovereign clouds, where manual
touches must be kept to a minimum. Here are a couple of examples of how the ODSP SRE team
implements this goal.

- Reducing manual touches in the environment.


o Repetitive manual incident mitigation operations are replaced by automation.
o SRE drives toil reduction for all teams across ODSP in areas such as replacing manual
Merlin job execution with an automated equivalent.
- Ticket Enrichment
o An application based on Geneva Automation framework that injects critical information
into alerts to aid quicker mitigation
- Node Health Management
o Automating tasks such as bringing unresponsive servers back online and creating
hardware tickets for Datacenter Ops
- Auto Failover



o Automatic SPO farm failover to recover service availability.

16.4.5 Root Cause Analysis


Every alert we receive presents a chance to improve the service. Either there is a real issue in the
product or infrastructure that needs to be addressed or monitoring can be improved. The key to
capitalizing on these opportunities is diligent follow up on every outage, completing the root cause
analysis that enables repair items which push fixes back into the service.

SRE's responsibility here is manifold:

- Complete RCA in areas of SRE expertise like SQL / Network / Performance and generate
descriptive repair items to the relevant feature teams.
- Partner across ODSP/M365/Azure feature teams on RCA investigations that are ambiguous or
span multiple areas of responsibility.
- Reporting in different forums such as Livesite Health Review and Monthly Service Review
surfacing key metrics like
o Alerts trends
o Recurring issues
o RCA completion rates
- Postmortems
o Larger scope incidents require a postmortem to be completed which is driven out of
SRE. Part of this process is a follow up mechanism which ensures repair items which
were committed to are delivered.

16.5 SRE Tooling


The SRE team leverages a 1ES approach where it is possible to develop the tooling required to support
and improve the reliability of the service. Some examples include:

16.5.1 Geneva / ICM Automation


o How is it used by SRE?
▪ Ticket Enrichment & Component Health
▪ ICM Alert aggregation and management
▪ Auto Failover
▪ Auto healing
o Documentation
▪ Ticket Enrichment Implementation
▪ Geneva/ICM Automation
▪ Geneva Actions



16.5.2 Merlin
Merlin is the primary client used by On-Call engineers to manage the SharePoint Online service.
SRE have extended Merlin in several areas to develop tooling to aid in fast mitigation and
investigation of the service.
o Periscope is a Merlin function that provides all the information an OCE needs while
responding to an availability or QOS outage on a Content farm. It abstracts away longer-winded
standard Merlin commands into short, easy-to-remember aliases that output farm data in an
easy-to-consume, logical format.



▪ Try it in a Merlin window: start-periscope.ps1 -label <farmlabel>
▪ User guide

16.5.3 Geneva Metrics/Logs


Metrics and logs are key to the successful operability of the SPO service. SRE focuses on
specific areas here, such as:
o Dashboards
Dashboards tell us what is happening in real time around the general health of SPO farms
and their dependencies. We look at everything from availability and reliability charts to
specific component level performance counters. It is important to us that we have a view
from node level all the way up to globally aggregated data
http://aka.ms/spoavailability
http://aka.ms/spooutage
http://aka.ms/SPOQoS
o Monitors
Our monitors and alerts are primarily driven from Geneva metric data (aka MDM). The
hotpath SLA provided by MDM ensures that we know within 1 - 2 minutes that there is
an outage.
o Logs
Geneva Logs are one of the main resources which drive Root Cause Analysis of Livesite
incidents. Some log sources are specifically important:
I. ULS – SPOs application log
II. RequestUsage – Location where every request to SPO is logged.
III. Server Event Logs – Standard windows event logs
IV. Monitoring probe detailed results – Results from synthetic availability and
databases probes

16.5.4 Kusto / Partner Telemetry


SPO depends on many first-party services to run different parts of its technology stack. It is crucial
that we have a view into the health of each of these dependencies. Much of this
data comes from access to Kusto clusters owned by each of these organizations. For example:
o Azure Network
o Azure Blob
o Azure SQL
o Redis
SRE leverages this data to make automatic determinations at incident time about what
components in the stack are failing. This data is also central to enabling full RCA which reaches
into these areas beyond the control of SPO.



Chapter 17 Acknowledgement

The following people have contributed content to this book, in alphabetical order.

Nick Alfeo Haoran Huang Corey Roussel


Saurav Babu Ankur Jauhari Michael Schulz
Noa Barsky Curtis Johnson Patrick Simek
Manjari Bonam Shruti Kasetty Jeff Steinbok
Tyler Boone Junning Liu Matt Swann
Kevin Chan Bob Ma Randy Thomson
Le Chang Tamir Melamed Nidhi Verma
Ronak Desai Akshay Mutha Ziyi Wang
Bhavesh Doshi Mayur Naik James Walsh
Manu Gautam Rahul Nigam Edmund Wong
Andy Glover Dylan Nunley Pengfei Xu
Burra Gopal David Oliver Jun Zhang
Joe Hamburg Derek Park Weiye Zhang
James Harriger Paulo Rodrigues Mengfan Zou
Pete Harwood Wesley Zheng

