Understanding SharePoint Online v2
SharePoint Online
2023 Edition
SharePoint Online and OneDrive for Business form a living system, and our documentation lives with it. We intend to refresh the content from time to time. Any errors, omissions, and suggestions for future content should be directed to [email protected] or [email protected].
In addition, the authors have included links to specific areas throughout this document, so viewing it online has an advantage over a printed copy.
1.2 Glossary
The following terms appear frequently in this document. The brief definitions give the reader a quick idea of what they are and make the surrounding context more meaningful. The relevant chapters provide deeper descriptions of these concepts.
Grid Manager – the central controller responsible for deploying and maintaining the SPO service.
EDog, SPDF, Prod Bubble, Prod – deployment rings, starting from the innermost one.
Tenant – represents a customer in SPO, usually a company with a varying number of users.
Content Database – a type of database which stores the metadata of customer content.
OneDrive for Business – a type of SharePoint site specially designed for personal use.
SharePoint Team Site – a type of SharePoint site specially designed for team collaboration.
As of September 2022, the SPO service runs from many datacenters across the globe. It stores more than 2 EB of user content and serves over 250 million monthly active users, with a peak daily RPS of over 4 million.
Tenancies typically have a domain name of the form “companyname.sharepoint.com”. Tenants can also
associate site collections with “vanity” domain names, so that public facing site collections can be
addressed via either “companyname.sharepoint.com” or simply “companyname.com”.
The scale unit of SharePoint content is called a site collection. Each tenancy can have up to 2,000,000
site collections with some exceptions. Tenant administrators can assign users varying levels of access to
each site collection individually.
The Service Level Agreement (SLA) promises 99.9% uptime to customers, with financial penalties if the service falls below that level of reliability. “Uptime” is essentially defined as the service being fully accessible with no reduction in functionality; read-only time is considered downtime. SPO has established a dedicated SRE team as the first line of defense against any service instability. If there is anything the SRE team cannot resolve on its own, area experts from the engineering team are engaged. SPO also has monitors which constantly access synthetic heartbeat tenants in every database and page the OCE when any farm’s availability dips below 98%.
The SPO service builds in many layers of redundancy. Two identical copies of user content are saved to the Azure Blob service in different geographic locations. Databases are also replicated to different Azure availability zones to ensure minimal loss in the event of a disaster. The service builds deployment units in pairs so that user traffic can be switched to a different set of servers when problems occur in one datacenter. The service also keeps data copies that allow customers to request restoration of accidentally deleted data to any point in time in the recent past.
SPO is also working to obtain various certifications, including FISMA, HIPAA, ISO27001, and SAS70. As
part of these certifications, the service must be able to provide audit logs to customers of everyone who
has accessed customer data or Personally Identifiable Information within their tenancy.
1.4 Dependencies
SPO has dependencies on many partner Office 365 and Azure services:
1 Azure Active Directory. AAD is the central Office 365 repository for tenant and user identities. All
information in SPO about tenancies and users is replicated from AAD. AAD also provides
authentication for SPO users. The “Identity” Chapter has more information about this.
2 Azure Front Door. AFD handles chatty HTTP traffic from a server that is very close to the end user, which greatly boosts network performance and reduces latency. The “Routing” chapter has more information about this.
3 SQL Azure DB. SPO stores user content metadata in SQL databases. The “Storage” chapter has more details on how SPO utilizes this service.
It is particularly important that engineers understand the telemetry necessary for their features. As
described in later chapters, engineers are generally prohibited from attaching debuggers in the
production environment. Therefore, all tuning and debugging must be done via telemetry and logging.
For each of the horizontals, designs that conform to existing idioms in the service can be expected to
rely on existing mechanisms. Designs that break existing idioms must account for implementing the
necessary functionality to satisfy the horizontal requirement.
2.1 Secure
Design Point: Assume malicious users. SPO must defend our infrastructure and our customers from
malicious tenants, malicious users, and malicious engineers. We must guarantee to our customers that
no one can access content they don’t have rights to, whether the attacker is an authenticated user or an
unauthenticated user. This must be true within a given tenant and between tenants.
Design Point: Assume that security measures will fail. Malicious users will figure out how to execute
code on front-end servers. Therefore, we must impede the attacker’s progress by running local services
with least-privileges and limit the attacker’s impact by ensuring that the server’s identity can impact
fewer than 1% of tenants. No component of the system should ever fully trust another component in
the system.
Design point: Assume that Microsoft engineers are malicious. Engineers must never have access to
Customer Content without manual approval from ODSP leaders. When approval is granted, it is on a
tenant-by-tenant basis. Customer Content must be encrypted before storage such that engineers with
access to the storage location cannot view clear-text content without going through SPO’s gated access
mechanism. This must be true of all services that handle SPO Customer Content, including partner
teams and Azure dependencies.
Design Point: It is impossible to parse binary files without introducing exploits. All code that parses
binary files must run in a sandbox and should “gatekeep” by rejecting invalid or malformed input where
possible.
Design Point: Never allow users to consume infinite resources. Users may accidentally or intentionally
attempt to consume datacenter resources to induce a denial-of-service condition. No user should ever
be allowed to consume sufficient resources to degrade the experience of other users.
Design Point: Audit everything, both Microsoft actions as well as customer actions. This allows SPO to
meet our compliance requirements and enables customers to perform self-service investigations
without engaging Support. Operations taken by engineers in our system must be approved by a peer or
leader and are always logged and audited.
Design point: Features must meet SPO’s security, privacy, and compliance requirements from day one.
These properties must be designed-in from the start by the feature owner, not bolted-on after shipping
or outsourced to another team.
Design point: The buck stops with us. While we take a One Microsoft approach to leveraging the best
people, processes, and technology to achieve our security goals, the ultimate accountability for
protecting Customer Content rests with us and cannot be outsourced or delegated. Our aim is for SPO to
be the safest cloud for the world’s most valuable data.
Design point: All management operations must be transactional or idempotent. Usually, you cannot make an action involving more than one machine truly transactional. Therefore, prefer to make management code idempotent, meaning that if x is the state of the system, then fn(x) = fn(fn(x)).
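As a minimal sketch of what that property looks like in code (the types and names below are hypothetical, not Grid Manager code), an idempotent management operation inspects the current state and only acts when the desired state is missing, so running it twice yields the same result as running it once:

```csharp
using System.Collections.Generic;

// Hypothetical sketch: an idempotent "ensure" operation converges to the same
// state no matter how many times it runs, i.e. fn(x) == fn(fn(x)).
public static class FirewallConfig
{
    public static void EnsureRule(ISet<string> rules, string ruleName)
    {
        // Inspect current state and only act if the desired state is missing.
        if (!rules.Contains(ruleName))
        {
            rules.Add(ruleName);
        }
        // Running this again is a no-op, so retrying after a partial failure is safe.
    }
}
```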
Design point: Devise a strategy to deal with failures and auto-mitigate wherever possible. Retry is often used as a mitigation, but one must not retry forever. Most failures are intermittent, so retries frequently succeed. However, retrying infinitely is always a bad idea; it creates hangs, consumes resources, and can lead to denial-of-service conditions.
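A bounded retry policy is one common shape for this. The sketch below is illustrative only (not SPO library code): retry intermittent failures a few times with exponential backoff, then give up and surface the error rather than retrying forever.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical bounded-retry helper: a handful of attempts with exponential
// backoff, never an infinite loop.
public static class Retry
{
    public static async Task<T> WithBackoffAsync<T>(Func<Task<T>> operation, int maxAttempts = 4)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // Back off 1s, 2s, 4s, ... so a struggling dependency is not hammered.
                await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt - 1)));
            }
        }
        // On the final attempt the exception propagates to the caller, which can
        // then escalate (alert, fail over, or involve a human).
    }
}
```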
Design point: Be humble about writing automatic failure handling code. Handling failure conditions is incredibly hard to get right, and typically even harder to test. Come up with a failure handling strategy for your feature at the design phase, and keep every strategy as simple as is feasible. Complex failure handling strategies often end up introducing more potential failure points. For major failures that can impact the service, don't be afraid to inject a human into the loop where judgment calls are necessary.
2.3 Deployable
Design Point: Never hardcode configuration settings. The service is deployed in many different
environments, from local-on-machine Sandboxes to EDog to Production to Sovereign clouds. Farm
stamps are different in each environment and will change over time. All configuration-specific
parameters must be read from a deployment configuration file and must never be hardcoded in scripts
or Grid Manager.
Design Point: Write managed jobs wherever possible and keep PowerShell code to a minimum; deprecate existing PowerShell where you can. PowerShell is a scripting language and does not have the support infrastructure of a full-fledged programming language. It can be useful at times due to its late binding but becomes cumbersome when complex logic is needed.
Design Point: Upgrades must involve zero service downtime. We will not take the service offline when upgrading components, though we do allow for limited read-only windows (a few minutes each month).
All components must be capable of upgrading without going offline. All components must deal with
other activity in the service during their upgrade.
Design Point: Upgrade must be fully parallel. It must never be the case that only one of an entity (farm,
VM, network, etc.) can be upgraded at once. We will soon have too many farms in the service to do
anything one at a time.
Design Point: Mixed versions of components are normal. Every component must deal with multiple
versions of related components running at once. Grid Manager and Farm stamps are upgraded at
different times. The farms within the same farm stamps are also upgraded with an intentional time lag
in between. This process can take days to weeks. Every component will run most of the time with mixed
versions of other components in the system.
2.4 Manageable
Design Point: Lights-out management. Do not assume that engineers can intervene in management
activities in the normal course of events. Do not assume any live person is actively monitoring capacity
or reliability. Assume command and control are down during serious incidents. Design mitigations to be
automatic and self-contained.
Design Point: Codify and test service cross-version compatibility contracts. Services are never upgraded
in lockstep, therefore cross-version compatibility must be perfect. The contract must be codified, and a
test architecture must be used to enforce the contract.
Design Point: Critical situations need to raise alerts with appropriate severity levels. Pageable alerts need a high-quality TSG.
Design Point: Use telemetry to understand and optimize the service. SPO has a rich set of telemetry and
logs saved in COSMOS, MDM and MDS. Study both the synthetic test traffic and real user activities.
2.5 Testable
See the Validation section in the SPO Engineering chapter.
2.6 Debuggable
Design Point: Assume all debugging will occur via log inspection. In general developers cannot get
access to machines containing customer data in production. Design your logs such that you will have all
the information you need to debug remotely using only logs.
Design Point: Support dynamic logging levels. When debugging issues, you usually have too much log
data or not enough. Your life will be easier if you can dynamically enable or disable specific component
logging or verbosity without deploying new code.
Design Point: Utilize the sandbox, especially the local sandbox which lives on the engineer’s own machine.
2.7 Scalable
Design Point: Use the content DB. The content DB has a pre-built solution for high availability,
backup/restore, and disaster recovery, along with a solid partitioning story that will work well in SQL
Azure. Do not add a new database role without an overwhelmingly strong reason.
Design Point: Grid Manager must not be critical path for user latency. No code should call into the Grid
Manager in a way that impacts the end user’s perceived latency. Assume Grid Manager is very far away
with high latency (as is the case in all datacenters outside North America).
Design Point: Tenants and data move. As SQL servers, farms, and networks become unbalanced,
tenants’ data will be moved to restore balance. Do not assume data will stay where it is originally
placed, and do not preclude quickly moving data from one place to another. Do not introduce cross-
service dependencies that complicate moving data from one database to another.
Design Point: Be careful with local optimization. Frequently, optimizing one part of the service can cause bottlenecks in other parts of the pipeline. Be sure to keep the overall design of the service in mind when optimizing particular components.
Design Point: When adding a new subsystem, consider designing for large scale and reducing the resource waste caused by overly granular partitions.
Design Point: I/O operations are the most expensive part of storage. Data at rest is significantly cheaper
than data in motion. Reduce SQL I/O to the bare minimum.
Design Point: Don’t raise the cost per user. No single feature may raise the per-user cost of running the
service.
Design Point: Zero cost for inactive tenants. Many of the tenants in the system will do little with it. Drive
the cost of those tenants as close to zero as possible by driving their resource consumption as close to
zero as possible.
Design Point: Prefer failover. If the system is misbehaving, failover by default to a known working
system.
Design Point: Keep the service up at all times. An error response is better than no response. A successful
response is better than an error response. Those signals prove useful in debugging what is wrong with a
service.
Design Point: Dependencies that are key to the operation of the system need a failsafe, e.g., AAD or
DNS. SPO should not go down hard if a partner dependency goes down hard. For dependencies that are
ancillary to the system’s functionality, SPO should continue to work if they go down.
This chapter does not aim to be a comprehensive manual or background on all things Engineering. I
encourage you to bookmark https://aka.ms/spogithelp as your starting point for all documentation on
the SharePoint Online Engineering system as well as how to get started. Along with that, please make
sure you join ODSP Engineering on Teams (https://aka.ms/odspeng), and contact us in the SPO
Troubleshooting channel.
Your primary point of entry for any SPO-specific operations within the codebase should be the dev command. Run dev without any parameters to see a list of what it can do, similar to git.
In ODSP, we are committed to building a highly trusted product: one that is reliable, scalable, and cost-effective. To that end, we have policies in place to ensure that your changes are healthy, and we have mechanisms to control the rollout speed of changes. Read on to learn more about these.
Finally, the Engineering Systems in ODSP are supported by the Engineering Fundamentals team. If you
have other questions, feel free to reach out to us at [email protected].
3.1 Coding
SPO code is in Git, hosted in Azure DevOps. Given the size of our repository, we have the luxury of not needing add-ons such as VFS for Git or Scalar. We just use vanilla Git, so you can use whatever Git-compatible tooling you want.
SPO is built using a mixture of languages: C#/C++/PowerShell for the product; C#, PowerShell, and some Perl for internal tooling.
SPO has first-class support for Visual Studio Enterprise and the CMD and PowerShell command shells.
You are, of course, free to use whatever text editor or Git tools you like.
All coding must be done in user-created topic branches and pushed via a Pull Request.
3.2 Build
3.2.1 Last-Known Good Builds
SPO has a notion of “Known Good” builds. It is recommended to branch from a Known Good (KG) branch so that your build succeeds; main HEAD is also buildable in most scenarios (if main HEAD is not buildable in your scenario, please branch from a KG).
Known Good (KG) Git branches are created several times every day, and are named:
origin/build/main/<checkpoint version>
The Last Known Good (LKG) branch is always named origin/build/main/latest. You can list the current
LKG and previous KG branches by running dev lkg list.
To take advantage of this, you run dev scope add to specify which projects are built locally; additional projects are also built locally if they have local changes or have changes between the local branch and the nearest KG; other projects will be imported. You will want to check with your team to understand which projects to pull in.
If you would like to only build projects added by dev scope add, run dev scope set AddStaleProjects
false. If you would like to know the status of each project, run dev scope status -g and/or dev scope
blame.
This does imply that if you change anything in the NMAKE files, you must re-generate the DScript and
CSPROJ files.
How to build?
Easy!
Note to the reader: this section is pretty dense and contains a lot of jargon. We attempt to define terms
here:
PIP – A PIP (Primitive Indivisible Process) is the smallest unit of work tracked by BuildXL's dependency graph. Generally these are process invocations but may also include other primitives like WriteFile or CopyFile pips, Service pips, and IPC pips.
DScript – The DScript scripting language is used by BuildXL as one of its front-end build specification languages. DScript is based on TypeScript.
MSBuild – MSBuild (Microsoft Build) is a platform for building applications. MSBuild provides an XML schema for a project file that controls how the build platform processes and builds software. Visual Studio uses MSBuild to load and build managed projects, but MSBuild doesn't depend on Visual Studio. The project files in Visual Studio contain MSBuild XML code that executes when you build a project by using the IDE.
DBuild's job is to write DScript (the new name for DominoScript) and to call BuildXL (the new name for Domino) to run that DScript. Initially, dbuild does not know anything about your enlistment, including which projects you are enlisted in. It turns out that scanning all of your source tree to determine what to build is expensive, so there is a process that does this for us (metagraphcreator.exe): it scans your enlistment, finds which projects you are enlisted in, and writes the DScript needed to parse the NMake/Sources files for those projects. Since metagraphcreator takes time to run, DBuild writes DScript to call metagraphcreator and calls BuildXL on that DScript. Therefore, the speed of determining what you are enlisted in, in the median build case, is the speed at which BuildXL can figure out that something has changed.
Once we have called BuildXL for the Enlist Build, we have DScript for each project which knows how to run nmake/msbuild. However, we still don't know what the inputs and outputs to each compile, link, etc. are. So we run nmake/msbuild in a special mode in the metabuild to read the make logic but not execute it. Instead, it writes out what it would do to a file which we call the Static Graph, and we convert the static graph into DScript. That DScript is then executed under a third call to BuildXL, called the Product Build.
The end goal is to build just the PIPs that you have selected by giving arguments to DBuild. Each PIP has tags on it corresponding to the build directory it is a part of and the project it belongs to. BuildXL allows you to filter to specific specs using a boolean string filter. For example, when calling “dev bi”, we will filter to 'dir:d:\SPO\dev\sts' if you build from the d:\SPO\dev\sts directory. Any pip in the d:\SPO\dev\sts\stsom directory would be tagged with 'dir:d:\SPO\dev\sts\stsom' as well as 'dir:d:\SPO\dev\sts' and 'dir:d:\SPO\dev', so building in any of those directories would pick up the pip.
In order to build a PIP in the Product Build, we have to ensure that the DScript is available to read and work from. This means filtering to the pip which produces it in the metabuild, and to any DScript it might need. Metabuild pips are grouped by project, so if you want to compile one file in the STS project, you need the DScript for the STS project, as well as for any upstream projects STS might import from. Therefore, the metabuild is filtered to the projects you care about plus upstream projects. Notice that if you are enlisted in 100 projects but choose to build only the top projects, it will never even look at the NMake/MSBuild build logic for the other parts.
Enlist Build – Determines which projects are in your enlistment. Each pip works on the entire enlistment. There are no filters used.
Ben Witman gave a talk describing the internals of BuildXL, including how cache lookups and sandboxing are done. The recording is here: https://msit.microsoftstream.com/video/9fc52d3a-390d-4e77-b201-e382cfa55408?list=user&userId=2672a5d1-cf09-45ae-96d5-e97a23a2edcd
3.3 Validation
SharePoint Online leverages a “shift left” philosophy (with “left” being closer to the moment of change and “right” being closer to the end of deployment), with the goal of running most validation as a gate in the pull request. SPO.Core supports three main classes of tests. Lower classes of tests are more reliable and faster. Having said that, we acknowledge that lower classes of tests may be harder to write. Given the number of tests we have and the growth we anticipate, this speed and reliability at run time is far more impactful than the added time it often takes to write them. Please try to write as much as possible as unit tests. If you have questions on how to do this, or your area presently does not have any, reach out to the ODSP Engineering Fundamentals team.
L0 Tests – Tests that run rapidly, make calls directly against compiled assemblies, do not require deployment/installation, and have no external dependencies or state. Also commonly referred to as “unit tests”. In SPO, these run in QTest as part of the build. The test harness is MSTest. (This will move to MSTestV2 in CY2023.)
L1 Tests – Tests that execute against compiled assemblies and require SQL state. These do not take external dependencies or require the product to be running. For these tests in SPO, we execute a full dev deployment, because this is how SPO SB state is deployed. These run in CloudTest using MSTestV2 with VSTest.
L2 Tests – Tests that execute against web APIs. These require a full dev deployment and require that the product is running. These may take limited external dependencies. These also run in CloudTest using MSTestV2 with VSTest.
L3 Tests – Tests that execute with full dependencies and a full environment. In SPO these would run against SPDF or MSIT. EFun currently has no support for these, though some teams do run L3 tests that they manage.
L0 tests can be run directly on your build machine. L1 and L2 tests can be run against a deployed
Sandbox.
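As a rough illustration of the L0 shape (the helper and test names below are hypothetical, not real SPO code), an L0 test calls straight into a compiled assembly with MSTest and touches no deployment, SQL state, or external dependencies:

```csharp
using Microsoft.VisualStudio.TestTools.UnitTesting;

// Hypothetical code under test: pure in-process logic with no external state.
public static class UrlHelper
{
    public static string NormalizeSiteUrl(string url) => url.TrimEnd('/');
}

[TestClass]
public class UrlHelperTests
{
    [TestMethod]
    public void NormalizeSiteUrl_TrimsTrailingSlash()
    {
        // No deployment, no network, no disk: just a deterministic assertion.
        string result = UrlHelper.NormalizeSiteUrl("https://contoso.sharepoint.com/sites/team/");
        Assert.AreEqual("https://contoso.sharepoint.com/sites/team", result);
    }
}
```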
1. Code must be tested. Plan for testing as part of each feature. Do this hand-in-hand with the
feature design rather than after the fact to ensure that your design is testable and that testing is
funded. Grafting on testability later is generally considerably more expensive.
2. Tests are treated as product code and should get the same level of care as any other product
code. Tests must be reliable, must be correct, and must be high quality code. Tests must also
have clear owners.
3. Ensure that others can test on top of your feature. If you are building a feature in SPO, you need to test your own feature, but you must also ensure that consumers of your feature can test reliably.
4. Prefer smaller / simpler tests. In general, L0 is better than L1 is better than L2. Smaller tests are
faster and more reliable. Cover most functionality with L0 whenever possible. Fill in gaps and
cover more complex integrations with L1 and L2 tests.
5. Make the product testable. Testability is a hallmark of quality product engineering. Tightly coupled code, long dependency chains, and unstated assumptions/contracts all make it much more complex to test. Testable code is a feature on its own, but it also generally indicates better-engineered code.
Testing is key for product quality, and test reliability is key for engineering experience. Low
reliability tests create pain for every engineer who runs them. EFun tracks reliability and is actively
driving up reliability through flaky test management (including disabling low-reliability tests).
More details and guidance are in the validation wiki. There is also an SPO test authoring training
available, including a recorded presentation.
The SPO Test Authoring channel in the ODSP Engineering Team is also an excellent resource for
writing and supporting tests.
3.4 Sandbox
Sandbox is an internal-facing Azure service that hosts ad-hoc integrated testing environments for over
750 ODSP engineers who depend on it daily to increase their productivity.
Sandbox is a highly available, performant, worldwide service. It is deployed in Azure Functions, built on
top of Azure DevTest Lab and tightly integrated with Azure DevOps. It provides safe environments with
isolated test deployments of services, allowing engineers to develop, debug, and perform integration
testing in a simulated datacenter.
With a Sandbox checked out from Azure DevOps UI, engineers can:
The Sandbox service now provides more than 10 different configs, including a basic SPO-OneBox (all SPO server roles packed into a single VM), a Multi-Geo Sandbox, an SPO-OneBox with Prod AAD, etc., and several recent versions of each config. Additionally, the Sandbox service provides customization capabilities, allowing engineers to provision self-defined or Sandbox-defined artifacts upon checkout.
We recommend using dev pr, which creates a Draft PR first; you can then publish it from Azure DevOps. If you want a “buddy build” that runs the policies but doesn’t ask for reviewer sign-off, you can leave it as a Draft PR.
In addition to direct Pull Request monitoring, EFun also monitors rolling builds and CIT runs for quality.
These provide visibility into engineering blockers and also help pinpoint new issues when they are
discovered. Below are the primary builds and CIT runs:
• YAML CI Build - OfficialCITReporting – This pipeline matches the build in PR. It runs the full
build matrix for SPO.Core, including SideVar. This also runs L0/unit tests in QTest. This is a
batched build on the main branch that runs multiple times per hour.
• Git CIT PR Validation (Official) - BASELINE – This release matches the CloudTest runs in PR. It
runs all L1/L2 tests that are part of PR. This runs every hour using the latest rolling build.
You can find the full list of CIT releases in the Azure DevOps Releases page. Search for “CIT”. There are
numerous CIT runs that cover the Sidevars, search, project, etc.
Currently Dogfood (SPDF) releases are not gated on the rolling build or CIT. They serve exclusively as
advisory build/test runs.
• All rollouts follow ring-based deployment (the rollout audience goes from Office dogfood, Microsoft, … to several waves in production)
• Rollout is gradual, starting with smaller audiences at the beginning (1%, 10%, …) and ramping up as we deploy more broadly
• The rollout policy is enforced in the code. A Standard rollout must take a minimum of 14 days for WW rollout, and a Hotfix rollout must take 48 hours.
• Any deployment action must not impact more than 20% of the ODSP footprint (fault domain)
• Engineers can leverage the override workflow and get GEM approval to deploy faster through the rings
• Rollout from 0% dogfood to 100% Prod takes 14 days at the least (no deployment on weekends)
• Changes deploy from cold to hot instances.
• Leverage QoS signals to auto-stop deployment, where available
• There is visibility into changes deployed in central tools to help SRE and OCE (e.g., the ICM change card)
ODSP changes to the Monolith, Web, Sync Client, Mobile Clients, Grid Manager, and Microservices use different shipping mechanisms, but they all follow the same Safe Deployment Principles.
Rings of Deployments – ODSP phased rollout process starts deployments to smaller audiences before
deploying broadly.
World Wide Production – M365 Public, Sovereign, and Gov customers experience the changes. This ring is further sliced into different stages (aka waves) to reduce the impact of any regression. ODSP customers have the ability to select individual users/tenants to get early access to new features via the M365 Targeted Release program.
Tuesday was chosen as ship day because it is in a sweet spot during the week. Most of the weekend code changes have been committed and have had some bake time in SPDF. Also, there is enough of the work week remaining after the production fork is created on Tuesday to handle any hot Stop-The-Train type issues for the production fork before the weekend. Tuesday is also traditionally the day with the lowest number of code changes committed to main, reducing the chance of merge conflicts or other entropy being introduced in the fork.
Components such as Grid Manager similarly ship to production every week, aligning with, but shipping
independently of, the monolith. There is a desire, for the sake of simplicity and determinism, to ensure
that all components that collectively make up the OneDrive/SharePoint Online service ship weekly, in
alignment with the monolith. This does not include hotfixes – fixes in response to some hot customer
need – which ship faster so that critical incidents can be fixed rapidly.
1. SPO has a LOT of farms. You can filter to a specific farm or set of farms.
2. Status icons – green checkmark = done, blue spinner = in progress
3. Direct link to the AzDO pipeline containing the artifacts being deployed.
4. Due to the way SPO ships, it’s not guaranteed that a single build is the first one deployed to a location that contains the relevant commit – this manifests as a dropdown to let you view all of them.
3.7.6 Kill-Switch
KillSwitch is a lightweight mechanism that can be used to rollback code change if a regression is found. It
addresses a very specific scenario and is not intended to replace the SharePoint flighting system.
• There is no manual deployment workflow; by default, the new code is executed for every
environment.
• If a regression is found, a KillSwitch can be "activated". This causes the old code to execute, which
we assume will revert to a "known good" state. (The older a KillSwitch is, the less valid this
assumption will be.)
• A KillSwitch can be activated at farm level, environment level (all farms within a given environment),
or globally (all farms in all environments). No other granularity (e.g. per tenant) is available.
• After a KillSwitch has been activated/deactivated at a higher scope, it cannot be updated at a smaller scope; e.g., if you have activated it globally, you cannot deactivate it in the EDog environment.
• KillSwitches must NOT stay activated indefinitely; they are a short-term mitigation until a root cause
bug is resolved.
In general, new features and significant code changes should use the flighting infrastructure to gradually roll out. Code changes with high confidence (and which thus can be un-dark by default) but which need to be guarded against unexpected consequences should use a KillSwitch. More details at aka.ms/killswitch.
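A minimal sketch of the pattern follows (the IsActivated check is a stand-in; the real SPO API may differ): the new code path is the default, and activating the switch reverts to the old, known-good path without a redeployment.

```csharp
using System;

// Hypothetical kill-switch guard: new behavior runs by default; activating the
// switch falls back to the previous, known-good behavior.
public class DocumentRenderer
{
    private readonly Func<string, bool> isKillSwitchActivated;

    public DocumentRenderer(Func<string, bool> isKillSwitchActivated)
    {
        this.isKillSwitchActivated = isKillSwitchActivated;
    }

    public string Render(string document)
    {
        if (isKillSwitchActivated("NewRenderingPipeline"))
        {
            return RenderLegacy(document);   // known-good fallback
        }
        return RenderNew(document);          // default: new code is live everywhere
    }

    private static string RenderLegacy(string document) => document;
    private static string RenderNew(string document) => document.Trim();
}
```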
Flighting is recommended for shipping new features dark and can be used to light up a feature at SPO User, SPO Tenant, and SPO Farm scope. Flighting follows all Safe Deployment principles and leverages the associated policies and procedures. Flighting APIs are available in the Monolith, Web, Grid Manager, and other repos across the board, and a Flighting UX and Merlin interface are available for engineers to manage the lifecycle of a flight.
3.7.8 Config
SPO also uses a centralized config system for deploying configs across the worldwide footprint of the service, following Safe Deployment policies. Typical configs span from simple name-value pairs to schematized collections of metadata such as background job registration and scheduling, traffic identification hints, or SKU classifications.
3.7.9 SideVar
SideVar is a mechanism that SPO uses to deploy product-wide changes, including but not limited to large code refactors, .NET framework upgrades, and compiler upgrades. Broad changes like these cannot be put behind a killswitch or a flight; we need a separate mechanism both to roll out these crucial changes and to allow rapid recovery in case of failure. SideVar currently has 4 pre-defined slots for distinct payloads (in addition to the baseline, so 5 total). Payloads are restricted to things that can be modified in the product build itself (i.e., not OS variants or drivers) by changing a build flag. Payloads are then deployed across a thin, but growing, slice of production farms. At any stage in the deployment policy, a payload can be graduated and folded back into the baseline as a regular commit. SideVar includes API monitoring and alerting; any API on a payload that shows a QoS deviation from the baseline of more than 5% gets a Sev3 alert, which goes to the team that owns the relevant payload.
3.7.10 InfraVar
Today a large category of infrastructure or platform changes (scoped to the VM level) are rolled out manually due to the lack of an automated solution. In addition, no good solution existed to deploy changes to the network driver and TCP/IP stack at the virtual machine level. Examples include domainless changes to the SharePoint topology, VHD updates, fleet OS updates, and No-GAC deployments. InfraVariant is the technology that supports VM-based flighting for 100% of SharePoint and Grid VMs. More details are in InfraVariant RunBook.docx.
We created a fixed schema set for ODSP changes and mapped it to FCM’s fixed schema in the SDK. Changes are automatically ingested from registered change sources using the SPOFCM NuGet package.
Progress:
• We can correlate 70% of ODSP changes that cause regressions and display them in the ICM Change card (Site down)
• 95% of changes causing regressions are displayed in the central Power BI dashboard in < 15 minutes (aka.ms/changeoptics)
• We can correlate 3 partner change sources: SQL Azure T-Train, Phynet, and TOR
• Our learnings were presented to the Azure LT (Jason Zander’s team), and we are working with the FCM team to drive this to become #OneMicrosoft tech (pushing for this to be part of the M365 Unified Fabric Effort (UFF) workstream)
Racks are purchased as part of a Zone. A zone usually consists of a fixed number of racks connected via an Agg router. Zones are clustered in colocation centers (colos) within the datacenter. A colo is a set of racks that can be physically cabled together to provide connectivity without going through datacenter switches or routers; intra-colo connectivity is significantly better than inter-colo, which in turn is better than inter-DC. Therefore, the SPO service requires its racks to live in the same DC as the Azure services it depends on. The datacenters with the largest SPO footprint include San Antonio, Dublin, Amsterdam, Sydney, and many more. SPO is also starting to have a presence in smaller regional datacenters such as South Korea, Qatar, Switzerland, etc.
Every machine is connected to top-of-rack switches (TORs). Racks are networked together by redundant L3 aggregation switches. Load balancing is provided by third-party hardware load balancers running in active/passive pairs. These devices are primarily used to direct inbound connections from customers to the correct SharePoint farm.
Machines are assigned to virtual LANs (VLANs). Machines from the same zone can be assigned to one or multiple VLANs. Physical machines and the VMs they host must be on the same VLAN.
Latency within the datacenter is largely sub-millisecond. However, bandwidth decreases as the network
distance between machines increases. Machines on the same rack have the highest bandwidth
connection, then machines on different racks but in the same colo, then machines in the same
datacenter but different colos, and lastly machines in different datacenters have the lowest bandwidth
connections.
Most of the time, user requests go through the Microsoft Azure Front Door service before reaching the SPO web servers. AFD provides a performance boost for the user request. Once it reaches the SPO farm, the request is processed by a set of dedicated servers called USRs. These web servers call into other dependent services such as Azure Redis Cache, SQL Azure, or Azure Storage to obtain the information necessary to serve the user request. Once a USR gathers all the information required, it assembles it into an HTTP response and returns it to the customer.
Once user content is stored in the SPO service, many internal workloads process that content. For example, the Search service extracts keywords and builds search catalogs out of it. The MeTA service builds image thumbnails and video previews for customers to consume in the future. Various jobs run on the BOT machines at preset time intervals.
All physical machines and virtual machines are joined to a few Active Directory domains, such as the YLO001 domain. These AD domains are different from the AD forests used to store tenant accounts, which are called SPODS domains. A two-way trust is established between the resource domain AD servers and each SPODS AD server.
A VM role refers to a type of VM in SPO which performs a certain set of tasks. Usually, multiple VMs of
the same role are provisioned in a farm as a scale out measure and to provide redundancy. The most
common roles in SPO are listed below.
VM Role Description
USR Responsible for serving customer requests
BOT Responsible for running internal workloads such as search crawl and timer jobs
GFE Front end server of the Grid manager service, also runs jobs
SPD AD machine, part of the SPODS AD forest
DSE Local DNS server
DFR DFSR server for infrastructure content replication
WSU Windows Update server
WDS Windows Deployment Server, used for PMachine imaging and IP address allocation
(DHCP - Dynamic Host Configuration Protocol)
MOR Debug boxes
TMT The VM which facilitates communication from Merlin
SPODS Active Directory farm that SharePoint authenticates and authorizes against.
Stamps are the atomic unit of scalability in SPO. Each stamp consists of a content farm and its associated
SPODS farm. SPO strives to create uniform sized stamps with a few exceptions. A standard stamp
typically consists of 160 USR, 65 BOT and 5 SPODS. Those numbers are subject to change as the
hardware SKU changes.
A zone is SPO’s hardware purchase unit. A typical zone contains 8 racks and can fit 6 standard-sized stamps. A large datacenter can host tens of SPO stamps, while a small datacenter may contain only one SPO stamp.
A typical SPO stamp can host more than 10,000 tenants. A stamp can scale out its storage to as much as the Azure platform can support. However, it can run out of web server processing power due to the fixed size of the standard farm configuration and the physical limit of the zone. When a stamp is close to full, SPO starts moving tenants out of that stamp into another.
Bearing the multi-region high availability and scalability design principles in mind, Grid Manager has the high-level architecture illustrated in the diagram below. Using DNS and load balancers, the Grid Manager frontend consists of a few active-active compute-only farms that serve traffic across multiple regions. Under the hood of the backend DNS (a SQL failover group), the Grid Manager backend comprises the primary, disaster recovery (DR), and standby SQL farms. At any given point in time, the read-write primary SQL serves most of the traffic to guarantee strong data consistency, and the DR SQL serves only low-priority read-only traffic. Continuous geo-replication is set up from the primary to the DR and standby SQLs to provide the required geo-redundancy for business continuity; the SLA is backed by Azure SQL RTO and RPO.
• Work Engine
o Job Agent
o Managed Jobs
o Job Scripts
• Infrastructure Managers
o Topology Manager
o Machine Manager
o Database Manager
• Secrets Manager
• Tenant Information and allocation
• Change Tracking
All Grid Manager APIs should be idempotent. For example, there is no AddTenant API in the Tenant
Manager. Instead, there is an UpdateTenant API. The first time UpdateTenant is called for a given
tenant, the web service “updates” the tenant into existence. Subsequent calls change the properties of
the tenant. Therefore, multiple calls to UpdateTenant produce a converging result, regardless of success
or failure of individual calls.
This infrastructure provides good flexibility and cross-version capability, easy ability to interoperate with
all kinds of other services and calling agents, and easy debugging. The various web methods in the APIs
follow two main patterns: Gets and Updates. Gets fetch information. Updates can be more complicated
since the work to carry them out can take some time. Most Update methods take an object as the input
and return the same object as the output. The output object would return the current state of the
underlying object in the database, potentially differing from the input if the Update changed some
properties. For example, record IDs calculated by the GridManager will be filled in. Update methods are
used for both initial object creation and for subsequent updates. Therefore, callers can simply specify
the final state they want the object to be in and don’t need to keep track of whether the object already
exists or not.
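A rough sketch of this shape is below (illustrative only; the real Grid Manager contract differs in detail): the caller specifies the desired final state, and the service returns the current state after converging, so repeated calls are safe.

```csharp
// Illustrative only: an "Update" that creates the object if it does not exist
// and otherwise converges its properties to the requested state.
public class Tenant
{
    public string Name { get; set; }
    public string Region { get; set; }
    public int? Id { get; set; }   // assigned by the service on the first call
}

public interface ITenantManager
{
    // Calling this repeatedly with the same desired state converges to the same
    // result, regardless of whether earlier calls succeeded or failed.
    Tenant UpdateTenant(Tenant desiredState);
}
```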
The service schedules jobs using an algorithm that provides a limited degree of fairness between
networks. Because most jobs scheduled for execution in the Work Manager require resources within an
individual farm, ensuring effective concurrency across all networks leads to an improvement in service
efficiency. The Work Manager creates a thread pool in the W3WP.EXE process on each Grid Manager front-end (GFE). Each thread sits in a loop where it dequeues a job, finds an acceptable (target) machine to run it on, and assigns the job to that machine. The job agent service on the target machine acknowledges the assignment, downloads all files needed to execute the job, creates a new process for the job, and executes it. The job agent maintains a thread on the target machine that monitors the local job until it finishes.
Jobs are associated with a specific version of a job package. Therefore, per build copies of the scripts are
deployed in side-by-side directories in each datacenter. Most jobs will run the latest installed version
(Get-GridJobPackageVersionMap -LatestVersionOnly) but jobs in the LegacySPOJobs package commonly
pick the version of the script to execute based on the associated VM.
When a job fails, or the allotted time for the job expires without completion, the Work Manager will
typically retry the job, starting the job on the last step that was run. The Work Manager therefore relies
on jobs being idempotent. If the job repeatedly fails over an extended period of time, it will be
suspended. When a job is suspended it is no longer executed and is put on a list of jobs for the
engineering teams to examine.
Each job has an associated job type, which is a 16-character maximum name. For Powershell jobs this
matches the name of the script file. For managed jobs this is the name of the class or can be overridden
in ManagedJobOptionsAttribute. If the length exceeds 16 characters, it is truncated. Joblets are the third
type of Grid job. Internally this is a special type of Powershell job (InvokeJoblet) that hosts the Joblet
framework and exposes its own job-like API.
Jobs run 100% in job agent mode, meaning every job must have a target that identifies how to run the
job. In many cases the job system can infer the target based on the input parameters if the target is not
specified explicitly. There are three types of execution targets:
1. Machine. These always target a single machine (physical machine or virtual machine). The job
will always run on the specified machine. These types of jobs generally do things like system
configuration, troubleshooting/debugging, and installing or upgrading a component on that
machine.
2. Logical resources. These jobs operate against a particular resource, for example database,
tenant, or farm. The job framework will automatically select the appropriate machine targets
and load balance.
3. GFE. Jobs here generally fall into two categories: Orchestration jobs, which simply stated are
jobs that run a train or operation on a particular farm or zone resource. They create child jobs
on the child targets and monitor their progress. These orchestration jobs generally do not
interact with anything other than the GM web services and do not need to run on a specific
target. In a sense the GFE target functions as a default host for jobs. The second category is
high-privileged jobs. Jobs running on GFEs can access sensitive secrets that are not accessible
from other targets. For example, Azure resources are generally provisioned from the context of
a GFE so that the Azure admin credentials are not exposed to the content farm.
Jobs always run in domainless mode. In domainless mode, jobs run as a virtual account that is an administrator on the local box. The job uses certificate identities when communicating with GridManager (GM) and SharePoint. In domainless mode it is still possible for the machine to be domain joined, in which case the computer account (ComputerName$) is used for off-box calls when interacting with domain resources.
There are several best practices for writing Grid Manager jobs:
The Topology Manager is used to keep track of the high-level logical records of the service. This includes
the region or geography of the hardware, the datacenter, the zones and networks, and the farm objects.
Farms belong to a network; networks belong in zones and zones reside in a datacenter, and each
datacenter lives in a region.
The Machine Manager keeps track of the machines, both physical and logical. It records the physical
machines and their physical groupings in datacenters, the groupings of virtual machines into farms and
farms into stamps. The Machine Manager typically does not have an understanding of the specific
functions of each machine in the topology beyond its role; see section 3.4 (Virtual Machine Topology)
for more information about roles. With a few exceptions, virtual machine configuration scripts are the
only entities in the service that understand what services each role actually runs.
During deployment, the Machine Manager is responsible for picking the p-machine to host a new v-machine. The v-machine placement logic picks the host by first applying a set of constraints to reduce the set of possible p-machines, and then applying a ranking order to pick the optimal one. The logic emphasizes the high availability of v-machines by ensuring that any two VMs of the same role are spread across failure domains, typically server racks. This ensures that when all machines in a single failure domain fail, a farm never loses all machines of a given role. The logic also attempts to maintain fast network connections among the v-machines in the farm.
In a similar fashion, the Database Manager picks the appropriate SQL Azure location for a new database. This logic is based on the utilization of the SQL servers in the stamp and the expected IO and capacity needs of the new database.
When tenants are created, they are placed in datacenters in the region requested by the user when
creating the tenant as part of the Office 365 offering. Within the datacenter they are placed on any farm
that is marked in the Grid Manager as being open to new tenants. Farms are closed when they are full,
or because there is another reason that Ops has decided to not place new tenants in that farm such as
an impending upgrade.
Placing tenants is complicated by the variety of sizes of tenancies, from a few users to 20,000, and by
the way in which new tenants appear. When a new tenant appears, there is little or no information on
their eventual size. Therefore, placement decisions are made before the service understands how large
the tenancy will get. This leads to conservative decisions about when to close farms.
Certificates are the preferred way to communicate between servers since they are more reliable and
easily revocable if the secret is compromised.
For jobs, job authors must notify the system at job creation time which scoped secrets are needed. The secrets for jobs are specially encoded and have only a five-minute lifespan for retrieval during job execution. A job asking for the secrets after five minutes will get an error.
The diagram above briefly shows our current Grid Manager traffic workflow. In each Grid Manager farm we have an independent pair of physical load balancers, which route the incoming traffic to GFEs in round-robin order. Our local DNS server stores all farm load balancers’ IP addresses, so that when a client calls Grid Manager via CNAME (gme2.yloppe.msoppe.msft.net, for example), it will reach one of the farms. If for some reason the farm is not available, the client can still use the service by re-querying the DNS server and calling a separate Grid Manager farm.
A load balancer monitor is in place to check each GFE’s health. Every 5 seconds the load balancer sends a simple HTTP GET request to each GFE. If a GFE does not respond, the load balancer automatically takes it out of traffic so that clients will not reach any bad GFEs. Whenever the GFE is healthy again, it starts taking traffic automatically. If the monitor detects too many unhealthy GFEs, it triggers an alert so that our on-call engineer can start investigating the issue.
At the SQL layer GridManager uses a third read-only replica in addition to the primary/failover.
The GFEs use a pluggable data cache interface that supports any serializable object type and multiple caching options. The current implementation uses an in-memory cache, Azure Redis Cache, and local disk. The Redis cache layer allows the GFEs to share cached data across several machines while avoiding potentially high latency for cross-farm/region SQL queries. If for any reason Azure Redis Cache is down, the GFE falls back to a local in-memory cache temporarily and switches back to the Redis cache whenever it is available. Objects like secrets have the option to use only in-memory caching with encryption. Currently, clients such as jobs use an in-memory local cache and can tell the server which cache the call prefers, such as the GFE cache or the SQL replica.
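A simplified sketch of that fallback behavior follows (assumed types and delegates; not the actual GFE cache code): try the distributed layer first, fall back to a local in-memory cache when it is unreachable, and pick the distributed layer back up once it recovers.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical layered cache: distributed (e.g., Redis) first, local in-memory
// as a fallback when the distributed layer is unavailable.
public class LayeredCache
{
    private readonly Func<string, string> tryGetFromDistributed;   // may throw when the layer is down
    private readonly ConcurrentDictionary<string, string> local =
        new ConcurrentDictionary<string, string>();

    public LayeredCache(Func<string, string> tryGetFromDistributed)
    {
        this.tryGetFromDistributed = tryGetFromDistributed;
    }

    public string Get(string key)
    {
        try
        {
            string value = tryGetFromDistributed(key);
            if (value != null)
            {
                local[key] = value;   // keep the local layer warm
                return value;
            }
        }
        catch (Exception)
        {
            // Distributed cache unreachable: serve whatever is cached locally.
        }

        string cached;
        return local.TryGetValue(key, out cached) ? cached : null;
    }
}
```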
5.2 Deployments
5.2.1 Upgrades
• Preloading – all of the files and other artifacts needed for upgrade are copied onto each VM in the farm. This is done without taking the VM out of service rotation, as it is minimally impactful to the operation of the VM.
• InPlaceUpgrade (IPU) – also known as patching. In this phase, the VM is taken out of service, new binaries are copied into place, and the VM is restarted and placed back into service. This is done on small batches of VMs on a rotating basis, during off-peak hours, to minimize operational impact.
• Full-Uber – during this phase, all of the SharePoint upgraders and other maintenance processes are run. This is where SharePoint objects and configs get updated.
• DOU (Database Only Upgrade) – this is the final phase of upgrade, where the SQL databases for the farm are upgraded. In this phase, schema changes, SPROC updates, and all other SQL-specific changes are rolled out.
The Security Patch orchestrator runs once each month, starting on the 2nd Tuesday of the month (“Patch Tuesday”), and hits 95% saturation in around 14 days, with full fleet coverage in 28 days.
The Update Monitors orchestrator runs every 2 weeks and delivers monitoring and other operability upgrades to the fleet.
These orchestrators run on a Zone basis, as they update both Physical and Virtual Machines.
In all cases, a customer’s request is turned into one or many HTTP requests by the app being used. What happens next is:
1. The app does a DNS query on the customer’s domain name and gets an IP address which points either to Microsoft Azure Front Door (AFD) or to the endpoint of an SPO content farm
2. The HTTP requests are sent via an SSL connection to the endpoint
3. If the endpoint is AFD, the request will be forwarded to the correct SPO content farm via a warm connection. AFD acts as a reverse proxy and relays the response back to the customer
4. The content farm load balancer receives the request and picks one USR to serve it
5. The USR machine terminates the SSL connection and calls the Config/Sitemap service in the farm, which maps the URL to a database
6. The USR machine does a series of queries against the database and, if necessary, also calls the Azure Blob service to load data. The USR machine then assembles the response and sends it back to the customer via Direct Server Return.
The next few sections will examine these steps in more detail.
1 Customers in general prefer a simpler DNS structure. The complex CNAME chain of SPO DNS records is something we are looking to simplify in the future.
[Table: example DNS resolution chain for a tenant domain, ending in an A record of 13.107.138.8]
Each row in this table represents a step in resolving the domain name to its destination. A CNAME record maps one domain to another. An A record maps a domain name to an IP address. The TTL column stands for “Time To Live”; it specifies how long client computers can assume a given record will remain constant. Typical TTLs range from a few seconds to a few hours. Usually, SPO sets a relatively high TTL value if the record is not expected to change often, while a low TTL value indicates the record is expected to change frequently. For example, records no. 4 and 5 in the above table show domain names whose values change during a “failover”, which SPO uses to recover from any farm, database, or network outage. Once this happens, the customer needs to be aware of the new destination ASAP. Hence such a record should be kept for no more than one minute in the local DNS cache.
Records 1-2: These records consolidate the various SPO services, such as ODB, Tenant Admin, and Team sites, into one canonical domain name. These records are created at tenant provisioning time and are hosted in the Azure DNS service.
Records 3-4: These records first redirect an individual tenant record to a clump of databases, and then further redirect to the farm hosting that database clump2. These records help facilitate database- or farm-level failovers. These records are also hosted in the Azure DNS service.
2 The database clump concept is explained in more detail in the “Storage” chapter.
The overall SPO DNS hierarchy and how it integrates with AFD (Edge) is illustrated in the figure below.
3 Sometimes, it is preferable that the traffic does not go through AFD. One such example is a microservice calling SPO from a location which is closer to the farm itself than to the Edge box.
Tenant-level DNS records dictate which endpoint a customer connects to – AFD or Direct. By default, most customers are routed to AFD endpoints for better performance (TCP proxy).
Every SharePoint FQDN is resolved by DNS to an A or AAAA record, which represents the IP address of the load balancer devices. Each content farm utilizes a hardware load balancer device (NLB) to accept incoming requests. Most of the NLBs used in SPO farms today are manufactured by F5 Networks in the BIG-IP product family. The NLB maintains a list of available content front ends (USRs) for each farm, called a pool. The NLB performs load balancing at L4 (TCP) to USRs via a basic round-robin algorithm (non-persistent): when an inbound request arrives, the NLB selects the next USR from the pool and routes the request to it. The NLBs are not programmed to enforce any user, tenant, or session affinity, so requests from a single user session could potentially be routed to different USRs within the content farm. TCP and SSL sessions are terminated on the USR VMs; there is no TCP tuning, authentication, or certificate handling on the NLB. The NLB probes VM health via Circuit Breaker monitors. The return traffic bypasses the NLB, also known as Direct Server Return. More information about SPO DNS routing can be found here.
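As a minimal illustration of the non-persistent round-robin behavior described above (the names and pool handling are assumptions; the real device is an F5 BIG-IP, not code like this):

# Sketch: L4 round-robin USR selection with no user/tenant/session affinity.
from itertools import cycle

class UsrPool:
    """Round-robin over the USR VMs in a content farm's pool (no affinity)."""
    def __init__(self, usr_addresses):
        self._round_robin = cycle(list(usr_addresses))

    def pick(self):
        # No user/tenant/session affinity: two requests from the same session
        # may well land on different USRs.
        return next(self._round_robin)

pool = UsrPool(["10.0.0.11", "10.0.0.12", "10.0.0.13"])  # illustrative addresses
print([pool.pick() for _ in range(5)])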
2. The traffic goes to the GMR/L3AGG router directly and does not use any SNAT address. As a result, such traffic can only be routed within the Microsoft network. Traffic which takes this path typically reaches one of the following destinations. This route is shown by the blue arrows in the graph
a. Azure Storage/SQL/Redis for Content farms
b. SPO Backend blocks/IPs (server-to-server)
c. SPO MGMT blocks/IPs (DPROD MGMT)
d. Secure Workload Environment (SWE)
e. Time targets (NTP)
3. The return traffic for end user requests goes to the router directly. It uses the Direct Server Return configuration on the web server so that the return packets do not go through the network load balancer.
One important reason there is more than one outbound routing path is that the network load balancer has limited bandwidth; it would quickly become a bottleneck if all the outgoing traffic were to go through it. The L3AGG router has a much higher bandwidth than the network load balancer.
AFD uses anycast to route end users' traffic to the closest web server farms. Many SPO customers have a global presence, and their users are located across many countries. End users connect to the closest AFD environment.
When an AFD web server farm receives an end user request, AFD forwards the request to the SPO content farms. AFD allows the network connection to recover faster when there is packet loss on the end user connection, and it improves the overall throughput of file operations.
Most requests going through AFD are forwarded to SPO without being cached. Some content requests can be cached for a limited time, and AFD returns the cached content without forwarding the request to SPO. SPO uses signed parameters in the URL to control whether content is cached and for how long. SPO signs the URL parameters using its private key, and AFD uses the public key to validate that the parameters were signed by SPO.
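The sketch below illustrates that sign-then-verify pattern using the Python cryptography package. The parameter names, key management, and signature encoding are illustrative assumptions, not the actual SPO/AFD implementation.

# Sketch: SPO-style signed URL parameters. SPO signs with a private key and AFD
# verifies with the matching public key before honoring cache directives.
import base64
from urllib.parse import urlencode, parse_qsl
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# Assumption: in reality SPO holds the private key and AFD holds the public key.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

def sign_params(params: dict) -> str:
    # Canonicalize the parameters, sign them, and append the signature as "sig".
    payload = urlencode(sorted(params.items())).encode()
    signature = private_key.sign(payload, padding.PKCS1v15(), hashes.SHA256())
    return urlencode({**params, "sig": base64.urlsafe_b64encode(signature).decode()})

def verify_params(query_string: str) -> bool:
    params = dict(parse_qsl(query_string))
    signature = base64.urlsafe_b64decode(params.pop("sig"))
    payload = urlencode(sorted(params.items())).encode()
    try:
        public_key.verify(signature, payload, padding.PKCS1v15(), hashes.SHA256())
        return True
    except Exception:
        return False

query = sign_params({"docid": "abc123", "cacheTtl": "300"})  # illustrative parameter names
print(verify_params(query))  # True: the cache directive can be honored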
Based on the current configuration, the Akamai GTM entry conditionally tells the client a DNS entry that routes the traffic either directly to the farm or to AFD. The table below shows the two options for the GTM DNS entry, conditionally returning a destination of either the farm direct endpoint or the AFD endpoint.
For regional failover, SPO defines which countries return farm direct and which countries return AFD. For example, if SPO knows the United Kingdom's AFD nodes cannot route traffic, SPO would do a regional failover of the United Kingdom to allow United Kingdom users to connect directly to the farm, while allowing the rest of the world to continue to connect through AFD.
For AFD bypass, SPO has defined an AFD bypass Classless Inter-Domain Routing (CIDR) map. Based on the IP subnet of the Azure network, the Azure DNS servers return a farm direct endpoint for server-to-server traffic.
For example, the meta microservice is used to process different encodings of video files, and the push microservice is used to process notifications to the end client.
Each microservice has its own DNS domain and server capacity in Azure. The end user authenticates against each microservice endpoint, and that authentication then carries over to SPO to access the user's data.
The easiest way to understand the USR role and the flow of a request is to walk through what happens
during a request. Some of these areas will be mentioned here and covered in more detail later in the
document.
Requests are processed in a series of event callbacks between the layers of the architecture. After requests are dequeued from the asp.net queue, a series of SharePoint and platform modules are called to handle these events, after which execution returns to the underlying platform to handle core functions such as serving file contents. A small portion of the file contents are served from static files that live on the USR role, such as those under the _layouts folder. But the vast majority of files are served as virtual files, where the content is read from SQL and the Blob service and assembled on the fly.
Figure 7-1 shows a high-level relationship of the components running on a USR machine, where the
system components and SPO components are depicted in different colors. Figure 7-2 shows more details
of the important steps happening in the Core SharePoint module (SPRequestModule)
• Rewrites URL paths for routing to the correct ASP.NET handler. With some exceptions (static files, SOAP POST requests, etc.), this will usually be the native code handler that serves content out of the SharePoint Content Store
• Authorizes the request
Partners can help identify themselves and their intent when making requests by following the request attribution guidelines found at RequestType Dev Design.docx (sharepoint-df.com).
At a very high level, various workloads are classified as follows for the purpose of health protection throttling (a classification sketch follows the list):
Minor – Background activity that is least time sensitive: Backup application requests, Data Loss Prevention (DLP) application requests, internal system jobs processing async work units, and specific scenarios from 1P apps such as sync verification or profile photo refresh, etc.
Major – Relatively less time-sensitive end-user traffic: Migration application requests, OneDrive Sync client uploads and downloads, camera roll backup from OneDrive mobile clients, etc. All end-user traffic is by default classified in this category.
Critical – Any load, including time-sensitive traffic impacting the end user experience: any and all API calls in time-sensitive user-interactive flows, for example opening a file in Office Online, or on-demand download of a placeholder file by the OneDrive sync client when the user double clicks on a placeholder file stub.
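A rough sketch of how such a classification could look in code; the class names come from the list above, while the matching rules and attribution values are simplified assumptions.

# Sketch: mapping workloads to health-protection throttling priorities.
# Classification names come from the list above; the matching rules below
# are simplified illustrative assumptions.
from enum import IntEnum

class ThrottlingPriority(IntEnum):
    MINOR = 1     # background, least time sensitive (backup, DLP, async system work)
    MAJOR = 2     # less time-sensitive end-user traffic (migration, sync uploads)
    CRITICAL = 3  # time-sensitive user-interactive flows (open in Office Online, on-demand hydration)

def classify(request_attribution: str, is_user_interactive: bool) -> ThrottlingPriority:
    background = {"BackupApp", "DlpApp", "SystemAsyncWork"}  # illustrative attribution values
    if request_attribution in background:
        return ThrottlingPriority.MINOR
    if is_user_interactive:
        return ThrottlingPriority.CRITICAL
    # All other end-user traffic defaults to Major.
    return ThrottlingPriority.MAJOR

print(classify("BackupApp", False))        # Minor
print(classify("OneDriveSync", False))     # Major
print(classify("OfficeOnlineOpen", True))  # Critical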
The health protection loop for each VM works as follows:
1. Run health validation checks/probes against the VM every 5 seconds
2. Calculate a weighted score based on the individual health probe results
3. Push this data into a Redis-based priority queue
4. Take healing actions on the worst health machine (if allowed by the SafetyNet threshold)
5. Run warm-up probes before bringing the machine back into rotation
SP ping - Tests a custom HTTP method that validates select system-level state from within the Content app pool.
HomePage Probe - Basic SharePoint page validation that ensures the VM can serve basic SharePoint page content.
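A minimal sketch of the probe-scoring step in that loop; the probe names come from the list above, but the weights and the Redis wiring are illustrative assumptions.

# Sketch: compute a weighted health score from individual probe results and
# rank VMs so the worst machine can be healed first. Weights are assumptions.
import heapq

PROBE_WEIGHTS = {"sp_ping": 0.6, "homepage": 0.4}

def weighted_score(probe_results: dict) -> float:
    # probe_results maps probe name -> 1.0 (healthy) .. 0.0 (failing)
    return sum(PROBE_WEIGHTS[name] * value for name, value in probe_results.items())

def worst_first(vm_probe_results: dict) -> list:
    # In production this ranking lives in a Redis-based priority queue;
    # here a simple heap keyed by score stands in for it.
    heap = [(weighted_score(results), vm) for vm, results in vm_probe_results.items()]
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]

ranking = worst_first({
    "USR-01": {"sp_ping": 1.0, "homepage": 1.0},
    "USR-02": {"sp_ping": 0.0, "homepage": 1.0},  # failing SP ping -> healed first
})
print(ranking)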
SharePoint Online, like all other M365 Services, uses AAD (Azure Active Directory) as the identity
provider. What this means is that every M365 service doesn’t need to provide a redundant sign-in
experience and can instead rely on AAD’s identity platform and sign-in experience. All tenants and their
associated users as well as licenses are centrally stored in AAD. Other M365 Services can then redirect
all sign-in requests to AAD. And with the incoming token, they can validate the identity in the context of
the tenant. This also implies that M365 services sync these objects from AAD to their own service farms.
In SPO’s case, we sync these objects into our local directory (SPODS as well as SQL SPODS) for faster
access and for reduced calls (less COGS) to AAD’s directory.
There are two major authentication flows – User sign-in, and Service-to-Service calls.
8.1.1 Browser
For the browser sign-in, when SPO detects that there is no FedAuth cookie in the incoming request, it
redirects the user to AAD. At this point, AAD may prompt the user for credentials if the user hasn’t
signed in recently to any of the M365 Services. In some cases, depending on the tenant configuration,
AAD itself may not prompt for a sign-in but redirect the user to the tenant’s own on-premises Active
Directory service (ADFS) for authentication.
As mentioned above, SPO uses the OpenID Connect protocol (Final: OpenID Connect Core 1.0
incorporating errata set 1) and requests AAD for an id_token as well as a code. When the user has
completed authentication at AAD, they are redirected by AAD back to SPO with the artifacts (id_token,
code). SPO then verifies the identity of the user in the incoming id_token by checking this identity in its
local directory (SPODS) and confirming its existence.
If this is successfully confirmed, SPO completes the rest of the sign-in by generating an internal token
(for our own book-keeping), storing this token in the local memory cache as well as distributed cache,
and most importantly, generating a cookie called FedAuth which is a signed representation of some
critical claims. This cookie is issued for the tenant's domain, say, contoso.sharepoint.com. Henceforth, as the user interacts with the site, the browser sends this FedAuth cookie; SPO quickly validates the cookie's signature, finds the internally stored token mapped to this cookie in the cache, sets the user's identity on the thread, and execution moves on to the page or resource requested.
Say you are in the Outlook Web Client (OWA, purely as an example, and not exactly the way OWA works
but other clients work this way) connected to the Exchange service. You create a new message to a
colleague and wish to attach a file from your OneDrive. What the Exchange service would do is make an
S2S call to SPO for the user’s OneDrive contents and then show these files to the user by returning them
to the OWA client.
S2S calls happen with one of two identities: a) an app+user identity containing both the app identity and the user identity, or b) an app-only identity, in which case the user identity is not present in the call at all. The latter is a highly privileged call and is not recommended due to the security concerns with this pattern. However, if this pattern is unavoidable, you will have to get the appropriate permissions (say Sites.Read.All) from AAD, and then SPO will additionally grant granular permissions (aka logical permissions) to specific APIs within SPO (instead of all APIs) to further secure this model. We have an onboarding and maintenance process for the 45+ M365 partner services that we serve.
S2S calls can also be looked at through the incoming/outgoing pivot, i.e. whether the S2S call is incoming into SPO from another service, or outgoing from SPO to another service. See ODSP Auth Patterns (sharepoint-df.com).
where <token> is a JWT token containing the pft claim (which contains the actual user token issued by AAD) and the at claim (which contains the actor token issued by AAD for your app). The JWT token is signed by your own certificate (hence POP, i.e. Proof of Possession of this certificate).
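A sketch of assembling such a token with PyJWT; the pft and at claim names come from the text above, while everything else (lifetime, signing details) is an illustrative assumption.

# Sketch: build an S2S proof-of-possession (POP) token. The JWT carries the
# user token in the "pft" claim and the app's actor token in the "at" claim,
# and is signed with the calling app's own certificate key (hence "POP").
# Lifetime and signing details are illustrative assumptions.
import time
import jwt  # PyJWT

def build_pop_token(user_token: str, actor_token: str, app_private_key_pem: bytes) -> str:
    claims = {
        "pft": user_token,    # user token issued by AAD
        "at": actor_token,    # actor (app) token issued by AAD
        "iat": int(time.time()),
        "exp": int(time.time()) + 600,
    }
    # Signed with the app's own certificate private key, proving possession of it.
    return jwt.encode(claims, app_private_key_pem, algorithm="RS256")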
The old pattern without POP is not recommended or supported for new scenarios: Authorization:
MSAuth1.0 PFAT
Also not supported for new scenarios are ACS legacy protocol, as well as AppAssertedUser or
ServiceAssertedApp in EVOSTS protocol.
The 1st prong is to avoid going to AAD as much as possible. We do this in 3 ways:
a. For Actor tokens that are needed for outgoing calls from SPO to other services, we not only
cache these tokens (24h lifetime), but also proactively fetch them when they have less than 12
hours lifetime left. This has allowed us to have a 95%+ cache hit rate and allows us to survive
AAD outages that are up to 4 hours long (AAD outages are usually resolved in much less than 4
hours)
b. For access tokens used by native/mobile clients, our support for Continuous Access Enforcement has meant that these clients receive tokens with a 24-28h lifetime, compared to the 1-hour lifetime tokens used previously. Our innovation with CAE helps greatly during an outage because of the longer-lived tokens used by the clients.
c. For browser sessions, we take a more lenient stance during an outage, and provide forgiveness
for expired sessions if they expired in the last 48 hours. This forgiveness allows us to serve the
browser sessions without redirecting the user to AAD during an AAD outage. We forgive about
1 million expired sessions for every hour of outage in AAD.
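A sketch of the proactive-refresh cache described in (a): tokens live 24 hours and are refreshed once less than 12 hours remain. The fetch function and storage are stand-ins, not the real implementation.

# Sketch: actor-token cache with proactive refresh. Tokens live 24h; we refresh
# any token with < 12h left so an AAD outage of a few hours is survivable.
# fetch_token_from_aad is a stand-in for the real token acquisition call.
import time

TOKEN_LIFETIME = 24 * 3600
REFRESH_THRESHOLD = 12 * 3600

class ActorTokenCache:
    def __init__(self, fetch_token_from_aad):
        self._fetch = fetch_token_from_aad
        self._cache = {}  # resource -> (token, expires_at)

    def get(self, resource: str) -> str:
        token, expires_at = self._cache.get(resource, (None, 0))
        if token is None:
            token, expires_at = self._refresh(resource)        # cache miss: must call AAD
        elif expires_at - time.time() < REFRESH_THRESHOLD:
            try:
                token, expires_at = self._refresh(resource)    # proactive refresh
            except Exception:
                pass  # AAD outage: keep serving the cached token until it truly expires
        return token

    def _refresh(self, resource):
        token = self._fetch(resource)
        entry = (token, time.time() + TOKEN_LIFETIME)
        self._cache[resource] = entry
        return entry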
The 2nd prong is to have a backup for AAD itself. This is the Credentials Caching Service (CCS), which is built on the Substrate platform and was built by AAD itself as a backup solution.
The way CAE is implemented is that SPO has its own event hub in Azure. AAD publishes user revocation
events and policy change events into this hub. SPO also built a Microservice (SPAuthEvent) that listens
to the events from the event hub and forwards them to SPO at a REST endpoint (we use MS Graph to
determine the actual portal URL for the tenant and therefore this gets routed to the proper Content
Farm).
Device compliance events like DeviceNotCompliant are slated for support in 2023.
UnManaged Device Policy: SharePoint and OneDrive unmanaged device access controls for
administrators - SharePoint in Microsoft 365 | Microsoft Learn
IP Policy: Network location-based access to SharePoint and OneDrive - SharePoint in Microsoft 365 |
Microsoft Learn
Site level granular policy with Labels & AuthContext: Manage site access based on sensitivity label -
SharePoint in Microsoft 365 | Microsoft Learn
Information Barriers: Use information barriers with SharePoint - SharePoint in Microsoft 365 | Microsoft
Learn
Block Guest access to Sensitive files: Prevent guest access to files while DLP rules are applied -
SharePoint in Microsoft 365 | Microsoft Learn
As can be seen from above, SPO runs two authentication stacks to support the enterprise and consumer OneDrive within SPO farms. These stacks use AAD v1 and RPS, respectively (for browser). RPS is old and no longer maintained, and it costs a lot of engineering effort to maintain both stacks along with their protocols. Some new clients will not support RPS at all. So, the goal is to converge consumer account authentication and business account authentication on the AAD v2 stack - one stack, one protocol. This also fits into the overall organizational goal of ODB/ODC Convergence at all levels.
Currently, we block all third-party apps from direct access to the SharePoint consumer tenant and allow only a designated list of 1st party apps, or calls proxied through Graph.
Within MSODS, there are partitions known as service instances (SIs). When a company signs up for an
M365 service, their directory information is placed into some SI for that service, for example a
SharePoint Online license might result in a company being placed into an SI called “SharePoint/APAC-
0007”, whereas an Exchange Online (EXO) license might also result in the company being placed in an
EXO SI like “Exchange/apcprd03-009-01”.
SharePoint Networks to SharePoint SIs is 1:N – each SPO network is associated with one (but in few
cases, multiple) MSODS service instances (SPO networks only associate with “SharePoint/*” SIs, never
with for example Exchange SIs). Each MSODS service instance that belongs to SharePoint is associated
with only one SharePoint network.
Due to the volume of requests that must be made for directory information by SPO (for, among many
other things, authorization (authz)), it is untenable to use MSODS directly to service such requests.
Therefore, it is necessary to maintain a local copy of the data from a given SI in the SPO network that
that SI is associated with. That local copy is SPODS.
More information on how SPODS is structured and how SPODS is kept in sync with MSODS can be found
in Tenant Life Cycle
9.1 Blobs
9.1.1 Why a Blob Store?
• Storing file data in SQL creates many challenges
o SQL requires frequent (weekly or more) full backups resulting in many times DB size (today 10X) bytes in
the backup system
o SQL maintenance is driven by DB seeding time. Larger DBs take longer to seed increasing incident count
o SQL has high overhead for large var binary columns (off row storage). Up to 9k per entry.
o SQL has highly variable performance for large payloads.
o Large file support bumps into physical transaction limits (2GB)
o Large tlogs create replication challenges (latency) putting DR SLA at risk.
• As SPO grew, the database backup system failed to scale both physically and financially. With less than 1 year to collapse, ABS was started and delivered on time to avert the crisis
o ABS replaced a simpler but scale-challenged solution with a more complex but scalable solution
o Engineering the more complex scalable system is a more solvable problem, and once solved it pays off repeatedly, especially during rapid growth.
9.1.2 ABS: An Abstracted Partitioned Blob Store
• ABS is not a thin wrapper over Azure Storage Block Blob, ABS is designed to be hosted on any compatible backend
store.
• Existing hosts are In-memory-store (for testing) and the Azure Storage Block Blob system (production)
• ABS has a defined ‘host services’ contract with the backing store that allows additional hosts to be created if/when
needed.
• ABS does not expose the host store to client. The Client API surface is identical regardless of the host (but some hosts
don’t support all features)
• Partitioning is client controlled using a supplied PartitionID (string). All blobs are identified by their PartitionID/ABSId. The PartitionID was added to support client 'Delete Partition' type requests.
• Immutability is provided even on mutable stores (e.g., Azure Storage Block Blob).
9.1.3 ABS solution to the SQL Backup Problem
• ABS replaces backup with ‘Deferred Deletion’
• Being an immutable append only system allows simple retention to support metadata rollback at very low cost
• All blobs that are to be deleted are 'aged' in a SQL table until all possible SQL backups that could restore a reference to the blob are gone
• This provides a logical 'full backup' for the cost of only 1.8% additional storage per week vs 500% via SQL backup.
• The delete aging table entry (DeletedABS) is added in the same transaction that removes the reference. On DB rollback, blobs are 'auto' undeleted, as the rollback that restores a reference also removes the deletion table entry.
• Jobs that process the DeleteABS entries make further checks to handle aliasing and other restore workflows. Once a
blob passes checks it is deep deleted.
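A sketch of the deferred-deletion idea; the DeletedABS naming follows the description above, but the retention window and checks are simplified assumptions, not the real job.

# Sketch: ABS deferred deletion. A DeletedABS row is written in the same
# transaction that removes the last reference; a background job deep-deletes
# the blob only after every SQL backup that could restore a reference is gone.
# Retention window and checks are simplified assumptions.
from datetime import datetime, timedelta

BACKUP_RETENTION = timedelta(days=14)  # assumption: longest-lived SQL backup

def process_deleted_abs_entries(deleted_abs_rows, blob_store, now=None):
    now = now or datetime.utcnow()
    for row in deleted_abs_rows:
        if now - row["deleted_at"] < BACKUP_RETENTION:
            continue  # still "aging": a DB rollback could restore a reference
        if row_has_live_alias(row):
            continue  # aliasing / restore workflows still reference this blob
        blob_store.deep_delete(row["partition_id"], row["abs_id"])

def row_has_live_alias(row) -> bool:
    # Placeholder for the additional alias / restore-workflow checks.
    return False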
9.1.4 ABS v1.0 (Blob Drain)
• v1.0 attacked database backup size by reducing DB bytes
• Transaction size was not reduced, as all blobs were still written to the DB first
• Background jobs moved file blobs stored in SQL to ABS
• The background job was effective for data written before the ABS rollout and 'ok' for net new writes.
• Effective end-to-end hashing system to ensure no corruption in transit (from SQL to the Azure Storage Block Blob store)
The comparison below contrasts using Azure Storage Block Blob directly with ABS:
Availability - Direct: cluster or region issues impact availability for Read/Write. ABS: resilient to cluster down and region down for Read and Write; hardened against DNS failures.
Byte/TPS limit - Direct: account 5PB max / 25k TPS, single cluster limited. ABS: 300PB per ABS system, 1.5M TPS; blob reads distributed across 60 Azure Storage Accounts on multiple physical clusters.
Blob Id - Direct: client-controlled string; write failures due to conflicts possible. ABS: ABS-generated crypto-random GUID string; no conflicts.
Encryption - Direct: service or client library controlled; many blobs to one key model. ABS: ABS-implemented encryption/decryption with a unique key per blob (UKPB).
Blob Read/Write - Direct: single operation, single blob. ABS: batch operations supported on multiple blobs (async).
Immutability - Direct: not supported. ABS: fully supported; all writes to existing blobs fail.
Partition limit - Direct: container count limited to thousands if container policy is used. ABS: unlimited partitions; SPO has 1 billion partitions (sites) today.
Read latency - Direct: driven by a single request. ABS: dual read reduces 99th percentile latency by reading the DR copy in parallel.
Data needed to read a blob - Direct: blob name and credential to the container. ABS: PartitionID, ABSId, ABSInfo (opaque byte array containing encryption keys etc.), ABSLocatorId (address).
BYOK support - Direct: account-level support only, single key vault dependency. ABS: partition-level support; high availability using 2 key vaults in different Geos with compliant caching to handle DNS/AD issues.
The figure below shows the Primary and DR pool approach for ABS, providing 300PB of storage and 1.5 million TPS.
There are two types of content databases: shared content databases and dedicated content databases. A typical tenant has all its sites in a single shared content database, which is shared with other tenants. When a tenant grows big, it is isolated into its own dedicated content database, which is not shared with other tenants. An isolated tenant can occupy multiple dedicated content databases.
When a database grows too big, determined by thresholds such as site collection count or document count, it is automatically split into two smaller databases.
Config and SiteMap databases are considered "single point of failure" databases for the farm, so they are hosted in standalone mode so that they are not affected by content databases.
SPO uses the SQL Azure vCore purchase model, which allows the service to scale CPU and storage independently. A typical SPO elastic pool has 10 vCores and can go up to 80 vCores. SPO hosts the vast majority of databases on the Business Critical service tier, which keeps 4 local nodes and provides high IO bandwidth, a hot standby node, and a second node dedicated to read-only access. The RO node is used extensively in the SPO service to offload load from the primary node. A few databases with low IO demand are hosted in the General Purpose service tier, which has the advantage of being more cost effective.
To handle organic growth, we collect CPU, worker thread percentage and other metrics from every standalone database and elastic pool. We average these metrics over 15-minute spans and compute a "score" based on the top 15-minute spans from the past seven days. An elastic pool with a score of 60% means usage during the peak 15-minute spans of this pool is at about 60%.
During off-peak hours, we rebalance resources, trying to keep each pool score within an optimal range. The range currently used is (40%, 75%), but that is subject to change. We split an elastic pool if its score is too high and it contains too many databases; upgrade the elastic pool if its score is high and it doesn't contain too many databases; eliminate the elastic pool by distributing its databases to other elastic pools; and downgrade the elastic pool if its score is low and it cannot be packed away.
During daily rebalance, we try to keep the elastic pool vCore count between 10 and 20. A vCore count that is too low is not efficient and may not have the power to handle peak-hour demand. Too big a pool also has its problems: upgrades and downgrades take longer because more data needs to be copied from one node to another, which is a problem during an incident because we will not be able to scale fast enough. Furthermore, too many databases in a pool may also introduce session limit problems.
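A sketch of the scoring and rebalance decision; the (40%, 75%) range comes from the text, while the aggregation details and thresholds in the code are simplified assumptions.

# Sketch: elastic-pool "score" and daily rebalance decision. The score is based
# on the busiest 15-minute spans of the past 7 days; the optimal range is
# (40%, 75%). Aggregation details and DB-count thresholds are assumptions.
def pool_score(cpu_by_span: dict, top_n: int = 3) -> float:
    # cpu_by_span: 15-minute span id -> average CPU% over that span
    top_spans = sorted(cpu_by_span.values(), reverse=True)[:top_n]
    return sum(top_spans) / len(top_spans)

def rebalance_action(score: float, db_count: int, low=40.0, high=75.0, many_dbs=50) -> str:
    if score > high:
        return "split pool" if db_count > many_dbs else "upgrade pool (more vCores)"
    if score < low:
        # Low score: eliminate the pool if its DBs can be packed away, else downgrade.
        return "distribute DBs to other pools" if db_count <= 10 else "downgrade pool"
    return "no action"

print(rebalance_action(score=82.0, db_count=120))  # split pool
print(rebalance_action(score=35.0, db_count=60))   # downgrade pool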
To be able to respond to sudden spikes quickly, we have multiple monitors. Our monitors fall into three categories: active monitors, which actively send requests to each database to make sure it is alive; Azure alerts, which monitor database metrics such as CPU% and worker thread% from the Azure side; and passive monitors, which track QoS data based on end user traffic results, such as key stored procedure latency and errors. These monitors generate alerts, which are hooked up to a response system called Diagnostic Scripts that allows custom code to be run to automatically heal the database. Depending on the situation, auto-healing actions may decide to increase storage, do a SQL node failover, upgrade or split the elastic pool, or failover the database to its geo-secondary.
The vast majority of the SQL query execution in SPO goes through a class called SqlSession, which is
shown as the green box in the top left corner. Among other things, this class manages connection pools
to ReadWrite and ReadOnly instances of the same SQL database. It also tracks connection speed and
could decide to throttle requests when the SQL connection becomes too slow. This class also logs the
query information to be uploaded to COSMOS store later for further aggregation and analysis. The SQL
log includes the Id of the caller app and much other useful information which can be used for cost
attribution. Once this information reaches the COSMOS store, a daily job will generate SQL telemetry
and attribution reports, which can be used to further analyze SQL usage in depth. This aggregation
process is represented by the blue boxes in the figure above.
The orange box in the figure above lists several important timer jobs which enhance and utilize the SQL
telemetry. The Sproc attribution job and SqlQuery attribution job are needed because the SQL log
alone does not give the complete picture of SQL usage as it contains only the stats seen by the SQL
client. These two jobs enhance the data by querying the database DMV table to obtain SQL server
statistics such as milli-seconds spent in the CPU core and associate them with the SQL query executed.
The Sql Metrics Collector job collects CPU/worker percent etc. metrics every 10 seconds and saves the
data to Redis cache and Geneva MDM. This data is shared by all front-end machines and can be used to
More information related to SPO database usage attribution and database performance stats can be found at http://spo-rt/SqlCPU and http://spo-rt/SqlPerf. The following is a sample chart showing the attribution of SQL usage from different applications on a given day.
• Deferred functionality - that can run asynchronously as a follow up to a user / admin action
• Tenant admin provisioning / update actions
• Security and Compliance - Information Protection, Antivirus Scanning, …
• Periodic computation of user visible quotas etc.
• Periodic optimization and maintenance - on object in the content database / blob storage
• Integration / sync - user profile data, webhooks, content types, …
• Migration
• Monitoring and alerting
To run these tasks, SharePoint Online has its own scheduled tasks management service, manifested as
Timer Service instances installed on the BOT and USR VM roles in content farms. Most of these tasks are
carried out on the BOT role, but some execution happens on the USR role as well. If the Timer Service or
any of its instances begins to malfunction, it will not take long for problems to begin appearing across
the farm.
Another critical workload that needs separate discussion is Search Crawl - which responds to create,
update, delete operations on user content and pushes changes to the M365 search index. This workload
runs in its own processes (independent of the Timer Service) on the BOT role within all content farms.
Timer jobs are units of functionality encapsulated within .NET classes and authored by various
engineering teams within (and outside) SharePoint. Today, we have more than 300 active timer jobs.
These vary in terms of schedule, resources they operate on and their CPU and memory resources
consumption. Unlike Search, timer jobs usually take up to 10% of CPU on a loaded BOT machine but are
memory intensive. They also contribute to about 10% of the downstream COGS on Azure SQL.
To avoid internal workloads' resource consumption inadvertently impacting user facing web and API
traffic, they are largely restricted to running on the BOT VM role. The USR to BOT VM ratio in a content
farm is 2.5:1. Additionally, BOT VMs are allocated half the CPU, memory, and disk as USR VMs resulting
in a 5:1 resource allocation ratio between the 2 roles.
The above separation allows BOTs to run at 100% utilization while USRs need to be kept at 65%
utilization to provide for spiky traffic.
The threading model is declared via the lock type of the timer job: Job, None, or ContentDatabase.
Since 2022, another option has been available. Assignments are a feature designed to replace locks.
Rather than being limited to one of three thread models (1 per set of Content DBs, 1 per Timer service, 1
per farm), assignments allow job owners to customize their thread model. Assignment definitions
specify the number of threads to run per target resource, the number of Timer service instances those
threads should be spread across, and the maximum number of threads a Timer service instance should
run at a time. The target resource is usually, but not always, a content database. Assignment definitions
can also specify filters for resources and service instances. For example, an assignment might only run
against DBs that are not in read-only mode and might only run on BOT machines.
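A sketch of the knobs an assignment definition exposes, per the description above; the field and filter names are illustrative assumptions (the real definitions are .NET classes inside the Timer Service).

# Sketch: the knobs an assignment definition exposes. Field and filter names
# are illustrative assumptions, not the real Timer Service types.
from dataclasses import dataclass
from typing import Callable

@dataclass
class AssignmentDefinition:
    job_name: str
    threads_per_target: int          # threads to run per target resource (usually a content DB)
    instances_to_spread_across: int  # how many Timer service instances those threads are spread over
    max_threads_per_instance: int    # cap on concurrent threads on any single Timer service instance
    resource_filter: Callable = lambda db: not db.read_only  # e.g. skip read-only content DBs
    instance_filter: Callable = lambda vm: vm.role == "BOT"  # e.g. only run on BOT machines

# Example: one thread per writable content DB, spread over 4 BOT instances,
# never more than 8 concurrent threads on any single instance.
example = AssignmentDefinition("SiteQuotaRecalc", 1, 4, 8)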
10.2.2 Schedule
The schedule determines how often a job will run. Using a schedule allows jobs to run at a desired
period as opposed to running continuously and checking if work needs to be done. Jobs share resources
like threads, network bandwidth and SQL cost with other jobs, so authors are responsible for ensuring that their job does not run any more frequently than necessary.
The shortest schedule allowed is every 1 minute. The timer service introduces jitter into job execution
by randomizing execution within the limits of the specified schedule. This prevents jobs of a certain type
from starting at the same time and overloading downstream resources like the content DB.
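A one-function sketch of that jitter (the actual randomization lives inside the Timer Service):

# Sketch: randomize a job's next run within its schedule window so that jobs of
# the same type do not all start at once and overload the content DB.
import random

def next_run_with_jitter(now: float, period_seconds: float) -> float:
    # Pick a uniformly random point inside the next schedule window.
    return now + random.uniform(0, period_seconds)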
Since 2021, the timer service can prefetch work items for opted-in jobs based on their schedule, reducing the number of jobs hitting the content database. It also optimizes job invocation by skipping work item type jobs that have nothing to process.
Work item type job owners are responsible for enqueueing at a reasonable rate. This depends on the
scenario and/or SLA, as well as the rate of dequeuing/processing. If the enqueuing rate exceeds the dequeuing/processing rate, the work item queue builds up and causes service degradation on the farm. Since 2021, work item jobs are throttled at the entry point, and producers need to be able to handle the corresponding failures. The queue size threshold is set per work item type; the current default is 30M.
The timer service passes a state object to pausable jobs in the Execute() method, persists the state upon
pause and restores it upon resume.
10.2.5 Documentation
If you are interested in learning more about the Timer Service, how to author your own job as well as
best practices for authoring, the Timer Service Wiki is the best place to start.
Content Push Service (CPS) was introduced in 2019 which replaces the old Search Crawler from the
Search Farm. CPS runs on the BOT roles in content farms and the crawl state is stored in the Content
DBs. The CPS service is tightly integrated with SharePoint and introduced the concept of scenario-based
priority queues to enable High Visible user changes to be pushed in seconds. CPS submits updates to an
Azure service owned by the FAST team called SCS (Search Content Service).
Search Content Service (SCS) launched in 2014. This was a key project and directional shift that
disconnected the crawler from the search farm. This was achieved by the crawler submitting content to
SCS, and the search farms pulling content from SCS. This enabled warm stand-by for search farms, dual
indexing on PR & DR. SCS also became key for content routing for ingestion into Substrate. SCS is now
one of the largest Azure services.
The completion of Project Greenland means that Search Content Service (SCS), currently an integral part
in distribution of search data to search farms, would become more akin to a persistent Substrate queue,
with much similarity and overlap with the data structures of the ODSP Content Push Service (CPS).
For more details about Content Push Service see our wiki here: Content Push Service (CPS) - Overview
(visualstudio.com)
There is a timer job that runs in the SPO timer service (owstimer.exe) called SyncToAdTimerJob_Sync.
This job is a singleton job, meaning it only runs on one BOT machine at a time in each content farm. At a
high level, the job is responsible for calling the APIs supplied by MSODS, transforming the results into a
form that can be stored in SPODS, and writing those changes to SPODS. The main API used to get the
next set of changes is called GetChanges. This returns a set of changes that must be applied to SPODS, a
flag indicating if there are more changes available in MSODS (more=true or more=false), and a cookie
that must be used on the next call to tell MSODS where we are in the stream.
Errors happen, and sometimes changes cannot be applied immediately to SPODS. It is important to
continue making progress in the sync stream however, so the SI does not build a backlog on the MSODS
side or create a delay where customers notice that changes to their tenant/users/groups/etc. are not
propagated to SharePoint Online. Therefore, forward sync has a concept of a recovery queue, which is
also persisted alongside the cookie in SPODS. Items in the recovery queue can be processed by using an
API called GetDirectoryObjects, which will return the entire full state of that object.
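A sketch of the GetChanges loop and recovery queue handling described above; the client objects and method shapes are simplified stand-ins, not the real MSODS/SPODS APIs.

# Sketch: forward sync main loop. GetChanges returns a batch of changes, a
# "more" flag and a cookie; changes that cannot be applied go to a persisted
# recovery queue and are later re-fetched in full via GetDirectoryObjects.
# The msods/spods client objects are stand-ins, not real APIs.
def run_forward_sync(msods, spods):
    cookie = spods.load_cookie()
    recovery_queue = spods.load_recovery_queue()

    more = True
    while more:
        changes, more, cookie = msods.get_changes(cookie)
        for change in changes:
            try:
                spods.apply(change)
            except Exception:
                # Keep making progress through the stream; retry this object later.
                recovery_queue.append(change.object_id)
        spods.save_cookie(cookie)  # persist progress alongside the recovery queue
        spods.save_recovery_queue(recovery_queue)

    for object_id in list(recovery_queue):
        full_object = msods.get_directory_objects([object_id])  # full state of the object
        spods.apply_full(full_object)
        recovery_queue.remove(object_id)
    spods.save_recovery_queue(recovery_queue)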
For more details (which are out of scope for this document) on how forward sync optimizes processing
of the sync stream, handles the recovery queue, queues full tenant sync requests, or handles tenant
moves, feel free to reach out to [email protected].
As an Active Directory deployment, AD SPODS utilizes the Lightweight Directory Access Protocol (LDAP)
which is an open, vendor-neutral, standard application protocol that was designed for interacting with
directory stores. Within the LDAP protocol, each object contains a set of attributes. Every object also
maintains a unique identifier called a Distinguished Name (DN) which enables one to find the object
similar to a full file path. Objects can be nested within other objects thus allowing one to build a tree-like
structure. Every tenant maps to a top level object called an OrganizationUnit (OU) object. Within each
OU object, we store all of the directory objects that belong to that tenant – things like user identities,
group objects, device information, etc. – grouped within appropriate subtrees.
AD SPODS requires special hardware to run, specifically there is a DS farm in every network with a set of
virtual machines called SPDs. See section 4.3 for a description of the various VM roles. The SPD machine
role is one of several machine roles in SPO. One of these SPD VMs is chosen to be the “primary SPD” and
is responsible for accepting writes from major workloads such as fwdsync and provisioning, and those
writes are replicated out to the other SPDs using AD technology. Reads for flows like AuthZ can be
performed against non-primary SPDs.
SQL SPODS is implemented as a SQL database (Directory DB) with several tables for each type of entity
that is stored in SPODS, such as Tenant, User, Group, etc with support for group membership expansion.
For more details or questions about SQL SPODS, you can reach out to [email protected].
A Tenant is a representation of a customer in SPO and consists of 3 main parts – Identity, Metadata and
Content.
• User:
o UserPrincipalName (or UPN), similar to an email address, used for login, may change
over the lifetime of the user.
o Passport User ID, PUID, unique immutable ID (long/hex), never changes for the lifetime
of a user. Enterprise generally keys off of UPN or ObjectID, but you may come across it
in the code. Consumer auth tokens often surface the PUID and CID for identification.
o ObjectId, a GUID that uniquely identifies this object. This is assigned by AAD.
o Other metadata like email, phone, etc.
• Group
o Alias, a name for the group.
o ObjectId, a GUID that uniquely identifies this object. This is assigned by AAD.
o MemberOf and Member metadata, indicating direct group membership.
• Membership
o This is basically a "mapping" of a group to its members, who can be users or other groups themselves (this creates a Directed Acyclic Graph of dependencies). Membership is part of the group object.
Metadata:
• AAD owned metadata: The tenant has a bunch of attributes that determine how it should be
processed in SPO. It can have the list of Assigned Plans (Standard, Enterprise),
VerifiedDomainName (contoso or fabrikam, etc.), Purchased license counts, and so on. This part
of the metadata is mastered in AAD (Azure Active Directory).
• SPO owned metadata: The location of the tenant – SPO Farm(s), SPO Database(s), DNS and
Routing information, Site subscriptionId that identifies all sites that belong to the tenant (this is
usually the same as the companyId), State of provisioning, Workflows that are currently being
processed on the tenant, and so on.
Content:
These are the actual sites (and all their content) whose siteSubscriptionId == that of the tenant. This can
be spread across multiple databases and even multiple content farms.
• TenantStore – this is the master of all SPO information of a Tenant. It is stored in the first
“default” site created to hold the Tenant’s representation, also called the “FunSite” (short for
Fundamental Site).
• Grid – GridManager (GM) is a component that manages SPO topology, and is the master of the
location information of the tenant – the Farm and Database mappings, clumps, and so on. These
must be in sync with the Content database where the tenant’s funsite exists. Grid also has
[This is evolving and information here may be out of date. Talk to specific area owners for details.]
Now, let’s understand how the tenant is actually provisioned in SPO. Many components are involved in
this process, as described in the next 3 sections:
• The end customer subscribes through one of our commerce options (portals where the customer purchases licenses) and, depending on the customer's Geo location, is assigned to one of the M365 data centers
• The global M365 tenant registry is MsoDS (Microsoft Online Directory Service)
• A part of MsoDs associated with a specific Geo location/Data center is called “M365 Service
Instance” (or just “Service Instance”/SI)
• SI is a primary data source for M365 components: EXO (Exchange), Yammer, SPO, etc.
• "SNC" (aka "Forward Sync") is a component responsible for communication between the SI and SPO. It is represented by a farm-level SP timer job. It:
• #1: Pulls recent changes from SI
• #2: Communicates with GM in order to determine which DS farm this change should go to (for new entries, this fulfills a load balancing role)
• #3: Stores the incoming change in the appropriate DS farm
• #9: Pulls data from SpoDs that is ready for publishing back to SI
• Prepares publications and publishes to SI
• By definition a NW can only have a single Snc. In case of Active/Active Farm configuration,
exactly one of the active farms can run Snc.
• "Prv" – a timer job that services provisioning requests from a single DS farm. It:
• #4: Pulls entries from a DS farm that require a provisioning action, such as onboarding, change, lock out, or deprovisioning.
• #5: Communicates with Grid Manager (GM) in order to determine where the affected tenant resides.
• Schedules an async workitem to be executed by Tenant Workflow Engine.
• In the sign up flow, Commerce directly enqueues the tenant provisioning package (that includes
all the “properties” of the tenant that Prv/WFE would need to provision the tenant) into Azure
Service Bus (ASB).
• Then we have a timer job that runs on some USR and BOT machines in the content farm that directly pulls those ASB packages, provisions the corresponding tenants synchronously in-process with no queuing, and publishes information back to MSODS.
• In addition, we also make calls (with 3-minute timeouts) to invalidate any tenant status data so that Office UX caches get the latest information from AAD for a tenant that has been provisioned through Instant-On.
During this process, we make sure that if the tenant happens to land via forward sync faster than
Instant-On (it can happen, though rare), they don’t stomp on each other and we avoid races.
Instant-On is best effort for new tenant provisioning. It does not handle other lifecycle events like changes or deprovisioning; those still flow through forward sync. We ship a NuGet package that our partners can
download and use to communicate through ASB and send us provisioning requests. This is used by
Commerce today and we may offer this to other partners if there’s a business need (e.g., GoDaddy,
other resellers, etc.).
• Data security: each country needs its data to reside in its geo location to avoid data crossing its borders.
• Turnaround time for requests is lower, improving performance and the user experience.
Tenant instances in a multi-Geo environment are divided into two categories based on geo location:
1. Default Instance – Tenant instance in the geo location where tenant subscription was originally
provisioned.
2. Satellite Instances – One or more tenant instances in the geo locations configured by Tenant
Administrator to satisfy their data residency requirements.
The tenant instance Id with geo location mapping is stored in the tenant store.
Each user's OneDrive can be provisioned or moved by an administrator to a satellite location according to the user's PDL (Preferred Data Location). Personal files are kept in that geo location, though they can be shared with users in other geo locations.
Users are created by an administrator from the Microsoft 365 Admin Center. A user's OneDrive is created automatically at the time of their first login. By default, users are provisioned in the default tenant location, but they can be provisioned in a satellite location as well by setting the PDL before their first login.
SharePoint site geo move is an operation initiated by Tenant Administrators by connecting to the
SharePoint Admin URL.
1. Initialize – When the move operation is initiated by the Tenant Administrator, it is added to the Pending Move SPList (PDLChangedList) in the Initialize phase. The OdbMoveSchedulerJob timer job runs every 5 minutes, picks up the entries in PDLChangedList, and queues them to execute the remaining phases of the workflow.
2. Backup – After picking up an item from PDLChangedList, OdbMoveSchedulerJob processes the move job by starting the backup phase. In the backup phase the site state is changed to ReadOnly to avoid any updates, its metadata is backed up into Azure Blob, and then a cross-farm API call is made to start the restore operation in the target farm. An entry is added to UserMoveWorkList to initiate the Move Job.
3. Restore – In the restore phase the site is created on the target farm with the same properties as in the source farm. After the restore workflow finishes, the target farm makes a cross-farm API call to the source farm to notify it that the restore phase is complete.
4. Cleanup – In the cleanup phase the original site is deleted in the source farm and a new redirect site is created to redirect any requests for the old URL to the new site in the target farm.
5. Finalize – In the final phase of the Site Move workflow the move job entry is added to UserMoveCompleteList and removed from both PDLChangedList and UserMoveWorkList to denote the completion of the move operation. A cross-farm API call is also made to move the item to UserMoveCompleteList in the target farm. Entries in UserMoveCompleteList are automatically cleaned up after 180 days.
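A compressed sketch of the five phases as straight-line code; the list and job names come from the steps above, while the farm/list objects are stand-ins for the real timer-job plumbing.

# Sketch: OneDrive/site geo move workflow phases, per the steps above.
# The farm/list objects are stand-ins for the real timer-job plumbing.
def run_site_geo_move(site, source_farm, target_farm, lists):
    # 1. Initialize: entry added to PDLChangedList, picked up by OdbMoveSchedulerJob (every 5 min)
    lists.pdl_changed.add(site)

    # 2. Backup: site goes ReadOnly, metadata backed up to Azure Blob,
    #    cross-farm call starts the restore on the target farm
    source_farm.set_read_only(site)
    backup = source_farm.backup_metadata_to_blob(site)
    lists.user_move_work.add(site)
    target_farm.start_restore(site, backup)

    # 3. Restore: target farm recreates the site with the same properties,
    #    then notifies the source farm that restore is complete
    target_farm.wait_for_restore(site)

    # 4. Cleanup: delete the original site and leave a redirect at the old URL
    source_farm.delete_site(site)
    source_farm.create_redirect(site.old_url, site.new_url)

    # 5. Finalize: move the entry to UserMoveCompleteList (cleaned up after 180 days)
    lists.user_move_complete.add(site)
    lists.pdl_changed.remove(site)
    lists.user_move_work.remove(site)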
• Cross Geo Tenant Store Replication – Tenant store is used to store tenant related settings. For
multi-geo specific properties Cross Geo Tenant Store is used.
SPCrossGeoTenantStoreReplicationJobDefinition timer job is used to replicate cross geo tenant
store properties across different tenant instances.
For example, a regular tenant store property is a simple name/value pair:
Name - random2
Value - 98659480-c632-4ebc-8d0d-77f4c0c49ae5
A Cross Geo Tenant Store property follows a specific naming pattern, xgeo:[key]:[geo]:0e6c74c1-9920-47ab-81bb-80a268fcabda, so the name/value pair looks like:
Name - xgeo:georegularsitecount:can:0e6c74c1-9920-47ab-81bb-80a268fcabda
Value - {"Value":"41939","LastModifiedTimeInUtc":"2022-12-24T11:15:38.0941725Z","IsDeleted":false}
• Taxonomy Replication – A TermStore contains zero or more Group objects, which are used to
organize Terms within TermSets. MultiGeoTaxonomyReplicationJobDefinition timer job is used
to replicate term store settings from default tenant instance to all the satellite instances.
1) Projected demand: DAU Forecasting of user growth for each geo, measured by DAU (Daily
Active Users)
2) Engineering efficiency: COREs/KDAU (cores per 1000 Daily Active Users) for each geo and each
hardware SKU
3) Usable capacity per zone pair (hardware order unit) for each hardware SKU
As explained in 3.1 Physical Topology, a zone consists of a fixed number of servers on a fixed number of
racks for each hardware SKU, connected via an Agg Router. To support SPO DR (Disaster Recovery), we
always order zones in pairs with each zone pair located in two different data centers physically apart
from each other.
Based on 1) and 2), we can calculate the number of cores needed each month for the next 12+ months.
Based on core forecasting and 3), the number of zone pairs are calculated for each geo to determine the
hardware order for the next fiscal year. Capacity PM (Product Manager) will order new hardware
including 6 months buy ahead buffer (reflected in dashboard of Incoming hardware).
Forecasting and hardware order also includes the capacity needed to replace the old hardware to be
decommissioned.
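The arithmetic behind the order, in sketch form; all input numbers below are illustrative, not real SPO figures.

# Sketch: from DAU forecast and engineering efficiency to a zone-pair order,
# per the three inputs above. All numbers below are illustrative.
import math

def zone_pairs_needed(dau_forecast: float, cores_per_kdau: float, usable_cores_per_zone_pair: float) -> int:
    cores_needed = (dau_forecast / 1000.0) * cores_per_kdau
    return math.ceil(cores_needed / usable_cores_per_zone_pair)

# e.g. 40M DAU, 12 cores per 1000 DAU, 150k usable cores per zone pair
print(zone_pairs_needed(40_000_000, 12, 150_000))  # -> 4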
• New hardware (zone pair) dock: hardware lands in Microsoft data centers
• RTEG (Release to Engineering): Azure team will finish the networking and other basic setup of
the zone pair and hand over to SPO
• RTGM (Release to Grid Manager): SPO Fleet Management team will finish configuration of the
zone pair and get it ready for new farm deployment
• Capacity allocation: SPO capacity team decides what the new zone pair capacity will be used for
o New farm deployment
o Standby farm deployment as network move target
o GridManager and infrastructure farms migration
• Farm deployment: SPO deployment team will deploy the farms based on capacity allocation
requirements and release the new farms as Sev1 enabled farms
After new farms are handed off from the farm deployment team as Sev1-enabled farms, the capacity team's farm open/close automation will open the farm for new tenant provisioning and also use the farm as a tenant move target. When farm utilization reaches a predefined threshold, farm open/close automation will close the farm to block new tenant provisioning to this farm. We can also manually open/close farms for new tenant provisioning and exclude particular farms as tenant move targets to meet special business needs.
To effectively utilize compute resources and meet GoLocal business needs, we move tenants across farm labels (stamps) within a geo for load balancing and across geos for GoLocal requests. These are discussed in depth in 11.1.4 Capacity balancing and tenant move and 11.1.5 GoLocal moves.
At this stage, we also have an Auto Capacity mechanism to help maintain active VMs to meet the
expected farm goals to be discussed in 11.1.3 Content farm size definition and auto capacity.
SPO bare metal hardware lifetime is 5.5 years by default. The end-of-life (EOL) date of a zone is based on
zone In-Service-Time (start time in service) plus 5.5 years. For a zone pair, the In-Service-Time of the two
zones usually are very close. We use the earlier one to determine the EOL date of the zone pair.
In special situations when we have to keep old zone pairs for an extended period of time, there is a
process to extend the hardware warranty and update the zone pair EOL date to a later time. This is part
of EOL and warranty management. Normally we do not extend zone EOL to more than 6 years.
Sellable capacity is the compute power measured by USR VM cores that we can use to support user
load.
“Buffer” is the 6-month buy ahead buffer as part of capacity ordering process to deal with uncertainty of
hardware arrival delay. It is also used to handle potential perf regressions. The “Actual” part refers to
the USR cores used for the current user load. “Waste” is the capacity not being used yet excluding the
buy ahead capacity. “Buffer” + “Waste” is the capacity available for future user load growth. Overall,
“Buffer” + “Waste” + “Actual” is the sellable capacity that we can use for customer traffic.
Engineering Reserves is the remaining capacity allocated for SPO internal infrastructure (BOT, SPODS,
GM farms, Infra roles) and maintaining service reliability (CB, HW buffer, DR, etc.).
While USR VMs serve user traffic, BOT VMs are used for internal jobs (see 3.2 Service Topology). Each
USR VM uses 16 cores, and each BOT VM uses 8 cores. For each content farm, USR VM count : BOT VM
count ratio is 2.5 : 1. From capacity point of view, USR cores : BOT cores ratio is 5 : 1.
SPODS is SPO Directory Service designed for SPO services and connected with AAD (Azure Active
Directory), see Active Directory SPODS. We allocate 5 SPODS VMs for each content farm (moving
towards 4 VMs per farm). Each SPODS VM uses a full Physical Machine regardless of SKU.
Grid Manager farms, Infrastructure farms and InfraCore farms each have specific capacity requirements, but they only need a small percentage of compute capacity compared to the overall SPO capacity.
A significant amount of capacity is used for SPO service reliability. For all sellable cores, we keep the
same amount of capacity to support Disaster Recovery (DR).
For each USR VM, we can use up to x% CPU as full capacity (x% is 65% as of Sept. 2022, moving towards 70%); this is what the "30% USR" category refers to as part of the Engineering Reserves.
SPO has on-going investments to reduce the capacity for Engineering Reserves, which will increase
utilization rate of compute capacity.
• Full size or half size content zone pairs without search farms
• Full size or half size search zone pairs with search farms only
• Full size or half size mixed zone pairs with both content and search farms
Content zone pairs may contain Grid Manager farms and Infrastructure farms for infra role VMs.
Half size zone pairs are normally used for small GoLocal geos with low usage. The majority of compute capacity uses full size zone pairs. After search farms are migrated to Substrate, search zone pairs and
mixed zone pairs will go away. We’ll use full size content zone pairs to discuss content farm size
definition. Other cases are variants of it with minor changes.
We use a VM slot as a unit for compute capacity. Each VM slot represents 8 cores. Here’s the capacity
used by each SPO VM role (reference 3.3 Virtual Machine Topology).
To calculate content farm size, i.e., the number of USRs and BOTs per farm, we use the following formula:
Capacity for a content farm (VM slots) = (zone VM slots – hardware failure buffer – non-content farm capacity cost) / #
of content farms per zone
BOTs = Capacity for a content farm (VM slots) / 6 (where 1/6 capacity for BOTs and 5/6 capacity for USRs)
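The same sizing formula in code form; the slot costs (1 slot per BOT, 2 slots per USR) follow the VM slot definition above, and the input numbers are illustrative.

# Sketch: content farm sizing from zone capacity, per the formula above.
# BOTs get 1/6 of the farm's slots (1 slot = 8 cores each); USRs get 5/6
# (2 slots = 16 cores each). Input numbers are illustrative.
def farm_size(zone_vm_slots: int, hw_failure_buffer: int, non_content_cost: int, farms_per_zone: int):
    farm_slots = (zone_vm_slots - hw_failure_buffer - non_content_cost) / farms_per_zone
    bot_count = int(farm_slots / 6)            # BOT = 1 slot each
    usr_count = int((farm_slots * 5 / 6) / 2)  # USR = 2 slots each
    return usr_count, bot_count

print(farm_size(zone_vm_slots=1600, hw_failure_buffer=100, non_content_cost=120, farms_per_zone=3))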
Ideally, we would like to have a standard farm size across SKUs. We used to define standard farm size as
160 USRs and 65 BOTs based on WCS Gen5. Since different SKUs have different number of cores per
zone, now we define farm size specific to each SKU to fully utilize zone pair capacity.
Farm size is subject to change depending on engineering efficiency improvements, such as hardware
failure buffer reduction and SPODS VM reduction.
Auto capacity job: the Auto Capacity job has two major functionalities.
1. Manage farm goals (USR and BOT goals): auto capacity monitors farm goals and fixes incorrect ones if the two farms of a farm pair have different farm goals. When the farm size definition changes (e.g., from SPODS VM reduction), auto capacity will update the farm goals of all production farms.
2. Keep the number of USR and BOT VMs matching the farm goals. USRs and BOTs might die for different reasons, and the physical servers on which these VMs are created could also turn unhealthy. When the USR and BOT VM counts are below the farm goals, the auto capacity job will automatically deploy new VMs to meet the farm goals.
Over time, some farms may have user load going closer to the farm capacity (limited by the number of
USRs). Capacity balancing will move some tenants from these farms (“hot farms”) to “colder farms” with
lower usage so that we can keep user load below farm capacity.
Capacity balancer is capable of handling multiple criteria using linear optimization, e.g., considering SQL
cost for the moves besides reducing USR load in hot farms. The execution for balancing is not always
sequential. Capacity balancer generates move plans for multiple batches based on the above logic, while
tenant move batches normally run in parallel.
While tenant move is the major solution for capacity balancing, we also leverage farm open/close
mechanism to reduce tenant moves for balancing. When farm utilization reaches a predefined level,
farm open/close automation will automatically close those farms so that new tenants are provisioned in
low load farms, especially the newly deployed farms.
Based on historical data, we have expected load increase for EDU tenants during back-to-school (BTS)
season. Capacity balancer will calculate expected load for BTS season and starts proactive balancing a
few months ahead of time. During BTS season, we are in reactive mode to balance unexpected load
spikes.
Like SQL capacity protection, we have compute capacity protection. We can define protection at either
tenant level or database level. Based on the expected user load growth (spike), if there is a risk that the
farm load will go over farm capacity, we will either move the protected tenant or DB to a low load farm
or move other tenants (DBs) out of the farm. There are multiple criteria to make the decision of which
tenants and DBs to move. Normally we avoid moving large tenants with a lot of users due to time cost to
finish the moves.
Compute protection may require emergency moves when we get information too close to the date of
expected user load spike. There are other situations which require emergency moves, such as capacity
For a GoLocal tenant in a shared DB, we must split the opt-in tenant out of the original shared content DB
to a new content DB so that we can move it to the GoLocal geo separately without impacting the other
tenants in the shared DB. After the DB split, we rely on tenant move to complete the move from the
source farm in the Macro-Geos to a farm in the GoLocal geo.
For a dedicated GoLocal tenant, DB split is not needed. We move all dedicated DBs of the tenant to the
GoLocal geo in a single tenant move batch.
Current SLA for GoLocal move is 24 months after the opt-in window is closed. After opt-in window is
closed for a GoLocal geo, there can be ad hoc special requests for GoLocal moves. We handle those
requests on demand.
GoLocal Playbook
GoLocal geo capacity management is challenging due to a lot more uncertainty of user load and limited
resources (SPO compute capacity, Azure storage and Azure SQL capacity). In case a GoLocal geo is
running out of capacity, we’ll follow the process of M365 Go-Locals Playbook (PM owner: Rebecca
Gee) to mitigate the capacity shortage. The playbook covers:
To support the process, we keep track of tenants that we can potentially move out of each GoLocal geo in case of emergency, including the tenants' compute, blob and SQL usage. With this data at hand, we can make decisions quickly about which tenants we should move out under specific capacity constraints.
[Figure: tenant move phases – TM Plan Generation → PreStage → Move → Cleanup, with Rollback as the escape path]
TM plan generation is the starting point of tenant move automation for capacity balancing, GoLocal
moves, and ad hoc TM requests. After the TM plan is created, the TM execution goes through the
following phases.
• PreStage: prepares the move which mainly focuses on 1) dual syncing tenant’s AAD data into
both the source farm and the target farm 2) continuously copying tenant’s SPO content from the
source farm to the target farm
• Move: flips the tenants from the source farm to the target farm. The flip usually happens during
the off-peak hours of tenant’s Geo
• Cleanup: removes tenants and the corresponding content DBs from the source farm
• Rollback: moves tenants back to the original source farm before cleanup starts, in case something goes wrong in the middle of the PreStage or Move phase
Cross-RM TM: to support moving tenants across Regional Managers (RM), TM needs to replicate grid metadata for the tenants from the source Regional Manager to the target Regional Manager. This requires additional logic to communicate between farms in two different Regional Managers, which is accomplished by building cross-RM technology on top of the service bus mechanism provided by the Grid Manager team.
Tenant move ReadOnly time happens in the Move phase, which includes the following key tasks (a
minimal sketch of the flip sequence follows below):
• Failover Azure SQL DB from the source farm to the target farm: < 2 minutes
• Restamp tenants: < 2 minutes
• GMUpdate: < 1 minute
Failover: changes the primary DB from the source farm to the target farm, making the content DB read-write in
the target farm and read-only in the source farm
Restamp: updates the tenant's DNS to point to the clump in the target farm
GMUpdate: updates grid metadata to switch the tenants from the source farm to the target farm
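A minimal sketch of the Move-phase flip sequence and its ReadOnly budget, assuming the three steps are passed in as placeholder callables (the real failover, restamp, and GMUpdate jobs are orchestrated by TM automation, not by a helper like this):

    # Hypothetical sketch of the Move-phase flip and its ReadOnly budget.
    import time

    RO_BUDGET_SECONDS = 5 * 60  # ~2 min failover + ~2 min restamp + ~1 min GMUpdate

    def flip_tenant(failover_db, restamp_tenant, gm_update):
        start = time.monotonic()
        failover_db()      # content DB becomes read-write in target, read-only in source
        restamp_tenant()   # tenant DNS now points to the clump in the target farm
        gm_update()        # grid metadata switches the tenant to the target farm
        elapsed = time.monotonic() - start
        if elapsed > RO_BUDGET_SECONDS:
            print(f"WARNING: ReadOnly window took {elapsed:.0f}s, over budget")
        return elapsed

    # Example with no-op placeholders:
    flip_tenant(lambda: None, lambda: None, lambda: None)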
Network move (NM) is a gesture that moves all tenants in a content farm from one zone to another,
typically to move a farm label from one zone pair to another zone pair.
There are two scenarios for network move: moving farms within an Azure region and moving them across
regions.
• A same-region network move dismounts the SQL servers and DBs from the old farm and remounts
them to the new standby farm, which avoids setting up GeoDR continuous copies and saves SQL
COGS during the move.
• A cross-region move must create GeoDR DBs in the other region. The databases in the old farms
are dismounted and removed after the move.
12.2 Active-Active
What is Active-Active: in an Active-Active architecture, two independent systems actively run the same
service simultaneously, and user traffic goes to both. If one system goes down, its user traffic is
redirected to the other system. After the unhealthy system recovers, user traffic is balanced
between the two Active-Active systems again.
Performance: by utilizing the capacity allocated to both farms, user traffic is split between the two
farms, which lowers CPU usage on the web servers and in turn improves performance.
Failover granularity: in the SPO Active-Active model, databases are organized into database clumps. If
one DB or a limited number of DBs have health issues, we can fail over individual clumps to the other
farm instead of doing a full-farm failover, which avoids customer impact on healthy clumps.
A DB clump is the unit of failover and routing. The
diagram below shows the relationship between Azure SQL databases, SPO logical DBs, and DB clumps.
By default, a database clump has up to 5 logical databases. However, no tenant can span multiple
DB clumps, which means some clumps have more databases because some tenants have more than
5 dedicated databases. For example, the MSIT tenant in the ProdBubble MSIT_US_1_Content farm has
over 1,000 DBs in a clump, and the Accenture tenant in the US_201_Content farm has over 500 DBs in a
clump.
The clump balancer evaluates the user load of each clump based on the DB-level user load from compute
capacity management. The balancer has multiple balancing strategies and tries to apply the best one
for clump failover. Since user load is not evenly distributed among content DBs
and clumps, we cannot reach a perfect 50:50 distribution; the clump balancer's job is to find the
optimal solution that minimizes the user-load difference between the two farms of each farm pair.
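As a hedged illustration, here is one possible balancing strategy sketched in Python: a greedy partition of clumps across the two farms of a pair that keeps the user-load difference small. The real balancer applies several strategies and additional constraints; names and numbers here are made up.

    # Hypothetical sketch of a clump-balancing strategy (greedy partition).

    def balance_clumps(clump_loads):
        """clump_loads: dict of clump name -> user load. Returns (farm_a, farm_b)."""
        farm_a, farm_b = [], []
        load_a = load_b = 0.0
        # Place the heaviest clumps first onto whichever farm is currently lighter.
        for clump, load in sorted(clump_loads.items(), key=lambda kv: kv[1], reverse=True):
            if load_a <= load_b:
                farm_a.append(clump); load_a += load
            else:
                farm_b.append(clump); load_b += load
        return farm_a, farm_b  # the split will rarely be exactly 50:50

    print(balance_clumps({"clump1": 40, "clump2": 35, "clump3": 20, "clump4": 10}))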
Balancing configuration: due to live site incidents or special business needs, we may have to disable
automatic clump balancing. Here are the basic rules:
Special balancing: besides the default 50:50 balancing rule, we support special balancing requirements:
Balancing history: to help investigate live site issues, clump balancing history is stored in the
SPOReports database and surfaced in a FarmBalanceReport dashboard.
For cases that impact failover, alerts are integrated into DR readiness. The clump health monitor also
has auto-heal capability to fix a set of known problems.
• RTO (recovery time objective) is the time it takes to restore the availability of SPO, starting from
when availability is lost until the customer regains full access to their content. In most
months, SPO can restore availability using failover in less than 30 minutes at the 95th percentile of
databases.
• RPO (recovery point objective) is the amount of data loss sustained if there were a failover. The
data loss is caused by the data replication lag between geo locations. SPO keeps this number as
small as possible: if the RPO of a database grows beyond 30 minutes, an escalation is fired, and if
the RPO of any database exceeds 60 minutes, failover is not allowed to proceed (a minimal sketch
of this gate follows).
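The thresholds below come from the text above; the function and field names are illustrative only, not the actual DR tooling.

    # Hypothetical sketch of the RPO gate described above.

    RPO_ESCALATE_MIN = 30   # fire an escalation above this replication lag
    RPO_BLOCK_MIN = 60      # do not allow failover above this lag

    def check_rpo(db_name, rpo_minutes):
        if rpo_minutes > RPO_BLOCK_MIN:
            return {"db": db_name, "action": "block_failover"}
        if rpo_minutes > RPO_ESCALATE_MIN:
            return {"db": db_name, "action": "escalate"}
        return {"db": db_name, "action": "ok"}

    print(check_rpo("DB_123", rpo_minutes=45))  # escalate, but failover still allowed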
This capability does not come for free. For each farm label, SPO builds two farms which are at least 250
miles apart. Each farm contains its own compute machines (USR and BOT), and each content database also
sets up a continuous copy database in the other farm. Continuous copy is a SQL Azure feature which
replicates all changes in a database to a geographically remote copy. The source database is called the
primary database and allows read-write access; the remote copy is called the geo-secondary
database and is read-only. Figure 12-2 Primary and Recovery Farms and Databases shows the
relationship between the farm pairs and database pairs. SQL Azure is responsible for replicating every
change that happens on the primary database to the recovery database. There is usually a small delay
before a change appears on the recovery database; this delta is the source of data loss when a
failover happens before some changes could be replicated. As noted above, this time delta is the
RPO. SPO monitors database RPO closely and fires alerts if the RPO exceeds certain thresholds. More
information about SQL GeoDR and replication can be found here: https://docs.microsoft.com/en-us/azure/azure-sql/database/active-geo-replication-overview.
12.3.3 DR Dashboard
DR Dashboard, accessible to all ODSP engineers at drdashboard.azurewebsites.net, is a web site which
provides information on all aspects of disaster recovery in SPO. DR engineers and SREs often use it to find
information about ongoing failovers or farm DR readiness. Engineers doing incident
postmortems also go there to study recent failovers. The web site keeps detailed
information about each failover, such as the time each step was completed, error information from the
processing jobs, etc. The following picture is a screenshot of the world map view on this dashboard.
In this picture things look good overall, with only two data centers having farms that are not
ready for failover.
A scheduled failover is a planned failover that is normally scheduled in advance and can be postponed if
something unexpected happens. It requires OCEs/SREs to create scheduled failover activities before the
failover; those activities tell other gestures to yield. A scheduled failover uses friendly failover,
which allows SQL Azure to finish all replication to guarantee zero data loss; however, if the failover gets
stuck it escalates to forced failover and allows data loss.
A proactive failover is a response to an emerging risk that will likely become a sev-2 or more severe
incident within a short amount of time. Like a scheduled failover, a proactive failover attempts to
fail over with zero data loss before escalating to forced failover.
The most aggressive type is the unscheduled failover, which is usually a response to sev-2, sev-1, or
sev-0 incidents. It will likely incur a small amount of data loss because it uses forced failover
at the SQL level.
(1) Pre-Failover. Check the traffic lights to see if the failover can proceed (more on this later).
(2) Failover. In this stage, the process promotes the recovery database in all the database
pairs to become the geo-primary (the original primary becomes the new geo-secondary).
Also in this step, the failover process updates the DNS entry that brings traffic to
the farm so that it points to the corresponding recovery side. In some cases, this step also launches
the failover of other service components, such as the SPODS forest and the directory
database.
Figure 12-4
• DR Home Page availability – a signal from active monitoring, which makes requests to a
SharePoint home page on a test tenant. This tests the overall health of the whole stack including
• Capacity – looks at the number of USR and BOT VMs to ensure there is enough available capacity
in the other farm to handle all the traffic after a failover.
• DB Health – first, it checks the database probe, which connects to the recovery database and
runs a simple query to check whether the DB is available. Second, it checks that the database is
mounted to SharePoint.
• RPO 60 Min – ensures that the recovery database has data replicated to within 60
minutes of the primary database.
• Alerts – integrates with other alerts that help tell whether the DR side is healthy. (A minimal
sketch of how these checks combine follows.)
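The sketch below aggregates the traffic-light checks listed above into a single go/no-go result. The signal names and thresholds are hypothetical placeholders, not the actual readiness signals.

    # Hypothetical sketch of the pre-failover "traffic light" evaluation: every
    # named check must be green before the failover proceeds.

    def traffic_lights(signals):
        checks = {
            "home_page_availability": signals["home_page_success_rate"] >= 0.99,
            "capacity": signals["recovery_farm_free_capacity"] >= signals["primary_farm_load"],
            "db_health": signals["db_probe_ok"] and signals["db_mounted"],
            "rpo_60_min": signals["rpo_minutes"] <= 60,
            "alerts": not signals["dr_side_alerts_firing"],
        }
        failing = [name for name, ok in checks.items() if not ok]
        return len(failing) == 0, failing  # (can_proceed, which lights are red)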
A farm will be failed over automatically if one of the following criteria is met (a minimal sketch of this
evaluation follows):
• Farm-level active monitoring success rate < 85% for 5 minutes
• Farm-level active monitoring success rate < 90% for 10 minutes
• Farm-level active monitoring success rate < 95% for 30 minutes
There is also a trigger at the database level if the primary database is down and the existing DB auto-heals
cannot bring it back online.
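The following sketch evaluates the farm-level auto-failover triggers listed above against a per-minute success-rate history; the thresholds come from the text, while the data shape is an assumption.

    # Hypothetical sketch of the automatic farm-failover triggers: each rule pairs
    # an active-monitoring success-rate threshold with how long the farm must stay
    # below it.

    TRIGGERS = [(0.85, 5), (0.90, 10), (0.95, 30)]  # (success rate, minutes)

    def should_auto_failover(success_rate_history):
        """success_rate_history: list of per-minute success rates, most recent last."""
        for threshold, minutes in TRIGGERS:
            window = success_rate_history[-minutes:]
            if len(window) == minutes and all(rate < threshold for rate in window):
                return True
        return False

    print(should_auto_failover([0.99] * 25 + [0.80] * 5))  # True: below 85% for 5 minutes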
The following table shows the percentage of failovers automated in the first half of 2022.
To address these challenges, in 2019 we started our journey to build the next version of the caching
platform. The first step was to find a replacement for Velocity which would be highly available and reliable,
scalable, provide high throughput, be secure and compliant, well-proven, ready to use and officially
supported, and, most importantly, apt for our caching scenarios in terms of both speed and access patterns.
In SPO, at a high level, our caching scenarios can be pivoted by payload size: small (<=8 KB), large
(>=512 KB), and medium in between; or by their access patterns, such as single-value puts/gets, multi-value
gets, optimistic concurrency, etc. The majority of our caching scenarios fall under small payload + single-value
puts/gets.
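A minimal sketch of the payload-size pivot described above; the bucket boundaries come from the text, while the classifying function itself is purely illustrative.

    # Hypothetical sketch of the payload-size buckets used to reason about
    # caching scenarios.

    SMALL_MAX = 8 * 1024          # <= 8 KB
    LARGE_MIN = 512 * 1024        # >= 512 KB

    def payload_bucket(payload: bytes) -> str:
        size = len(payload)
        if size <= SMALL_MAX:
            return "small"      # e.g. single-value puts/gets, the dominant scenario
        if size >= LARGE_MIN:
            return "large"
        return "medium"

    print(payload_bucket(b"x" * 1024))  # "small"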
With these prerequisites in mind, we did not have many choices of caching technology unless we wanted
to build one ourselves; however, we wanted something ready to use, and building a
distributed cache from scratch is not trivial. What about open-source caches like
MemcacheD and Redis? Unfortunately, these technologies are geared more towards Linux. They
did have, or at least had, a Windows version at that time, but the support story was not very coherent,
breaking our requirement of being officially supported.
These clusters are deployed as part of content farm deployment via Grid jobs and are co-located
within the same region as the content farm for better latency. Additionally, we configure cluster-level
alerts and IP filters, set up static routes to support leak-routing, and enable DNS hardening.
13.1.5 Security
Client-to-server authentication happens via per-cluster connection strings, and the communication
uses SSL over TCP. By default, all our payloads are encrypted with a per-farm secret key before being
transmitted over the wire. Additionally, we provide key scope verification to avoid cache leakage.
Both the Redis connection strings and the secret keys are rotated periodically; furthermore, we configure
firewall rules on each cluster to keep the system secure and compliant.
13.1.9 Scaling
By default, Azure Redis does not provide elastic scaling; however, we have built mechanisms to
manually scale a Redis cluster in/out or up/down based on our needs.
To solve this problem, we created a cache client library which is both host and caching-technology
agnostic. At a high level, there are two components (an illustrative sketch of this layering follows below):
1. Cache provider: a set of APIs/functionality specific to a particular caching technology; e.g., for Redis it
is the set of APIs exposed by StackExchange.Redis 2.0.601.
2. Pattern-based caches: instead of programming against the cache provider's specific APIs, the
library exposes common caching-pattern constructs like IDistributedCache, which supports
simple puts/gets, IDistributedCacheSlidingWindowCounter, which supports a global atomic
counter, IDistributedCachePubSub, etc. This gives us the ability to use multiple caching providers,
such as Redis and SSD, to better serve our caching needs while being completely transparent to the
callers.
Additionally, the library provides default serialization, compression, encryption, etc., which can be
extended at the caching-scenario level if needed, along with many more performance and reliability
optimizations. For detailed information, please visit the SPO Distributed Cache Wiki.
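The following is an illustrative Python sketch of the provider/pattern split described above. The real library is .NET and uses StackExchange.Redis; here an in-memory provider stands in for Redis, and all class names besides IDistributedCache-style concepts are invented for the sketch.

    # Illustrative sketch of the cache client library layering.
    from abc import ABC, abstractmethod

    class CacheProvider(ABC):
        """Provider: tech-specific primitives (Redis, SSD, ...)."""
        @abstractmethod
        def put(self, key: str, value: bytes) -> None: ...
        @abstractmethod
        def get(self, key: str) -> bytes | None: ...

    class InMemoryProvider(CacheProvider):
        """Stand-in for a real provider such as a Redis-backed one."""
        def __init__(self):
            self._store = {}
        def put(self, key, value):
            self._store[key] = value
        def get(self, key):
            return self._store.get(key)

    class DistributedCache:
        """Pattern-based cache: callers code against this, not the provider APIs."""
        def __init__(self, provider: CacheProvider):
            self._provider = provider
        def put(self, key: str, value: bytes) -> None:
            # serialization / compression / encryption would be applied here
            self._provider.put(key, value)
        def get(self, key: str) -> bytes | None:
            return self._provider.get(key)

    cache = DistributedCache(InMemoryProvider())
    cache.put("tenant:contoso:settings", b"{...}")
    print(cache.get("tenant:contoso:settings"))

Swapping the provider (for example, an SSD-backed one) would leave the callers untouched, which is the transparency the text describes.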
13.1.12 SLA
We provide 99.99% availability and 5 ms latency at P95 for small payloads of <=8 KB.
13.3 SSDCache
SSDCache is SPO's in-house-built key/value distributed cache solution. It leverages unused
SSD disk space on USR VMs to build a per-farm cache cluster; by persisting cache objects on the disks in
the cluster, it helps onboarded components save cost and improve performance.
• Performance:
o SSDCache provides extremely low latency, especially for large payloads. All cache data
is persisted and transferred within the SPO farm, and we try to maximize connection
reuse (currently >99% of cache requests land on existing connections).
For payloads <1 MB, the P95 get latency is <10 ms; for payloads <4 MB, the P95 get
latency is <30 ms.
• COGS:
o An SSDCache cluster is built using existing USR VMs, with no additional hardware required
and no transactional cost for onboarded scenarios.
o All data is persisted and transferred within the farm only, so there is almost no network cost.
14.1 Primary Azure Tools Used for the SPO Instrumentation, Monitoring and Alerting
ecosystem
SPO leverages the Geneva Monitoring ecosystem from Azure, which provides the technology foundation
for our Instrumentation, Monitoring and Alerting capabilities. The same data is also used for the
ODSP Business Metrics (Analytics) capability. Read more about the Geneva Monitoring
Infrastructure here.
• Instrumentation: data that is logged and then processed, in some cases aggregated, and then
stored. Logs and Metrics (defined below) are examples of instrumentation. Instrumentation
can come from our servers, clients, or even synthetic monitoring systems.
• Monitoring: the process whereby the aforementioned instrumentation data is used to assess
and then optionally act on service health issues.
• Alerting: the active and automated process whereby the aforementioned instrumentation data
is conditionally assessed and, if static or dynamic thresholds are met, generates a ticket
through the Microsoft Incident Management system known as ICM. These tickets, based on
rules, generate either paging (phone call) or non-paging (email) alerts.
Our IMA ecosystem generally categorizes these data into two broad types – Logs and Metrics:
• Logs: Logs are the raw schematized data that are emitted and then processed and stored by
Geneva (usually called MDS). Logs are typically queried through the DGrep interface provided by
Geneva. Getting access to the SPO logs is described in the Wiki here.
• Metrics: A Metric is an aggregated measurement of an event or series of events that have
occurred over a period of time, generally over a set of defined dimensions. A simple example is
the total number of successful executions (aka TotalSuccess, the Metric) for a given API (the
Dimension). These data are aggregated on each node in one-minute increments across a pre-defined
set of dimensions, which is called a Preaggregate (a minimal sketch of this aggregation follows).
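A minimal sketch of a one-minute, dimension-keyed preaggregate as described above. The real pipeline is Geneva/IFX; the class and field names here are assumptions made for illustration.

    # Hypothetical sketch of a per-node, one-minute metric preaggregate.
    from collections import defaultdict

    class MinutePreaggregate:
        """Aggregates a counter metric per minute across a fixed set of dimensions."""
        def __init__(self, dimensions):
            self.dimensions = tuple(dimensions)   # e.g. ("Api",)
            self.buckets = defaultdict(int)       # (minute, dim values) -> count

        def record(self, timestamp_sec, dim_values, count=1):
            minute = int(timestamp_sec // 60)
            key = (minute,) + tuple(dim_values[d] for d in self.dimensions)
            self.buckets[key] += count

    total_success = MinutePreaggregate(dimensions=("Api",))
    total_success.record(1_700_000_000, {"Api": "GetListItems"})
    total_success.record(1_700_000_030, {"Api": "GetListItems"})
    print(total_success.buckets)  # one bucket: 2 successes for that API in that minute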
A few other key terms and technologies used by this ecosystem include:
• CIL: our Common Instrumentation Library, a NuGet-packaged wrapper around IFX, the SDK
used by Geneva to collect logs and metrics on nodes.
• MA: the Geneva Monitoring Agent, a per-node process responsible for uploading
logs and aggregated metrics to the Geneva ingestion service, thereby enabling Monitoring and
Alerting.
• Geneva Dashboards and Widgets: Geneva Dashboards are collections of Widgets which
generally show temporal summaries of Metrics, and are the primary tool used by SPO for
monitoring the service. Dashboards are created and accessed from the Geneva site.
• DGrep: Geneva DGrep is the primary interface used to query our Logs. It is also accessed from
the Geneva site.
The security of our product is always either improving or degrading; there is no "status quo". As we build new
features and leverage new technologies, we introduce new risk. Likewise, as threat actors become more
capable, existing hidden risks become visible and exploitable.
To achieve our goal of being the safest cloud for the world’s most valuable data, we invest in Security
Fundamentals – a continuous process of risk identification, monitoring, and reduction:
• Security reviews protect the product as designed. Our goal is to empower engineers to achieve
business outcomes without increasing risk. We strive for an attitude of “yes and here’s how”
rather than a “culture of no”.
• Security monitoring protects the product as implemented. It allows us to uncover new risks
soon after they are introduced, and to detect abuse of existing risks that haven’t yet been
mitigated.
• Security investments are made by every team at every layer of the product to make bad
security outcomes less likely or less impactful.
• Security research keeps us grounded in reality. Pen-test exercises and bug bounty submissions
tell us the truth about our product and where we need to improve. Industry research helps us
identify trends or new classes of vulnerability that we need to stay ahead of.
We do this work with a “buck stops here” mentality – we leverage the work of others where possible
but never outsource the accountability for protecting our customers.
• Engineering systems
o Code changes require Yubikey proof-of-presence to ensure that a trusted engineer
signed off on the changes in the PR.
o Builds execute on trustworthy servers and build outputs are signed.
o Only signed code is allowed to execute on high-risk roles such as domain controllers.
• Service management
o The service is designed with breach boundaries such that an intrusion in a SharePoint
farm cannot spread to other farms, limiting the impact to <1% of customers.
o Grid Manager uses certificate authentication and Job Agent rather than domain
accounts or Remote PowerShell to prevent credentials from being stolen.
o The Windows firewall is configured to restrict network connectivity between
environments.
Our goal is to detect unauthorized access within 10 minutes and evict the intruder within 3 hours. We use
a variety of agents and microservices to accomplish this:
• Windows compute is monitored using the HostIDS security agent which captures ETW telemetry
and sends it to a real-time service called Observer.
• Azure storage, SQL, and Key Vault resources are monitored using a service called Azure Log
Monitor which consumes audit events from every data plane interaction with these services.
• SharePoint emits ETW events for each incoming S2S request. These are captured by the HostIDS
security agent and analyzed in the Observer pipeline to detect activity by a stolen first-party
identity.
• We consume many other sources of data such as Azure DevOps activity, AzureRM control plane
activity and Azure Networking syslog activity to detect unauthorized activity.
• Detection results from every source are ingested into an in-memory graph database called
ClusterBot.
• SPO engineers review detection results daily to identify unauthorized activity.
• HostIDS agents process 300 million events per hour across 500K servers.
• These agents send 130K Windows security observations per hour to the Observer service.
• Azure Log Monitor processes 13 billion data plane events per hour.
• Our systems flag 4,000 suspicious events per hour out of these telemetry streams.
• Of these, roughly 25 per hour are sent for human review.
• On average, high-risk activity results in a paging alert to the on-call engineer every 4 days.
We exercise detection and response in a yearly red team engagement to ensure that our tech, people,
and processes are all prepared to handle a real threat actor.
As SREs, we flip between the minutiae of server-level disk write latencies and the macro view of how to
ensure the redundancy of critical ODSP services during unprecedented outages, at enormous scale.
We obsess over service excellence by understanding the real-life customer experience through extensive
availability and reliability signals. By providing a highly available and reliable product, SRE
aims to empower every ODSP user to achieve more.
• Livesite
o Spread across three regions (Redmond/Dublin/Suzhou) whose OCEs take responsibility
for Farm/Datacenter/Regional level outages.
• Incident Management
o Develop, implement, and support best in class incident management practices across
SPO.
• Insights
o Analyzing Petabytes of ODSP data and telemetry to surface insights that drive reliability,
performance, and efficiencies back into the service.
• Customer Response Team
o The last line of support for Microsoft’s largest SPO customers, leveraging SRE practices to
drive optimal user experience.
• Performance & Debug
o Deep diving into the most difficult of SPO outages and investigations ensuring durable
improvements make their way back into many areas of the product.
Service Level Objectives (SLOs) are typically defined as an SLI threshold or range: a range of values of a
service level indicator that needs to be maintained for the service to be acceptably reliable.
SLOs defined using these SLIs should be specified as SMART goals: Specific, Measurable, Achievable,
Relevant, and Time-bound, so that they are precise and can be clearly measured.
At a high level for SPO, we target availability at different indicators depending on the scenario.
Farm availability is expected to be 99.99% and latency below 1000 ms (a minimal sketch of such a check
follows). Incident severity is determined by factors such as failure rate and impact duration.
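The sketch below evaluates farm SLIs against the SLO targets mentioned above (99.99% availability, latency below 1000 ms). The evaluation window, data shapes, and percentile choice are assumptions for illustration.

    # Hypothetical sketch of evaluating farm SLIs against SLO targets.

    SLO = {"availability": 0.9999, "latency_ms_p95": 1000}

    def evaluate_slo(requests):
        """requests: list of dicts with 'success' (bool) and 'latency_ms' (float)."""
        total = len(requests)
        availability = sum(1 for r in requests if r["success"]) / total
        latencies = sorted(r["latency_ms"] for r in requests)
        p95 = latencies[int(0.95 * (total - 1))]
        return {
            "availability_ok": availability >= SLO["availability"],
            "latency_ok": p95 < SLO["latency_ms_p95"],
            "availability": availability,
            "latency_ms_p95": p95,
        }

    print(evaluate_slo([{"success": True, "latency_ms": 250}] * 99
                       + [{"success": False, "latency_ms": 3000}]))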
16.4.3 Observability
SRE farm-level monitoring is scoped to four buckets:
• Availability and Latency using active monitoring.
• Outside-in availability based on a 3rd party service.
• Passive monitoring based on SPO usage data.
• Database down monitoring based on probes and performance indicators.
- Complete RCAs in areas of SRE expertise like SQL / Network / Performance and generate
descriptive repair items for the relevant feature teams.
- Partner across ODSP/M365/Azure feature teams on RCA investigations that are ambiguous or
span multiple areas of responsibility.
- Report in different forums, such as the Livesite Health Review and the Monthly Service Review,
surfacing key metrics like
o Alert trends
o Recurring issues
o RCA completion rates
- Postmortems
o Larger-scope incidents require a postmortem, which is driven out of SRE. Part of this
process is a follow-up mechanism which ensures that committed repair items are
delivered.
The following people have contributed content to this book, in alphabetical order.