Autopilot: Automatic Data Center Management
Microsoft Research Silicon Valley
ABSTRACT
Microsoft is rapidly increasing the number of large-scale web services that it operates. Services such as Windows Live Search and Windows Live Mail operate from data centers that contain tens or hundreds of thousands of computers, and it is essential that these data centers function reliably with minimal human intervention. This paper describes the first version of Autopilot, the automatic data center management infrastructure developed within Microsoft over the last few years. Autopilot is responsible for automating software provisioning and deployment; system monitoring; and carrying out repair actions to deal with faulty software and hardware. A key assumption underlying Autopilot is that the services built on it must be designed to be manageable. We also therefore outline the best practices adopted by applications that run on Autopilot.
General Terms
Management, Design
Keywords
Automatic management, Cluster computing
1. INTRODUCTION
Microsoft is rapidly expanding the scope of its web-scale online services. Windows Live Search was re-launched using an internal back-end in January 2005, and Windows Live Mail (formerly Hotmail) has seen a large growth in storage capacity over the last few years. Several more web-scale services are currently in development, and the total number of server computers managed by Microsoft has increased very quickly over the last few years, and will continue to grow.

The sudden growth of Microsoft's data center capacity at the same time that several new service back-ends are being developed has given us an opportunity to design a new in-house infrastructure for automatic data center management. This infrastructure is known as Autopilot. Its design was primarily motivated by the need to keep the total cost of a data center, including operational and capital expenses, as low as possible. This is partly achieved by using more intelligent software to replace much of the repetitive work previously handled by operations staff. We aim to maintain as few people as possible on 24-hour call: our most efficient services support many thousands of computers per member of operations staff, and run on an 8x5 rather than 24x7 support schedule. Increased reliability is an equally important benefit of automation. Many data center failures are caused by human error, often resulting from an attempt to fix an earlier problem. As more failure management is moved to automated scripts, there is less variability in the response to faults, and thus we can hope to make the entire system more reliable and maintainable.

The first version of Autopilot described here concentrates on the basic services needed to keep a data center operational: provisioning and deployment; monitoring; and the hardware lifecycle including repair and replacement. Autopilot supplies mechanisms to automate all of these services; however, policy (for example, determining which computers should run which software, or precisely defining and detecting failures that need to be repaired) is mostly left to individual applications. Converting legacy applications to work with new automatic management software is a challenging problem. Autopilot has the luxury of starting from a fresh application base, supporting mainly systems that were built to conform to Autopilot's design principles.

Most of the technology used in Autopilot components is similar to designs that have appeared in previously reported work. Our overall approach to fault tolerance follows the Recovery Oriented Computing model outlined in [3], and we adopt the crash-only software methodology proposed in [4]. Our software deployment strategy fits the framework advanced in [1]. There has been much recent interest in autonomic computing, and a survey of commercial work in this area is given in [5], but the goals of that community are more ambitious than those of Autopilot. Autonomic computing looks forward to the day when most configuration policy will be controlled automatically. As noted above, this paper mostly describes mechanisms to support manually-determined policies. The factors that underlie our desire to implement fault-tolerance in software on top of a large number of unreliable commodity computers are similar to those described in [2]. Many of the design practices we follow are standard software-engineering methodology. Nevertheless, it is still a challenge to combine all of these ideas into a fully automatic system that can manage tens of thousands of computers 24 hours a day for years at a time without any planned downtime for the system as a whole.

This paper gives an overview of Autopilot's structure, but it does not attempt to fill in the details: it would be impractical to try to supply enough information here to allow even a skilled practitioner to re-implement the whole system. Instead we have tried to abstract and explain some of the high-level design principles we adopt that let us write and maintain complex software deployed in large-scale modern data centers. Section 2 outlines our design philosophy. Section 3 describes the typical hardware configuration of our data centers. Section 4 gives an overview of Autopilot's component structure, and Sections 5 to 7 examine each component in more detail. Section 8 provides a brief case study showing how one application interacts with Autopilot, and we discuss some lessons we learned along the way in Section 9.
The named author participated in the design of several Autopilot components, however this paper is primarily a report on the work of others. The original conception, the vast bulk of the design, and all the implementation of Autopilot were undertaken by product groups at Microsoft, led by the Windows Live Search core team.
2. DESIGN PRINCIPLES
Traditionally, reliable systems have been built on top of fault-tolerant hardware. The economics of the contemporary computing industry dictate, however, that the cheapest way to build a very large computing infrastructure is to amass a huge collection of commodity computers. In exchange for lower capital expenditure compared with the traditional approach, this results in hardware that is much more prone to failures. We see this as an opportunity to move more fault-tolerance into software, but we must employ consistent design principles in order to be confident about the reliability of the applications we deploy.

The two most important principles underlying the Autopilot design are fault tolerance and simplicity. Since any component can fail at any time, the system must be reliable enough to continue automatically with some proportion of its computers powered down or misbehaving. All vital state must be replicated, and any necessary fail-over must be completely automatic. We aim to minimize critical dependencies between components so that a temporary fault in one service does not become a single point of failure and disable an entire cluster.

The basic failure model we have assumed is non-Byzantine. This is a consequence of the controlled environment within our data centers. Data corruption problems can generally be managed using checksums, so Byzantine faults tend to arise when some replicas violate a protocol contract. Although this type of failure does occur, it is more likely to be caused by bugs than by the malicious hijacking of a small number of processes. Consequently, problems are typically not confined to a minority of replicas, and so Byzantine fault-tolerant algorithms are of limited usefulness. We briefly revisit this issue in Section 9.

We believe that simplicity is as important as fault-tolerance when building a large-scale reliable, maintainable system. Often this means applying conservative design principles: in many cases we rejected a complex solution that was more efficient, or more elegant in some way, in favor of a simpler design that was good enough. This requires constant discipline to avoid unnecessary optimization and unnecessary generality. Simplicity must always be considered in the context of the entire system, since a solution that looks simpler to a component developer may cause nightmares for integration, testing, or operational staff.

Simplicity is also manifested in more basic ways. Where components are configurable, the parameters are stored in human-readable plain text files that are under the management of the same version control system as source code and documentation. (Autopilot components never use the Windows Registry.) If a change in configuration is made, this is done by a deployment procedure (see Section 5.2) with an audit trail. We discourage the use of any interactive control channel to a process that would allow configuration changes to be made without generating an audit trail.
Where correctness is at stake we attempt not to cut corners even when it introduces extra complexity. Of course every design is only as correct as the assumptions on which it is based, including the assumption that there are no bugs in the code. No algorithm can provide hard guarantees of correctness in a practical system built on physical hardware, so it is impossible to do more than provide best-effort service. We can however distinguish between designs that make their assumptions explicit and those that make implicit assumptions, for example that failures will not happen in pathological combinations. In a large system one must expect all combinations of failures, so we aim to use designs whose assumptions we understand, but that are simple enough that we can hope to find all crucial bugs through careful review and testing. At the same time we accept that no implementation is foolproof, and the best we can achieve is to optimize for a tolerable level of risk.

Our fault tolerance strategy requires that components be designed so that any process can be killed unexpectedly without destabilizing the system. Most of our components therefore treat forced termination as the only exit mechanism and can consequently omit clean shutdown code. Because processes must be able to tolerate crashes, we are able to use assert statements very liberally and, along with the resulting debugging benefits, this can also help to simplify a design since there is no need to try to recover from a damaged invariant. This is a special case of a general principle of avoiding seldom-used failure paths in our programs.

As explained in the Introduction, it was not a design requirement that we provide all of the benefits of Autopilot to legacy code. Legacy applications often assume reliable hardware, but by designing new applications with automation in mind we can move much of this reliance into software and thus reduce hardware complexity and cost. Partly, this means advocating to application designers the same principles of fault tolerance and simplicity that are adopted in the Autopilot components. Applications must expect their processes to be killed without warning and, where possible, customer-facing services should continue, perhaps with degraded operation, in the face of even large numbers (e.g. 50%) of failed computers. Applications must use Autopilot interfaces for reporting errors in order to benefit from automatic monitoring and failure management. Applications must also be easy for Autopilot to install and configure: in practice this means that an application configuration must be entirely specified by files in the local file system of the computer where the application is running.
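To make the crash-only style concrete, the following toy Python sketch (not Autopilot code) illustrates how liberal assertions can take the place of recovery paths: if an invariant is ever damaged the process simply dies and is restarted from replicated state, rather than attempting an in-place repair. The function and invariant shown here are invented for illustration.

```python
def apply_update(index, key, offset):
    """Record the latest log offset seen for a key.

    Toy illustration of the crash-only style: rather than carrying a
    seldom-exercised path that tries to repair a damaged invariant, we
    assert the invariant and rely on the process being killed and
    restarted from replicated state if the assertion ever fires.
    """
    previous = index.get(key, -1)
    # Invariant: offsets recorded for a key must be strictly increasing.
    assert offset > previous, f"offset regressed for {key}: {offset} <= {previous}"
    index[key] = offset
```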
3. HARDWARE CONFIGURATION
In common with other contemporary data center operators, we buy and install computers in quantities of at least a rack. Each computer conforms to one of a small set of standard specifications including, for example, an application configuration with several processor cores and a few direct-attached hard drives, and a storage configuration with more disks per processor. A typical application rack might contain 20 identical multi-core computers, each with 4 direct-attached hard drives. Also in the rack is a simple switch allowing the computers to communicate locally with other computers in the rack, and via a switch hierarchy with the rest of the data center. Finally, each computer has a management interface, either built into the server design or accessed via a rack-mounted serial concentrator. This ensures that, at a minimum, it is possible for a remote software component to
power each computer on and off and install new Operating System images. The set of computers managed by a single instance of Autopilot is called a cluster. At this point, the largest deployed Autopilot clusters contain up to tens of thousands of computers, though many are much smaller. There may be more than one cluster in a data center, but we aim as far as possible to eliminate interdependencies so that a failure in one Autopilot cluster is unlikely to affect other services.
Figure 1: A schematic of the Autopilot system and applications. Arrows show the flow of communication. The Device Manager (Section 4) is the central system-wide authority for configuration and coordination. The Provisioning Service and Deployment Service (Section 5) ensure that each computer is running the correct operating system image and set of application processes. The Watchdog Service and Repair Service (Section 6) cooperate with the application and the Device Manager to detect and recover from software and hardware failures. The Collection Service and Cockpit (Section 7) passively gather information about the running components and make it available in real-time for monitoring the health of the service, as well as recording statistics for off-line analysis. (These monitoring components are Autopiloted like any other application, and therefore communicate with the Device Manager and Watchdog Service which provide fault recovery, deployment assistance, etc., but this communication is not shown in the figure for simplicity.)

Information flows between the satellites and the Device Manager in such a way that the satellites can keep their state weakly consistent without compromising correctness. Satellite services receive information from the Device Manager using a pull model. They regularly send lightweight messages to the Device Manager that report their current status, and in response are sent the relevant parts of the current Device Manager ground truth. The use of regular heartbeat messages makes the
design very robust to transient failures, since individual messages can be lost without affecting eventual correctness. Sometimes a state transition in the Device Manager will cause it to kick remote services to request that they send a pull query immediately. This is simply an optimization that lets us ensure that most satellite computers quickly learn about any required actions, but any computers that do not receive the kick will still learn of the state change through a later heartbeat. An alternative push model would require the Device Manager to keep state for every message recording which clients had so far received it: we decided that the extra network traffic and latency incurred by the pull design was an acceptable tradeoff in exchange for a simpler Device Manager.
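As an illustration of the pull model described above, here is a minimal Python sketch of a satellite service's heartbeat loop. The message contents, the timing constant, and the Device Manager's pull interface are hypothetical stand-ins, not Autopilot's actual protocol.

```python
import threading

HEARTBEAT_INTERVAL_SECS = 30  # hypothetical polling period


class SatelliteService:
    """Toy satellite that keeps its state weakly consistent via periodic pulls."""

    def __init__(self, device_manager):
        self.device_manager = device_manager  # stub exposing a pull() method
        self.local_state = {}                 # this satellite's view of ground truth
        self._kick = threading.Event()        # set when the Device Manager "kicks" us

    def run(self):
        while True:
            # Report status and pull the relevant slice of ground truth.
            # Losing any individual exchange is harmless: the next heartbeat
            # (or a kick) brings the satellite up to date again.
            try:
                update = self.device_manager.pull(status=self.summarize_status())
                self.local_state.update(update)
            except ConnectionError:
                pass  # transient failure; retry on the next heartbeat

            # Sleep until the next heartbeat, waking early if kicked.
            self._kick.wait(timeout=HEARTBEAT_INTERVAL_SECS)
            self._kick.clear()

    def kick(self):
        """Best-effort hint from the Device Manager to pull immediately."""
        self._kick.set()

    def summarize_status(self):
        return {"state_version": self.local_state.get("version", 0)}
```

Because the kick is only a hint, the Device Manager never needs to track which satellites have received which messages; eventual consistency comes from the heartbeats alone.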
what application binaries to fetch and run, as described in the following sections. The computer's name is determined by its position in the network hierarchy, which in turn is determined by the rack (and slot in that rack) where it is plugged in. It is up to the operator to ensure that the correct hardware configuration is installed in each rack slot.

The Provisioning Service is configured to use several computers for redundancy. These cooperate to elect a leader that carries out the appropriate actions. The Provisioning Service is stateless, and any necessary information is retrieved from the Device Manager when the leader starts up. Weak consistency of the deployment state during fail-over may cause some actions to be attempted more than once, but this can be tolerated since any resulting problems will eventually be detected and corrected by the normal activity of the repair services.
5. LOW-LEVEL SERVICES
A small number of operating system images are in use at any time in a cluster. We use only stable commercial releases of Windows Server operating systems. Each image also contains some basic Autopilot configuration files, and some Autopilot-specific Windows services, pre-installed and enabled on boot. The configuration files contain, for example, the DNS names of computers running core Autopilot components. The Windows services are able to communicate with centralized Autopilot components to ensure that the correct application processes are installed and running on the computer. Network configuration and name services are currently managed independently of Autopilot using a standard replicated installation of Active Directory [7].

Every computer runs a local service, supplied by Autopilot, that ensures the correct files are present on its local disk. This filesync service is used extensively by Autopilot and applications to transfer data between computers. The service acts both as a client requesting files from remote machines, and as a server handling such client requests. By using a dedicated service rather than relying on an operating system component (such as the standard Windows remote file access features) we gain better control over logging of file transfers, and the ability to throttle transfers so a computer or switch does not become unexpectedly overloaded.

A second local service on every computer, called the application manager, makes sure that the correct processes are running. Each application process is distributed as a standalone directory containing all binaries, shared libraries and configuration files necessary for the process, along with a standard start.bat script that can be invoked to start it. There is no clean shutdown code, so a process is stopped simply by instructing Windows to kill it and its children. The application manager reads a configuration script and ensures that the designated binaries are running. A process can be configured to run continuously (so it is restarted if it exits for any reason) or periodically, e.g. once an hour.
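A rough Python sketch of an application-manager supervision loop in this style follows. The configuration entries, directory paths, and timing values are invented for illustration; only the standard start.bat entry point and the continuous/periodic run modes come from the paper.

```python
import subprocess
import time

# Hypothetical configuration: the paper specifies only the start.bat entry
# point and the continuous/periodic run modes, not this format.
PROCESSES = [
    {"name": "frontend", "dir": r"D:\app\frontend", "mode": "continuous"},
    {"name": "log-scrubber", "dir": r"D:\app\scrubber", "mode": "periodic",
     "period_secs": 3600},
]


def start(entry):
    """Launch one application via its standard start.bat entry point."""
    return subprocess.Popen(["cmd", "/c", "start.bat"], cwd=entry["dir"])


def supervise(processes):
    """Keep continuous processes running and launch periodic ones on schedule.

    There is no clean-shutdown path: a process that exits, for any reason,
    is simply started again (or left alone until its next period).
    """
    running = {}
    last_run = {}
    while True:
        now = time.time()
        for entry in processes:
            name = entry["name"]
            proc = running.get(name)
            exited = proc is None or proc.poll() is not None
            if entry["mode"] == "continuous":
                if exited:
                    running[name] = start(entry)
            else:  # periodic
                due = now - last_run.get(name, 0) >= entry["period_secs"]
                if due and exited:
                    running[name] = start(entry)
                    last_run[name] = now
        time.sleep(5)
```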
We refer, with slight abuse of terminology, to a computer "storing" a manifest, meaning that the computer stores the files listed in that manifest.
manifests are up to date, so the Device Manager contains both a central record of what versions should be on each computer, and a weakly consistent view of the versions that are actually present.
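The manifest abstraction can be illustrated with a small Python sketch. The on-disk representation assumed here (relative paths mapped to checksums) is an invention for illustration; the paper does not specify how manifests are encoded, only that a computer "stores" a manifest when it stores the files the manifest lists.

```python
import hashlib
import os


def manifest_checksums(root):
    """Compute a checksum for every file under a manifest directory."""
    sums = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            sums[os.path.relpath(path, root)] = digest
    return sums


def stores_manifest(local_root, manifest):
    """True if every file listed in the manifest is present with the right contents."""
    local = manifest_checksums(local_root)
    return all(local.get(path) == digest for path, digest in manifest.items())
```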
computer or network switch rather than a process, and the only available remedies are Reboot, ReImage, and Replace, described in more detail in Section 6.2 below. This ensures that Autopilot does not need to contain logic to attribute blame to a particular process or component, or consider application-specific recovery actions.
6.1 Watchdogs
Faults are detected using a set of watchdogs. A watchdog probes one or more computers to test some attribute, and then reports to the Device Manager. The watchdog reports OK, Warning, or Error for the attribute on each computer, along with an arbitrary descriptive reason string for the latter two. The set of watchdogs is extensible; the definition of a watchdog is simply any piece of code that understands how to contact the Device Manager using the watchdog protocol. This is a simple plain-text protocol so it is easy to write a watchdog in a scripting language. The Device Manager can compute a transient error predicate for any computer using the watchdog attributes: if any watchdog reports Error, the computer is in error; if all watchdogs report either OK or Warning, the computer is not in error. This predicate is used to drive the state machine outlined in Section 6.2.

The Warning status is used to report unexpected but non-fatal conditions. The audit history of warnings can be useful, for example, during the postmortem analysis of an unexpected event, but a warning does not automatically trigger any Autopilot action. We could have built an alert system that would contact an operator when it detected warnings. However, a fundamental goal of Autopilot is to avoid burdening operations staff with the task of monitoring and understanding alerts, or taking remedial actions. We therefore don't want to encourage developers to use warnings as a lifeline to a human: rather, the system should be designed to react to problems automatically. Applications sometimes do need to generate alerts, but they are typically triggered by information integrated from multiple computers or components (see Section 7).

Some watchdogs are supplied by Autopilot and are run either on every computer locally or on a set of computers called the Watchdog Service. These standard watchdogs include periodic checks that every computer is running the right Operating System image and manifest, and queries to the computer's BIOS to detect disk or memory error conditions. Other watchdogs are supplied by the application. There is no need to limit the number of watchdogs, so new watchdogs are often added to address specific scenarios. For example, at one point it was discovered that some computer configurations would spontaneously lose track of half of their DRAM and consequently start paging until the symptom was cured with a reboot. This was addressed with a custom watchdog that periodically probes for the issue.

The Device Manager error predicate is the conjunction of all the watchdogs for a computer, and once a computer is held to be in error, it may be unavailable for a substantial period. It is therefore important to minimize false positives in watchdogs. On the other hand, when there is a fault that is detectable using a watchdog there may be minutes of latency before the Device Manager discovers it and takes action. Components that require low-latency fault mitigation therefore typically implement custom soft-failure detectors, and Section 8 explains this in more detail with reference to a specific example component.
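The watchdog contract is easy to sketch. The following Python fragment shows a hypothetical line-oriented report (the paper says only that the protocol is plain text; the exact syntax and the probe shown here are invented) together with an example classifier in the spirit of the DRAM-loss watchdog mentioned above.

```python
import socket


def report(device_manager_addr, watchdog_name, results):
    """Send one probe pass to the Device Manager.

    results: list of (computer, status, reason) tuples, where status is
    "OK", "Warning", or "Error" and reason may be "" for OK. The line
    format here is an assumption; Autopilot's protocol is plain text but
    its exact syntax is not given in the paper.
    """
    lines = [f"{watchdog_name} {computer} {status} {reason}".rstrip()
             for computer, status, reason in results]
    payload = ("\n".join(lines) + "\n").encode("utf-8")
    with socket.create_connection(device_manager_addr, timeout=10) as sock:
        sock.sendall(payload)


def classify_free_memory(free_gb_by_computer, minimum_gb=4):
    """Example probe logic: flag computers that appear to have lost DRAM."""
    results = []
    for computer, free_gb in free_gb_by_computer.items():
        if free_gb is None:
            results.append((computer, "Warning", "memory query failed"))
        elif free_gb < minimum_gb:
            results.append((computer, "Error", f"only {free_gb}GB of DRAM visible"))
        else:
            results.append((computer, "OK", ""))
    return results
```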
If a computer remains in Probation for too long it will be moved back to Failure, triggering another repair action. When the Device Manager takes some action such as code deployment (Section 5.3) that is likely to generate application-specific watchdog errors, the set of affected machines is moved from Healthy to Probation before the rollout is started. The rollout is deemed to be successful on a computer if it undergoes the normal transition back to Healthy. If a computer stays in Probation for too long then the rollout has failed. This re-uses the existing failure-detection machinery for rollout monitoring, while ensuring that computers do not get assigned a black mark in their repair history due to watchdog errors resulting from planned actions. By centralizing all repair action decisions in the Device Manager state machine, Autopilot is able to throttle the number of machines under repair at any time, and therefore protect against, for example, a faulty watchdog causing all computers in a cluster to be simultaneously rebooted.
Figure 2: A simplified diagram of the failure/recovery state machine. The Device Manager records its estimate of the state of each computer in the cluster as Healthy, Failure or Probation. Transitions described in bold text occur as a side effect of other Device Manager actions. Transitions described in italics occur when a timer expires. The state machine is described in Section 6.2.
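A minimal Python sketch of the state machine in Figure 2 follows. The timeouts, the repair throttle limit, the exact escalation policy, and the issue_repair helper are all assumptions made for illustration; only the three states, the three remedies, and the throttling idea come from the paper.

```python
import time

REPAIR_ACTIONS = ["Reboot", "ReImage", "Replace"]  # escalating remedies
PROBATION_QUIET_SECS = 30 * 60    # hypothetical: error-free time before Healthy
PROBATION_LIMIT_SECS = 6 * 3600   # hypothetical: maximum total time in Probation
MAX_MACHINES_IN_REPAIR = 5        # hypothetical repair throttle


class MachineRecord:
    def __init__(self, name):
        self.name = name
        self.state = "Healthy"
        self.repair_count = 0       # drives escalation through REPAIR_ACTIONS
        self.probation_since = None
        self.last_error = None


def issue_repair(name, action):
    """Stand-in for handing a repair request to the Repair Service."""
    print(f"repair requested: {action} {name}")


def step(machines, in_error, now=None):
    """One evaluation pass over the cluster.

    machines: dict mapping name -> MachineRecord
    in_error: set of names whose watchdog error predicate currently holds
    """
    now = time.time() if now is None else now
    # Throttle: count machines whose repair has been issued but not yet verified.
    busy = sum(1 for m in machines.values() if m.state == "Probation")

    for m in machines.values():
        if m.state == "Healthy":
            if m.name in in_error:
                m.state = "Failure"
        elif m.state == "Failure":
            if busy < MAX_MACHINES_IN_REPAIR:
                action = REPAIR_ACTIONS[min(m.repair_count, len(REPAIR_ACTIONS) - 1)]
                issue_repair(m.name, action)
                m.repair_count += 1
                m.state = "Probation"
                m.probation_since = m.last_error = now
                busy += 1
            # else: leave the machine in Failure until the throttle allows a repair
        else:  # Probation
            if m.name in in_error:
                m.last_error = now
            if now - m.last_error >= PROBATION_QUIET_SECS:
                m.state = "Healthy"     # quiet long enough: declare recovered
            elif now - m.probation_since >= PROBATION_LIMIT_SECS:
                m.state = "Failure"     # stuck in Probation: repair again
```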
7. MONITORING SERVICES
Autopilot components, and applications built to run on Autopilot, record performance counters and logs in a standard location on every computer. Performance counters are used to record the instantaneous state of components, for example a time-weighted average of the number of requests per second being processed by a particular server. Performance counter histories are useful for off-line trend analysis, but real-time values are also invaluable to give operators a current view of the state of the system and help in the diagnosis of any unexpected issues. Logs are mostly used to record individual component actions that can be correlated later in off-line processing.

The Collection Service forms a distributed collection and aggregation tree for performance counters and logs. It can generate a centralized view of the current state of the cluster's performance counters with a latency of a few seconds. Individual counters can be aggregated, for example across an entire machine type, in order to keep the volume of low-latency data manageable. The Collection Service also lazily collects detailed performance counters and logs and writes them to a large-scale distributed, replicated file store where they are available for off-line data mining. Real-time performance-counter information is kept in a SQL database so that sophisticated statistics can be computed for visualization and diagnosis simply by issuing the appropriate relational queries.

Cockpit is a visualization tool that lets operators monitor one or more Autopilot clusters using graphs and reports generated from the performance counter databases. It is easy to store default views, or construct custom queries to drill down into a particular issue. Cockpit also serves as a gateway allowing operators to fetch arbitrary log files from individual computers. Together with the ability to monitor computers' performance counters, this provides an audited access mechanism that eliminates most requirements for direct access to data center computers. There is an automated Alert Service that sends emails or pages to support staff based on application-defined relational queries against the Cockpit database. These queries can capture system-wide properties by aggregating data from many computers and sub-components.
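To illustrate the kind of application-defined relational query the Alert Service might evaluate, here is a small Python sketch. The table name, column names, threshold, and the use of SQLite as a stand-in for the production SQL database are all assumptions; the paper specifies only that alerts are driven by relational queries over aggregated performance-counter data.

```python
import sqlite3  # stand-in for the production SQL database

# Hypothetical schema: counters(machine_type, sample_time, requests_per_sec).
ALERT_QUERY = """
    SELECT machine_type,
           AVG(requests_per_sec) AS avg_qps
    FROM   counters
    WHERE  sample_time > datetime('now', '-5 minutes')
    GROUP  BY machine_type
    HAVING avg_qps < 10
"""


def evaluate_alerts(db_path, send_page):
    """Page support staff if any machine type's aggregate request rate collapses."""
    with sqlite3.connect(db_path) as conn:
        for machine_type, avg_qps in conn.execute(ALERT_QUERY):
            send_page(f"{machine_type}: aggregate rate dropped to {avg_qps:.1f} req/s")
```

Because the query aggregates over a whole machine type rather than a single computer, it captures the system-wide conditions that genuinely warrant waking an operator.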
fault-tolerance therefore seems justified. Our staged rollout procedure does introduce new code (along with its new potential bugs) on a small fraction of a machine type's computers at first. This model fits Byzantine fault-tolerant assumptions very well, and some applications may in future adopt more complex fault tolerance strategies if the risk of state corruption justifies it.

Autopilot contains a number of hand-set thresholds defining the policy for deployments, probation state timeouts, etc. We are currently recording large amounts of data logging the actions that Autopilot takes, along with performance counters capturing the state of client applications when those actions are taken. We are experimenting with machine learning algorithms to analyze these data in order to understand how to improve the policy settings, with the ultimate goal of automating many of the current manual policies.

As with all large-scale deployments, we have encountered failures of every type that we expected, and some we didn't. It is vital to keep checksums of all crucial files (for example machine-type manifests) since they will become corrupted. Checksums also allow the detection of hand-edited configurations that were temporarily changed, for example as part of a debugging investigation. Without such automatic detection it is very hard to prevent gradual configuration drift in a large system. TCP/IP checksums are weak, and messages will be silently corrupted unless they are protected by additional application-level checksums. Networking hardware will malfunction and start flipping large numbers of bits; this both causes a storm of retries, and makes it inevitable that some errors will remain undetected by TCP. Computers will spontaneously start running very slowly, but keep making progress, so systems need to tolerate and detect this as well as fail-stop errors. Throttling and load shedding are crucial in all aspects of an automated system. Failure detectors must be able to distinguish between the symptoms of failure and overloading, otherwise overloaded computers may be marked as failed and removed from service, amplifying the problem and triggering a cascade of failures that disables the entire application.

Autopilot has been continuously operating its oldest clusters since the pre-release Windows Live Search engine was deployed in 2004. Over that time we have rewritten many components and substituted them in place, without ever bringing down a major production cluster for planned maintenance. Autopilot supports all forthcoming large-scale deployments inside Microsoft, and
some legacy services have already been ported to run on Autopilot clusters. Autopilot supports a vastly lower cost of management than legacy Microsoft services, with a very high level of reliability. Up to this point, there has been no major outage of a customer-facing service that can be directly attributed to an Autopilot failure.
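The lesson above about weak TCP checksums can be made concrete with a small message-framing sketch in Python; the frame layout shown here is invented for illustration and is not Autopilot's wire format.

```python
import hashlib
import struct


def frame(payload: bytes) -> bytes:
    """Prefix a message with its length and an application-level digest.

    TCP's 16-bit checksum is too weak to rely on at data-center scale, so
    every message carries its own end-to-end check. (Sketch only; the
    actual Autopilot message format is not described in the paper.)
    """
    digest = hashlib.sha256(payload).digest()
    return struct.pack("!I", len(payload)) + digest + payload


def unframe(message: bytes) -> bytes:
    """Verify and strip the framing added by frame(); raise on corruption."""
    (length,) = struct.unpack("!I", message[:4])
    digest, payload = message[4:36], message[36:36 + length]
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("application-level checksum mismatch")
    return payload
```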
10. ACKNOWLEDGMENTS
As mentioned in the Introduction, the design and implementation of Autopilot was led by the Windows Live Search core team. Many members of that team have shared comments and advice to help make this an accurate and representative depiction of the system; however, any remaining errors are the responsibility of the author. Autopilot's success has only been possible due to the collaboration and hard work of many product teams spanning developers, testers, program management, and of course data center operational staff at all levels. I would also like to thank Darren Shakib, Kevin Kaufmann, Martín Abadi, Mike Schroeder, Andrew Birrell and John MacCormick for many helpful comments on improving the content and presentation of the paper.
11. REFERENCES
[1] Ajmani, S., Liskov, B. and Shrira, L. Modular Software Upgrades for Distributed Systems. 20th European Conference on Object-Oriented Programming, July 2006, 452–476.
[2] Barroso, L.A., Dean, J. and Hölzle, U. Web Search for a Planet: The Google Cluster Architecture. IEEE Micro, 2003.
[3] Brown, A. and Patterson, D.A. Embracing Failure: A Case for Recovery-Oriented Computing (ROC). High Performance Transaction Processing Symposium, October 2001.
[4] Candea, G. and Fox, A. Crash-Only Software. 9th Workshop on Hot Topics in Operating Systems, May 2003, 67–72.
[5] Gentzsch, W., Iwano, K., Johnston-Watt, D., Minhas, M.A. and Yousif, M. Self-adaptable autonomic computing systems: an industry view. 16th International Workshop on Database and Expert Systems Applications, August 2005, 201–205.
[6] Lamport, L. The Part-Time Parliament. ACM Transactions on Computer Systems 16, 2 (May 1998), 133–169.
[7] Microsoft Active Directory for Windows Server 2003. http://www.microsoft.com/windowsserver2003/technologies/directory/activedirectory/default.mspx