MOF SMF Availability Management
Microsoft® Solutions for Management
The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the
date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a
commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of
publication.
This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR
STATUTORY, AS TO THE INFORMATION IN THIS DOCUMENT.
Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part
of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means
(electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of
Microsoft Corporation.
Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject
matter in this document. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this
document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.
Unless otherwise noted, the example companies, organizations, products, domain names, e-mail addresses, logos, people, places
and events depicted herein are fictitious, and no association with any real company, organization, product, domain name, email
address, logo, person, place or event is intended or should be inferred.
© 2002 Microsoft Corporation. All rights reserved.
Microsoft is either a registered trademark or trademark of Microsoft Corporation in the United States and/or other countries.
The names of actual companies and products mentioned herein may be the trademarks of their respective owners.
Contents
Document Purpose
Executive Summary
Process and Activities
  Overview
    New IT Services
    Existing IT Services
    Goals and Objectives
    Scope
    Key Definitions
    Major Processes
  Define Service Level Requirements
    Define Critical Customer Functions
    Define Availability Requirements
  Propose Availability Solution
    Identify Major Information Technology Service Components
    Design for Availability
    Availability Risks and Countermeasures
    Life Cycle Management Needs
    Design for Recovery
    Incident Life Cycle
    Designing for Customer Satisfaction During Outages
    Management Processes
  Formalize Operating Level Agreements
Roles and Responsibilities
  Availability Manager
Relationship to Other Processes
  Service Level Management
  Financial Management
  Workforce Management
  Service Continuity Management
  Capacity Management
  Change Management
Contributors
Document Purpose
This guide provides detailed information about the availability
management service management function (SMF) for
organizations that have deployed, or are considering deploying,
Microsoft technologies in a data center or other type of enterprise
computing environment. This is one of the more than 20 SMFs
defined and described in Microsoft® Operations Framework
(MOF). The guide assumes that the reader is familiar with the
intent, background, and fundamental concepts of MOF as well as
the Microsoft technologies discussed.
An overview of MOF and its companion, Microsoft Solutions
Framework (MSF), is available in the Introduction to Service
Management Functions guide. This overview guide also provides
abstracts of each of the service management functions defined
within MOF. Detailed information about the concepts and
principles of each framework is also provided in technical
papers at www.microsoft.com/solutions/msm.
2 Service Management Function Guide
Executive Summary
Availability has become one of the most important aspects of
service delivery in the highly visible e-business global economy.
Consequently, the demand for 24-hours-a-day, 7-days-a-week
operation is greater than ever. Availability, or the lack of it, has a
dramatic influence on customer satisfaction and can very quickly
impact the overall reputation and success of the enterprise.
Availability management is responsible for ensuring that service-
affecting incidents do not occur, or that timely and effective
action is taken when they do.
Risks to availability may be caused by technology, processes and
procedures, and human error. Countermeasures, such as
carefully designed testing and release procedures and
appropriate staff training plans, can be employed to help
mitigate these risks. Risks to availability exist throughout the
whole IT infrastructure and within every management process.
Although not directly responsible for each of these processes,
availability management is responsible for making sure that all
areas of risk to availability are taken into account and that the
overall IT infrastructure and the maturity of management
processes supporting a given IT service are sufficient.
Availability management and service continuity management are
closely related in this respect as both processes strive to eliminate
risks to the availability of IT services. The prime focus of
availability management is handling the routine risks to
availability that can be reasonably expected to occur on a day-to-
day basis. Rare, expensive, or unanticipated risks are handled by
service continuity management.
Availability Management 3
New IT Services
New IT services provide the best opportunity for achieving
availability targets in a cost-effective manner because availability
considerations can be built in from the earliest stages. This allows
the most appropriate technologies to be selected and an IT
support infrastructure to be built that provides the required level
of operational maturity.
The customer and IT organization have the best opportunity in
this scenario to work closely together on the definition and level
of availability to be provided by the IT service and to agree upon
the level of investment required. This prevents inappropriate
expectations from emerging in the future and allows any
mismatches between the levels of availability and the investment
required to be resolved early.
The aim with new IT services is to achieve the desired availability
targets from day one and to successfully manage the levels of
availability throughout the life cycle of the solution. It is
particularly important to manage levels of availability during the
introduction of the functional and technological changes
demanded by today’s fast-moving business environments.
Existing IT Services
Existing IT services can have their availability levels significantly
improved or stabilized through the adoption of a formal
availability management process. They can then benefit from an
ongoing continuous improvement process and careful
management of future changes.
The challenge of improving availability levels in existing IT
services is that they often come with a legacy of design constraints
and technology challenges that may not be cost-effective to
overcome. That is why building availability in from the very
beginning is so important.
The life cycle approach is very similar to that for new IT services.
It begins with agreeing on a definition of availability with the
customer and determining an appropriate budget for improvements
and ongoing maintenance that can be justified by the cost of
downtime.
In some respects, existing IT services have an advantage over
new IT services in that they have a track record of service
delivery that can be examined in detail and any shortcomings
and areas of exposure addressed. The availability design process
includes an investigation of the history of service outages
experienced by the customer, as well as root cause analysis as
appropriate.
At any one time, the role of the availability manager invariably
includes both the improvement of existing IT services and
the introduction of new IT services. In addition, the introduction
of major change to an existing IT service, such as upgrading from
one technology platform to another, involves a mix of these
two scenarios. The basic process to be followed is the same in
either case.
Scope
Availability management is concerned with the design,
implementation, measurement, and management of IT
infrastructure availability to ensure that stated business
requirements for availability are consistently met. In particular:
● Availability management should be applied to all new IT
services and to established services where service level
requirements (SLRs) or service level agreements (SLAs) are
in place.
● Availability management can be applied to IT services that
are defined as critical business functions, even when no
SLA exists.
● Availability management can be applied to the suppliers
(internal and external) that form the IT support
organization as a precursor to the creation of a formal SLA.
● Availability management considers all aspects of the IT
infrastructure and supporting organization that may
impact availability, including training, skills, policy,
process effectiveness, procedures, and tools.
● Availability management is not responsible for business
continuity management or the resumption of business
processing after a major disaster. This is the responsibility
of the service continuity management SMF. However,
availability management is closely related to that SMF and
provides key inputs to it.
Key Definitions
The following are key definitions within the availability
management processes:
Availability. Ability of a component or service to perform its
required function at a stated instant or over a stated period
of time.
Countermeasures. Actions taken to prevent or reduce the
effect of an identified risk.
Critical business functions. The critical elements of the
business process supported by an IT service.
Downtime. The unavailability of the IT service during hours
when the business requires the systems to be available, as
advertised within SLAs.
End-to-end service. All components of the IT Infrastructure
required for delivering an IT service.
High availability. Minimizing or masking component
failures.
Incident life cycle. An availability technique that breaks an
incident down into stages so that each stage can be timed
and measured.
Maintainability. The ability of an IT infrastructure
component to be retained in, or restored to, an operational
state.
Operating level agreement. An internal agreement covering
the delivery of services that support the IT service provider
in the delivery of services.
Risk Management. The identification, selection, and
implementation of countermeasures to the identified risks
to assets to reduce them to an acceptable level.
Reliability. The freedom from failure of services and
components over a given period of time.
Serviceability. The contractual arrangements made with
third-party IT service providers to provide or maintain IT
services or components.
Service level agreement. Written agreement between a service
provider and the customer(s) that documents agreed
service levels for a service.
Service outages. See downtime.
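The definitions of availability and downtime above are often combined into a single percentage figure. The following is a minimal sketch, assuming the commonly used ratio of delivered service time to agreed service time. The function name and the sample figures are illustrative assumptions and do not come from this guide.

```python
def availability_percent(agreed_service_hours: float, downtime_hours: float) -> float:
    """Return availability as a percentage of the agreed service time.

    Assumption: only downtime that falls inside the agreed service hours
    (the hours advertised in the SLA) is counted, in line with the
    definition of downtime given above.
    """
    if agreed_service_hours <= 0:
        raise ValueError("agreed_service_hours must be positive")
    if not 0 <= downtime_hours <= agreed_service_hours:
        raise ValueError("downtime_hours must be between 0 and agreed_service_hours")
    return 100.0 * (agreed_service_hours - downtime_hours) / agreed_service_hours


if __name__ == "__main__":
    # Hypothetical example: a service agreed for 24 hours a day over a
    # 30-day month (720 hours) with 3.6 hours of downtime in that period.
    print(f"{availability_percent(720, 3.6):.2f}%")
```

Because the calculation depends on the agreed service hours, the same outage yields a different percentage for a 24-hours-a-day service than for a business-hours service, which is one reason the agreed hours need to be stated in the SLA.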
Major Processes
Availability management comprises three main processes and
a number of subprocesses as follows:
● Define service level requirements
    ● Define critical customer functions
    ● Define availability objectives
● Propose availability solution
    ● Identify major information technology service components
    ● Design for availability
    ● Availability risks and countermeasures
    ● Life cycle management needs
    ● Design for recovery
    ● Incident life cycle
    ● Designing for customer satisfaction during outages
    ● Management processes
● Formalize operating level agreements
[Flow: Start -> Define Service Level Objectives -> Propose Availability Solution -> Formalize Operating Level Agreements -> End]
Figure 1
Availability process flow diagram
● Facilities domain:
    ● Insufficient air-conditioning capacity.
    ● Power outages.
    ● Power surges and spikes.
    ● Fire and flood.
    ● Physical security.
● Egress domain:
    ● Single power feed from utility.
    ● Single communications feed from Telco.
● Personnel:
    ● Poor quality procedures.
    ● Lack of discipline.
    ● Lack of skills.
[Decision flow:
Is failure expected and the countermeasure affordable?
    NO: Create contingency plan
    YES: Design and implement countermeasure
Incident occurs
Is there a countermeasure and did it work?
    YES: Business as usual
    NO: (outcome not shown)]
Figure 2
Relationship between availability management and service continuity
management
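The two decisions in Figure 2 can be sketched as a pair of small functions. This is an illustrative sketch only: the names `Plan`, `Outcome`, `plan_for_risk`, and `outcome_of_incident` are hypothetical, and the outcome of the second "NO" branch (invoking the contingency plan under service continuity management) is an assumption, since the figure text available here does not show it.

```python
from enum import Enum


class Plan(Enum):
    """Planning-time decision for an identified availability risk."""
    COUNTERMEASURE = "design and implement countermeasure"
    CONTINGENCY = "create contingency plan"


class Outcome(Enum):
    """Result when an incident actually occurs."""
    BUSINESS_AS_USUAL = "business as usual"
    # Assumption: the figure's second NO branch is not shown in the text,
    # so invoking the contingency plan is a guess at the intended outcome.
    INVOKE_CONTINGENCY = "invoke contingency plan"


def plan_for_risk(failure_expected: bool, countermeasure_affordable: bool) -> Plan:
    """First decision: is the failure expected and the countermeasure affordable?"""
    if failure_expected and countermeasure_affordable:
        return Plan.COUNTERMEASURE
    return Plan.CONTINGENCY


def outcome_of_incident(has_countermeasure: bool, countermeasure_worked: bool) -> Outcome:
    """Second decision: is there a countermeasure, and did it work?"""
    if has_countermeasure and countermeasure_worked:
        return Outcome.BUSINESS_AS_USUAL
    return Outcome.INVOKE_CONTINGENCY


if __name__ == "__main__":
    # Hypothetical risks: a routine disk failure and a rare regional flood.
    print(plan_for_risk(failure_expected=True, countermeasure_affordable=True).value)
    print(plan_for_risk(failure_expected=False, countermeasure_affordable=False).value)
```

Keeping the planning decision separate from the incident-time outcome mirrors the split between routine risks handled by availability management and the remaining risks handed to service continuity management.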
[Timeline: Incident occurs ... Normal service resumed]
Figure 3
Incident life cycle
The time taken during each of these stages affects the overall
period of downtime due to this incident and the availability of
the IT service as a whole. Designing for recovery is concerned
with the efficient handling of each stage in this life cycle for every
IT component involved in the support of critical business
functions and transactions.
[Timeline: Duration of incident (downtime) = elapsed time to detection + response time + repair time + recovery time, followed by time between failures (uptime)]
Figure 4
Incident life cycle showing time between failures
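The relationship in Figure 4 between downtime and time between failures can be turned into simple summary figures. The sketch below is illustrative and assumes that each incident is recorded as four stage durations plus the uptime that follows it. The `Incident` record layout, the `summarize` function, and all sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Incident:
    """Stage durations for one incident, in hours (hypothetical record layout)."""
    detection: float     # elapsed time to detection
    response: float      # response time
    repair: float        # repair time
    recovery: float      # recovery time
    uptime_after: float  # time between failures (uptime) after this incident

    @property
    def downtime(self) -> float:
        """Duration of the incident: the sum of the four downtime stages."""
        return self.detection + self.response + self.repair + self.recovery


def summarize(incidents: List[Incident]) -> Dict[str, float]:
    """Return mean downtime, mean uptime, and availability over all incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total_down = sum(i.downtime for i in incidents)
    total_up = sum(i.uptime_after for i in incidents)
    if total_up + total_down <= 0:
        raise ValueError("total observed time must be positive")
    return {
        "mean_downtime_hours": total_down / len(incidents),
        "mean_uptime_hours": total_up / len(incidents),
        "availability_percent": 100.0 * total_up / (total_up + total_down),
    }


if __name__ == "__main__":
    history = [
        Incident(detection=0.5, response=0.5, repair=2.0, recovery=1.0, uptime_after=196.0),
        Incident(detection=0.25, response=0.25, repair=1.0, recovery=0.5, uptime_after=98.0),
    ]
    for name, value in summarize(history).items():
        print(f"{name}: {value:.2f}")
```

Breaking downtime into stages in this way shows which stage contributes most to an outage, which is where designing for recovery is likely to pay off first.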
Incident Diagnosis
The moment at which the true cause, as opposed to any initial
symptoms, has been identified. Diagnosis includes the time
taken to respond to the detected event, to identify appropriate
resources to work on the cause, and to get them into a
position where they can interact with the system.
Once engaged, specialists need access to a knowledge base of
known problems, accurate configuration information, recent
change history, appropriate diagnostic tools, and an effective
escalation path and contacts list.
Incident Repair
The moment at which any underlying failure or system issue has
been repaired or worked around.
Using the earlier incident examples, repair might mean the
replacement of an IT component, the restoration of power,
implementation of an emergency application fix, or the restart of
a server. Considerations include off-hours call rotation,
appropriate contracts with internal groups and external vendors,
spare equipment on site, and so on.
Repair does not mean that the IT service is fully available once
more or indeed that it is even back up and running.
Incident Recovery
The moment at which any recovery has been completed and the
IT component is ready to resume normal processing.
For example, a replacement disk drive needs to have its data
restored either from backups or from an on-line process before it
can be used for production. Considerations include the provision
of detailed recovery processes for IT components and the
maintenance of appropriate interrelationships and dependencies.
Incident Restoration
The moment at which normal service is restored and the business
function or transaction becomes fully available.
This requires synchronization with the customer and a means of
communication with all users.
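Because each stage above is defined by a moment in time, the stage durations can be derived from recorded timestamps. The following is a minimal sketch under stated assumptions: the `IncidentTimeline` class and its field names are hypothetical, and the `occurred` and `detected` moments are assumed to be recorded as well, although they are not described in this part of the guide.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict


@dataclass
class IncidentTimeline:
    """Moments recorded for a single incident (hypothetical field names)."""
    occurred: datetime   # assumption: start of the incident
    detected: datetime   # assumption: moment the incident was detected
    diagnosed: datetime  # true cause identified
    repaired: datetime   # underlying failure repaired or worked around
    recovered: datetime  # component ready to resume normal processing
    restored: datetime   # normal service restored to the business

    def stage_durations(self) -> Dict[str, timedelta]:
        """Return the elapsed time of each stage, checking chronological order."""
        moments = [
            ("occurred", self.occurred),
            ("detected", self.detected),
            ("diagnosed", self.diagnosed),
            ("repaired", self.repaired),
            ("recovered", self.recovered),
            ("restored", self.restored),
        ]
        stage_names = ["detection", "diagnosis", "repair", "recovery", "restoration"]
        durations: Dict[str, timedelta] = {}
        for stage, (prev, cur) in zip(stage_names, zip(moments, moments[1:])):
            if cur[1] < prev[1]:
                raise ValueError(f"{cur[0]} is earlier than {prev[0]}")
            durations[stage] = cur[1] - prev[1]
        return durations

    def total_downtime(self) -> timedelta:
        """Downtime as seen by the business: from occurrence to restoration."""
        return self.restored - self.occurred


if __name__ == "__main__":
    start = datetime(2002, 1, 1, 9, 0)  # hypothetical incident start
    timeline = IncidentTimeline(
        occurred=start,
        detected=start + timedelta(minutes=10),
        diagnosed=start + timedelta(minutes=40),
        repaired=start + timedelta(minutes=100),
        recovered=start + timedelta(minutes=130),
        restored=start + timedelta(minutes=135),
    )
    for stage, duration in timeline.stage_durations().items():
        print(f"{stage}: {duration}")
    print(f"total downtime: {timeline.total_downtime()}")
```

Measuring downtime up to restoration rather than repair reflects the point made above that a repaired component does not mean the IT service is available again.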
Management Processes
Many of the considerations for availability in regard to
management processes are already covered by previously
mentioned availability design activities. However, the impact
that the management processes within the MOF model have on
the availability of IT services is significant enough to
warrant separate mention.
Availability management needs to ensure that the MOF
processes used for the support of critical IT services are mature
enough and have the necessary people, skills, and tools to
effectively undertake their respective responsibilities. The design
process should look in detail at each of the management
processes involved in the support of the IT service being
considered.
An effective tool to help with this responsibility and also with the
complete availability design process is an availability review or
assessment service from an outside organization specializing in
availability management, ITIL, and MOF. Such a service can help
establish a baseline of maturity for any existing IT infrastructure
and compare and contrast this to the needs of new or existing IT
services being deployed.
Availability Manager
Main Responsibilities
The availability manager is responsible for managing the
activities of the availability management process. This individual
is responsible for ensuring that any given IT service delivers the
levels of availability agreed upon with the customer and for
interfacing with all other management processes in pursuit of
this goal.
The availability manager:
● Ensures customer requirements are correctly translated
into realistic availability goals.
● Ensures appropriate IT budgets are established for
protecting the service.
● Oversees planning activities in relation to designing for
availability and designing for recovery.
● Ensures that all risks to availability are identified and
appropriately handled.
● Undertakes availability modeling to help select the most
appropriate countermeasures, assesses the impact of future
changes, and identifies potential improvements.
● Implements cost-effective countermeasures to single points
of failure where possible.
● Ensures that remaining gaps are identified to the customer
and ultimately handled by service continuity management
when required.
● Ensures that the overall IT infrastructure is mature enough
to support the availability needs.
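The availability modeling and single-point-of-failure responsibilities listed above can be illustrated with a simple calculation. This is a sketch only, assuming that components fail independently of one another. The function names and the component availabilities are hypothetical and are not taken from this guide.

```python
import math
from typing import Sequence


def _check(availabilities: Sequence[float]) -> None:
    """Validate that every availability is a fraction between 0 and 1."""
    if not availabilities:
        raise ValueError("at least one component availability is required")
    for a in availabilities:
        if not 0.0 <= a <= 1.0:
            raise ValueError("availabilities must be fractions between 0 and 1")


def serial_availability(availabilities: Sequence[float]) -> float:
    """Availability of a chain in which every component must be up.

    Each component is a single point of failure, so the availabilities
    multiply (assuming independent failures).
    """
    _check(availabilities)
    return math.prod(availabilities)


def redundant_availability(availabilities: Sequence[float]) -> float:
    """Availability of a redundant group that is up if any one member is up.

    The group fails only when all members fail at once (assuming
    independent failures).
    """
    _check(availabilities)
    return 1.0 - math.prod(1.0 - a for a in availabilities)


if __name__ == "__main__":
    # Hypothetical service: a web tier (0.99) in front of a database (0.98).
    single_db = serial_availability([0.99, 0.98])
    # Same service with the database duplicated as a countermeasure.
    mirrored_db = serial_availability([0.99, redundant_availability([0.98, 0.98])])
    print(f"single database:   {single_db:.4f}")
    print(f"mirrored database: {mirrored_db:.4f}")
```

In this hypothetical case the end-to-end figure rises from about 97.02 percent to about 98.96 percent, and the web tier becomes the dominant remaining single point of failure. Real components rarely fail fully independently, so a model like this gives an optimistic upper bound for the redundant case.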
Figure 5
Microsoft Operations Framework optimizing quadrant
Financial Management
Financial management acts as a filter, ensuring that solutions
proposed by availability management, capacity management, or
service continuity management can be justified in terms of their
cost to implement versus their benefit to the customer. Financial
management strives to monitor, control, and, if necessary,
recover costs incurred by the IT organization.
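The filtering role described above amounts to comparing the cost of a proposed solution with the downtime cost it is expected to avoid. The following is a minimal sketch of that comparison. The function names, the yearly time frame, and all figures are illustrative assumptions, and a real justification would normally consider more factors than these.

```python
def net_annual_benefit(
    countermeasure_cost_per_year: float,
    downtime_hours_avoided_per_year: float,
    downtime_cost_per_hour: float,
) -> float:
    """Return avoided downtime cost minus the yearly cost of the countermeasure."""
    values = (
        countermeasure_cost_per_year,
        downtime_hours_avoided_per_year,
        downtime_cost_per_hour,
    )
    if any(v < 0 for v in values):
        raise ValueError("costs and hours must not be negative")
    avoided_cost = downtime_hours_avoided_per_year * downtime_cost_per_hour
    return avoided_cost - countermeasure_cost_per_year


def is_justified(
    countermeasure_cost_per_year: float,
    downtime_hours_avoided_per_year: float,
    downtime_cost_per_hour: float,
) -> bool:
    """A proposal passes this simple filter only if its net benefit is positive."""
    return net_annual_benefit(
        countermeasure_cost_per_year,
        downtime_hours_avoided_per_year,
        downtime_cost_per_hour,
    ) > 0


if __name__ == "__main__":
    # Hypothetical proposal: a standby server costing 50,000 per year that
    # is expected to avoid 10 hours of downtime costing 8,000 per hour.
    print(net_annual_benefit(50_000, 10, 8_000))
    print(is_justified(50_000, 10, 8_000))
```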
Workforce Management
Whenever a new technology is introduced into the IT
environment, the people who run that technology must be
properly trained and motivated. Workforce management ensures
that existing personnel are trained and prepared to operate a
new availability solution when it is ready.
Capacity Management
Capacity management ensures that appropriate IT resources are
available to meet customer requirements by planning for
additional resources as current system resource use begins to
near the point of full capacity. Availability management is
closely tied to this process, because the optimal use of IT
resources to meet performance levels at a justifiable cost is
closely related to the outcome of effective availability
management. Availability reporting and
measurement highlight availability trends indicating capacity or
performance issues.
Change Management
Contributors
Many of the practices that this document describes are based on
years of IT implementation experience by Accenture, Avanade,
Microsoft Consulting Services, Fox IT, Hewlett-Packard
Company, Lucent Technologies/NetworkCare Professional
Services, and Unisys Corporation.
Microsoft gratefully acknowledges the generous assistance of
these organizations in providing material for this document.
Program Management Team
William Bagley, Microsoft Corporation
Jeff Yuhas, Microsoft Corporation
Lead Writers
James Westover, Hewlett Packard Corporation
Ashley Hanna, Hewlett Packard Corporation
Contributing Writers
William Bagley, Microsoft Corporation
Vicky Howells, Fox IT
Jeff Yuhas, Microsoft Corporation
Editors
Patricia Rytkonen, Volt Technical Services
Sybil Wood, Volt Technical Services