Introduction to IBM Power® Reliability, Availability, and Serviceability for POWER9® processor-based systems using IBM PowerVM™
With Updates covering the latest Power10 processor-based systems
Section 5: Serviceability....................................................................... 89
Service Environment .............................................................................................................. 89
Service Interface .................................................................................................................... 89
First Failure Data Capture and Error Data Analysis .............................................................. 90
Diagnostics............................................................................................................................. 90
Automated Diagnostics ..................................................................................................................... 90
Stand-alone Diagnostics ................................................................................................................... 90
Concurrent Maintenance ....................................................................................................... 91
Service Labels ....................................................................................................................... 91
QR Labels .............................................................................................................................. 91
| Feature | POWER9 1s and 2s IBM Power Systems^ | POWER9 IBM Power System E950 | POWER9 IBM Power System E980 | Power10 IBM Power E1080 |
| CRC checked processor fabric bus retry with spare data lane and/or bandwidth reduction | N/A | N/A | Yes | Yes – Power10 multi-node SMP fabric RAS design removes active components on cable and introduces … |
| Uses IBM memory buffer and has spare DRAM module capability with x4 DIMMs* | No | Yes | Yes | Yes – new memory buffer |
| Active Memory Mirroring for the Hypervisor | No | Yes - base | Yes - base | Yes - base feature |
| Redundant/spare voltage phases on voltage converters for levels feeding processor and custom memory DIMMs or memory risers | No | Redundant | Both redundant and spare | Yes for processors; DDIMMs use on-board Power Management Integrated Circuits (PMICs) |
| Redundant processor clocks | No | No | Yes | Yes – new design |
* In scale-out systems Chipkill capability is per rank of a single Industry Standard DIMM
(ISDIMM); in IBM Power E950 Chipkill and spare capability is per rank spanning across an
ISDIMM pair; and in the IBM Power E980, per rank spanning across two ports on a
Custom DIMM.
The Power E950 system also supports DRAM row repair
^ IBM Power® S914, IBM Power® S922, IBM Power® S924, IBM Power® H922, and IBM Power® H924
Note: I/O adapter and device concurrent maintenance in this document refers to the hardware capability
to allow the adapter or device to be removed and replaced concurrent with system operation. Support
from the operating system is required and will depend on the adapter or device and configuration
deployed.
| Feature | Power10 1s and 2s IBM Power System S10xx | Power10 IBM Power System E1050 | Power10 IBM Power System E1080 |
| PCIe hot-plug with processor-integrated PCIe controller | Yes – no cassette | Yes – with blindswap cassette | Yes – with blindswap cassette |
| Dynamic memory row repair and spare DRAM capability | 2U DDIMM – no spare DRAM; 4U DDIMM (post GA) – 2 spare DRAM per rank; Yes – dynamic row repair | Yes - base; 4U DDIMM – 2 spare DRAM per rank; Yes – dynamic row repair | Yes - base; 4U DDIMM – 2 spare DRAM per rank; Yes – dynamic row repair |
| Active Memory Mirroring for the Hypervisor | Yes – base (new to scale-out) | Yes - base | Yes - base |
| Processor clocks redundancy/sparing | No | Integrated spare (post GA) | Redundant |
| Concurrent Op Panel repair | No – Op Panel base; Yes – LCD (post GA) | Yes – Op Panel base; Yes – LCD | Yes – Op Panel base; Yes – LCD |
Not illustrated in the figure above are the internal connections for the SMP fabric busses, which will be discussed in detail in another section.
In comparing the Power E1080 design to the POWER9-based Power E980 system, it is also interesting to
note that the processor clock function has been separated from the local service functions and now
resides in a dual fashion as separate clock cards.
The POWER8 multi-drawer system design required that all processor modules be synchronized across all
CEC drawers. Hence a redundant clock card was present in the system control drawer and used for all
the processors in the system.
The memory subsystem of the Power E1080 system has been completely redesigned to support
Differential DIMMs (DDIMMs) with DDR4 memory that leverage a serial interface to communicate
between processors and the memory.
A memory DIMM is generally considered to consist of one or more “ranks” of memory modules (DRAMs). A standard DDIMM may consist of 1 or 2 ranks of memory and is approximately two standard “rack units” high, called a 2U DDIMM. The Power E1080 system exclusively uses a larger DDIMM with up to 4 ranks per DDIMM (called a 4U DDIMM). This allows not only for additional system capacity but also for room for additional RAS features to better handle failures on a DDIMM without needing to take a repair action (additional self-healing features).
Power E1080 CEC Drawer to CEC Drawer SMP Fabric Interconnect Design
The SMP fabric busses used to connect processors across CEC nodes are similar in RAS function to the fabric bus used between processors within a CEC drawer. Each bus is functionally composed of eight bi-directional lanes of data. CRC checking with retry is also used, and half-bandwidth mode is supported.
Unlike the processor-to-processor within a node design, the lanes of data are carried from each
processor module through internal cables to external cables and then back through internal cables to the
other processor.
Physically, each processor module has eight pads (four on each side of the module). Each pad side has
an internal SMP cable bundle which connects from the processor pads to a bulkhead in each CEC drawer
which allows the external and internal SMP cables to be connected to each other.
Figure 5: SMP Fabric Bus Slice
The illustration above shows just the connections on one side of the processor.
In addition to connecting with the bulkhead, each cable bundle also has connections to an SMP cable
Validation Card which has logic used to verify the presence and type of a cable to help guide the
installation of cables in a system.
Though it is beyond the scope of this whitepaper to delve into the exact details of how TDR works, as a very rough analogy it can be likened to a form of sonar: when desired, the processor module that drives a signal on a lane can generate an electrical pulse along the path to the receiving processor in another CEC drawer. If there is a fault along the path, the driving processor can detect a kind of echo or reflection of the pulse. The time it takes for the reflection to be received is indicative of where the fault is within the cable path.
For faults that occur mid-cable, the timing is such that TDR should be able to determine exactly what field
replaceable unit to replace to fix the problem. If the echo is very close to a connection, two FRUs might
be called out, but in any case, the use of TDR allows for good fault isolation for such errors while allowing
the Power10 system to take advantage of a fully passive path between processors.
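To make the sonar analogy concrete, the sketch below works through the distance arithmetic. It is purely illustrative: the propagation velocity, reflection time, and connector positions are assumed values for the example, not Power10 specifications, and the FRU call-out logic is a simplified stand-in for the actual service procedures.

```python
# Rough illustration of the TDR "sonar" analogy described above.
# All numbers here are illustrative assumptions, not Power10 specifications.

SIGNAL_VELOCITY_M_PER_S = 0.7 * 3.0e8   # assume ~70% of the speed of light in the cable


def estimate_fault_distance(reflection_time_s: float) -> float:
    """Estimate distance to a fault from the round-trip time of the reflected pulse.

    The pulse travels to the fault and back, so the one-way distance is
    velocity * time / 2.
    """
    return SIGNAL_VELOCITY_M_PER_S * reflection_time_s / 2.0


def call_out_frus(fault_distance_m: float, connector_positions_m: list[float],
                  tolerance_m: float = 0.05) -> str:
    """Decide whether the fault isolates to a single cable FRU or to two FRUs
    near a connector, mirroring the isolation behavior described in the text."""
    for pos in connector_positions_m:
        if abs(fault_distance_m - pos) <= tolerance_m:
            return f"fault near connector at {pos} m: call out the two adjoining FRUs"
    return f"fault at ~{fault_distance_m:.2f} m: call out the single cable FRU"


if __name__ == "__main__":
    # Example: a reflection observed 10 ns after the pulse was driven.
    distance = estimate_fault_distance(10e-9)
    # Hypothetical connector locations (internal-to-external cable transitions).
    print(call_out_frus(distance, connector_positions_m=[0.5, 2.5]))
```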
System Structure
A simplified view of the Power E1050 system design is represented in the figure below:
The E1050 maintains the same system form factor and infrastructure redundancy as the E950. As depicted in the E1050 system diagram below, there are four power supplies and fan field replaceable units (FRUs) to provide at least N+1 redundancy. These components can be concurrently maintained or hot added/removed. There is also N+1 Voltage Regulation Module (VRM) phase redundancy to the processors, and redundant Power Management Integrated Circuits (PMICs) supply voltage to the 4U DDIMMs that the E1050 offers.
The E1050 Op Panel base and LCD are connected to the same planar as the internal NVMe drives. The Op Panel base and LCD are separate FRUs and are concurrently maintainable. The NVMe backplane also has 2 USB 3.0 ports, accessible through the front of the system, for OS use. Not shown in the diagram are 2 additional OS USB 3.0 ports at the rear of the system, connected through the eBMC card.
Figure: Power E1050 system structure, simplified view (four power supplies; fan FRUs with dual rotors; ten internal NVMe bays on the DASD backplane along with the Op Panel base, LCD, and front USB ports; power distribution; the eBMC service processor card with RTC battery and system VPD; processors in dual-chip modules with their VRMs; memory DDIMMs; and PCIe slots C1 and C3 through C9 on the main planar board)
| | POWER9 E950 Memory | Power10 E1050 Memory | RAS impact |
| DIMM form factor | Riser card plus ISDIMMs | 4U DDIMM | P10 4U DDIMM: single FRU, fewer components to replace. E950 DIMM: separate FRUs used for the memory buffer on the riser card and for the ISDIMMs. |
| X4 Chipkill | One spare DRAM per port or across a DIMM pair | Two spare DRAM per port | P10 4U DDIMM: 1st chipkill fixed with a spare, 2nd chipkill fixed with a spare, 3rd chipkill fixed with ECC, 4th chipkill is uncorrectable. E950 DIMM: 1st chipkill fixed with a spare, 2nd chipkill fixed with ECC, 3rd chipkill is uncorrectable. |
NOTE: A memory ECC code is defined by how many bits or symbols (group of bits) it can correct. The
P10 DDIMM memory buffer ECC code organizes the data into 8-bit symbols and each symbol contains
the data from one DRAM DQ over 8 DDR beats.
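As an illustration of that symbol organization, the short sketch below gathers the bits driven by each DQ pin across the 8 beats of a burst into one 8-bit symbol. The pin count and data values are arbitrary examples used only to show why a failing DRAM corrupts whole symbols, which is what a symbol-oriented ECC is built to correct; it is not the actual DDIMM data layout.

```python
# Illustrative sketch of the symbol organization described in the note above:
# each 8-bit ECC symbol is built from the bits that one DRAM DQ pin drives
# across the 8 beats of a DDR burst. Pin counts and data values are arbitrary
# examples, not the actual DDIMM layout.

def beats_to_symbols(beats: list[list[int]]) -> list[int]:
    """Convert a burst (8 beats x N DQ bits per beat) into per-DQ 8-bit symbols.

    beats[b][d] is the bit driven by DQ pin d during beat b. Symbol d collects
    that pin's bits over all 8 beats, so a failing DRAM corrupts whole symbols
    rather than scattering single-bit errors across many symbols.
    """
    num_dq = len(beats[0])
    symbols = []
    for d in range(num_dq):
        value = 0
        for b in range(8):
            value = (value << 1) | beats[b][d]
        symbols.append(value)
    return symbols


if __name__ == "__main__":
    # 8 beats, 4 DQ pins (one hypothetical x4 DRAM) with arbitrary bit values.
    burst = [[(b + d) % 2 for d in range(4)] for b in range(8)]
    print(beats_to_symbols(burst))  # four 8-bit symbols, one per DQ pin
```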
DASD Options
The E1050 provides 10 internal NVMe drives at Gen4 speeds. The NVMe drives are connected to DCM0
and DCM3. In a 2S DCM configuration, only 6 of the drives are available. A 4S DCM configuration is
required to have access to all 10 internal NVMe drives. Unlike the E950, the E1050 has no internal SAS
drives. An external drawer can be used to provide SAS drives.
The internal NVMe drives support OS-controlled RAID0 and RAID1 arrays, but no hardware RAID. For best redundancy, OS mirroring and dual VIOS mirroring can be employed. To ensure as much separation as possible in the hardware path between mirror pairs, the following NVMe configuration is recommended:
a.) Mirrored OS: NVMe3,4 or NVMe8,9 pairs
b.) Mirrored dual VIOS
I. Dual VIOS: NVMe3 for VIOS1, NVMe4 for VIOS2
II. Mirroring the dual VIOS: NVMe9 mirrors NVMe3, NVMe8 mirrors NVMe4
The IBM Power10 E1050 system comes with a redesigned service processor based on a Baseboard
Management Controller (BMC) design with firmware that is accessible through open-source industry
standard APIs, such as Redfish. An upgraded Advanced System Management Interface (ASMI) web
browser user interface preserves the required enterprise RAS functions while allowing the user to perform
tasks in a more intuitive way.
Equipping the industry standard BMC with enterprise service processor functions that are characteristic of
FSP based systems, like the E1080, has led to the name Enterprise BMC (eBMC). As with the FSP, the
eBMC runs on its own power boundary and does not require resources from a system processor to be
operational to perform its tasks.
The service processor supports surveillance of the connection to the Hardware Management Console
(HMC) and to the system firmware (hypervisor). It also provides several remote power control options,
environmental monitoring, reset, restart, remote maintenance, and diagnostic functions, including console
mirroring. The BMC service processor's menus (ASMI) can be accessed concurrently during system operation, allowing nondisruptive ability to change system default parameters, view and download error logs, and check system health.
Redfish, an Industry standard for server management, enables the Power Servers to be managed
individually or in a large data center. Standard functions such as inventory, event logs, sensors, dumps,
and certificate management are all supported with Redfish. In addition, new user management features
support multiple users and privileges on the BMC via Redfish or ASMI. User management via
Lightweight Directory Access Protocol (LDAP) is also supported. The Redfish events service provides a
means for notification of specific critical events such that actions can be taken to correct issues. The
Redfish telemetry service provides access to a wide variety of data (e.g. power consumption, ambient,
core, DDIMM and I/O temperatures, etc.) that can be streamed on periodic intervals.
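As a rough sketch of what that looks like in practice, the example below reads the system inventory and thermal sensors from an eBMC using two standard Redfish collections (/redfish/v1/Systems and /redfish/v1/Chassis). The BMC address and credentials are placeholders, and the exact resources and properties exposed depend on the firmware level, so treat this as an illustration rather than a definitive client.

```python
# Minimal sketch of reading inventory and telemetry from an eBMC over Redfish.
# The BMC address and credentials are placeholders; the URI paths shown
# (/redfish/v1/Systems, /redfish/v1/Chassis/.../Thermal) are standard Redfish
# resources, but the exact resources exposed depend on the firmware level.
import requests

BMC = "https://ebmc.example.com"          # placeholder eBMC address
AUTH = ("admin", "password")               # placeholder credentials

session = requests.Session()
session.auth = AUTH
session.verify = False                      # lab-only; use proper certificates in production

# Enumerate the systems managed by this BMC.
systems = session.get(f"{BMC}/redfish/v1/Systems").json()
for member in systems.get("Members", []):
    system = session.get(f"{BMC}{member['@odata.id']}").json()
    print(system.get("Model"), system.get("PowerState"), system.get("Status"))

# Read thermal sensors (e.g., core and DDIMM temperatures) from the first chassis.
chassis = session.get(f"{BMC}/redfish/v1/Chassis").json()
first = chassis["Members"][0]["@odata.id"]
thermal = session.get(f"{BMC}{first}/Thermal").json()
for sensor in thermal.get("Temperatures", []):
    print(sensor.get("Name"), sensor.get("ReadingCelsius"))
```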
System Structure
There are multiple scale out system models (MTMs) supported. For brevity, this document focuses on
the largest configuration of the scale out servers.
The simplified illustration, in Figure 10, depicts the 2S DCM with 4U CEC drawer. Similar to the S9xx,
there is infrastructure redundancy in the power supplies and fans. In addition, these components can be
concurrently maintained along with the Op Panel base, Op Panel LCD, internal NVMe drives and IO
adapters.
Figure 10: Power S1024 system structure, simplified view with DCMs (four power supplies; six fan FRUs with dual rotors; up to sixteen internal NVMe bays on the DASD backplane along with the Op Panel base, LCD, and USB ports; the eBMC service processor with RTC battery, TPM, USB controller, clock circuitry, and other service functions; two Power10 dual-chip modules with their VRMs and memory DDIMMs; and PCIe slots C1 through C9 running at Gen4 x8/x16 or Gen5 x8, some shared with OpenCAPI or NVMe cards; there are no PCIe switches, so all I/O slots connect directly to the processors; concurrently replaceable and N+1 redundant components are indicated)
Figure 11: Power S1024 System Structure Simplified View With eSCM
(The eSCM version of the figure shows the same infrastructure, with each eSCM pairing an active Power10 chip with a 0-core companion chip, and PCIe slots C1 through C4 and C7 through C9 attached directly to the processors.)
| | POWER9 S9xx Memory | Power10 S10xx Memory | RAS impact |
| X4 Chipkill | Single DRAM chipkill correction, but no spare DRAM | Single DRAM chipkill correction, but no spare DRAM | P10 2U DDIMM: 1st chipkill fixed with ECC, 2nd chipkill is uncorrectable. S9xx DIMM: 1st chipkill fixed with ECC, 2nd chipkill is uncorrectable. |
NOTE: A memory ECC code is defined by how many bits or symbols (group of bits) it can correct. The
P10 DDIMM memory buffer ECC code organizes the data into 8-bit symbols and each symbol contains
the data from one DRAM DQ over 8 DDR beats.
DASD Options
The S10xx scale out servers provide up to 16 internal NVMe drives at Gen4 speeds. The NVMe drives
are connected to the processor via a plug-in PCIe NVMe JBOF (Just a Bunch Of Flash) card. Up to 2
JBOF cards can be populated in the S1024 and S1014 systems, with each JBOF card attached to an 8-
pack NVMe backplane. The S1022 NVMe backplane only supports the 4-pack, which provides up to 8 NVMe drives per system. The JBOF cards are operational in PCIe slots C8, C10, and C11 only. However,
the C8 and C10 slots cannot both be populated with JBOF cards simultaneously. As depicted in Figure
13 above, all 3 slots are connected to DCM0 which means a 1S system can have all the internal NVMe
drives available. While the NVMe drives are concurrently maintainable, a JBOF card is not. Unlike the
S9xx, the S10xx have no internal SAS drives. An external drawer can be used to provide SAS drives.
Figure: Simplified view of a Power E980 CEC drawer (four scale-up processors, each attached to CDIMMs; VRMs; local clock and local service function circuitry with RTC battery and interface cards; four internal NVMe bays (NVMe-C1 through NVMe-C4); PCIe slots C1 through C8; and redundant power supplies and fans fed from two AC input sources, all on the main planar board)
The design supports concurrent repair of the SMP cable for faults discovered during runtime that are contained to the cable, provided the repair is performed before the system is rebooted.2
While the new fabric bus design for across-node busses has a number of RAS advantages as described, it does mean more cables to install for each bus compared to the POWER8 design. The POWER9 design addressed this with a cable installation process using LEDs at the pull-tab cable connections to guide the servicer through a cable plugging sequence.
Clocking
The POWER8 multi-node system required that each processor in the system use a single processor
reference clock source. The POWER8 processor could take in two different sources, providing for
redundancy, but each source had to be the same for each processor across all 4 CEC drawers.
This meant that the system control drawer contained a pair of “global” clock cards and the outputs of each
card had to be cabled to each CEC drawer in the system through a pair of local clock and control cards.
1 Running in half-bandwidth mode reduces performance not only for customer data but also for hypervisor and other firmware traffic across the bus. Because of this, there are limitations in the mid-November 2018 and earlier firmware as to which connections and how many busses can be at half-bandwidth without causing a system impact.
2 Supported by at least firmware level 930. Concurrent repair is only for faults that are discovered during run-time and repaired before system restart.
NVMe Drives
POWER8 Enterprise systems CEC drawers had no internal storage. This was primarily due to the
expectation that Enterprise customers would make use of a Storage Area Network (SAN) for both user
data availability as well as the storage used by VIOS which provided virtualized I/O redundancy.
As an alternative, external DASD drawer were also available in POWER8. However, feedback from a
number of customers expressed a preference for using for the VIOS root volume group even when a SAN
was deployed for everything else.
To accommodate this need, POWER9 Enterprise systems have the option of internal NVMe drives for
such purpose as VIOS rootvg storage. Each CEC drawer in the Power E980 system has 4 NVMe drive
slots. The drives are sold in pairs, with the expectation that the data on each drive will be mirrored
The drives are connected to processors two and three using PCIe busses from each as shown earlier.
Figure: Planar I/O detail (system VPD, PNOR, and RTC support logic; two PCIe switch modules providing Gen3 x8/x4/x1 connections to PCIe slots C4 through C9, the USB controller, and the internal NVMe bays 1 through 4, all on the main planar board)
The memory risers in the illustration above are shown in more detail below:
Figure: Memory riser detail (each riser carries four memory buffers and eight ISDIMM slots)
The circled portion of the memory riser and DIMMs would be replaced by a single custom enterprise DIMM (CDIMM) in the Power E980 design.
In addition to not being as densely packaged as the CDIMM design, there are some differences in RAS capabilities with the E950 memory, which will be discussed in the next section on processor and memory RAS.
I/O
The figure indicates that there are multiple I/O adapter slots provided. Some are directly connected to the processor and others are connected through a PCIe switch. There is a connection for a USB 3.0 adapter.
Generally, I/O adapters are packaged in cassettes and are concurrently maintainable (given the right software configuration).
Like the Power E980, two processors provide connections to the NVMe drives (as illustrated). In addition,
the system provides connectivity for SAS DASD devices. These are connected using SAS adapters that
can be plugged into slots C12 and C9.
Figure: Simplified scale-out system structure (DASD backplane with SAS bays, Op Panel base, and LCD; four power supplies; VRMs; two scale-out processors, each with sixteen IS-DIMM slots; service functions with RTC battery, system VPD, PNOR, and processor control logic; two PCIe switch modules feeding PCIe slots C7 through C12; NVMe M.2 cards or SAS adapters; the legend distinguishes concurrently replaceable FRUs, FRUs with sub-components shown, spare sub-components within a FRU, I/O slots supporting concurrent repair, and components that collectively provide at least N+1 redundancy)
Note: There are differences in I/O connectivity between different models and there are multiple DASD
backplane options.
I/O Configurations
Like the Power E950, internal I/O slots have a mixture of slots directly attached to processors and those
connected through a switch.
Figure 21: Scale-out Systems I/O Slot Assignments for 1 and 2 Socket Systems
Figure: Gen3 I/O drawer structure (two I/O modules, each attached over x8 links and providing six I/O slots; a mid-plane; and redundant fans and power supplies with power/temperature control)
These I/O drawers are attached using a connecting card called a PCIe3 cable adapter that plugs into a PCIe slot of the main server. In POWER9 these cable cards were redesigned in certain areas to improve error handling. These improvements include new routing for clock logic within the cable card as well as additional recovery for faults during IPL.
The Power10 cable card in the CEC module is also improved by using a switch instead of a re-timer, which enables better fault isolation.
Each I/O drawer contains up to two I/O drawer modules. An I/O module uses 16 PCIe lanes controlled
from a processor in a system. Currently supported is an I/O module that uses a PCIe switch to supply six
PCIe slots.
Two different active optical cables are used to connect a PCIe3 cable adapter to the equivalent card in the I/O drawer module. While these cables are not redundant, as of FW830 firmware or later the loss of one cable simply reduces the I/O bandwidth (the number of lanes available to the I/O module) by 50%.
Infrastructure RAS features for the I/O drawer include redundant power supplies, fans, and DC outputs of
voltage regulators (phases).
The impact of the failure of an I/O drawer component can be summarized for most cases by the table
below.
The figure above indicates how IBM design and influence may flow through the different layers of a
representative enterprise system as compared to other designs that might not have the same level of
control. Where IBM provides the primary design and manufacturing test criteria, IBM can be responsible
for integrating all the components into a coherently performing system and verifying the stack during
design verification testing.
In the end-user environment, IBM likewise becomes responsible for resolving problems that may occur
relative to design, performance, failing components and so forth, regardless of which elements are
involved.
Incorporate Experience
Being responsible for much of the system, IBM puts in place a rigorous structure to identify issues that
may occur in deployed systems and identify solutions for any pervasive issue. Having support for the
design and manufacture of many of these components, IBM is best positioned to fix the root cause of
problems, whether changes in design, manufacturing, service strategy, firmware or other code is needed.
The detailed knowledge of previous system performance has a major influence on future systems design.
This knowledge lets IBM invest in improving the discovered limitations of previous generations. Beyond
that, it also shows the value of existing RAS features. This knowledge justifies investing in what is
important and allows for adjustment to the design when certain techniques are shown to be no longer of
much importance in later technologies or where other mechanisms can be used to achieve the same
ends with less hardware overhead.
It is not feasible to detect or isolate every possible fault or combination of faults that a server
might experience, though it is important to invest in error detection and build a coherent architecture for
how errors are reported and faults isolated. The sub-section on processor and memory error detection
and fault isolation details the IBM Power approach for these system elements.
It should be pointed out error detection may seem like a well understood and universal hardware
In general, I/O adapters may also have less hardware error detection capability where they can rely on a
software protocol to detect and recover from faults when such protocols are used.
Redundant Definition
Redundancy is generally a means of continuing operation in the presence of certain faults by providing
more components/capacity than is needed to avoid outages but where a service action will be taken to
replace the failed component after a fault.
Sometimes redundant components are not actively in use unless a failure occurs. For example, a
processor may only actively use one clock source at a time even when redundant clock sources are
provided.
In contrast, fans and power supplies are typically all active in a system. If a system is said to have “n+1” fan redundancy, for example, all “n+1” fans will normally be active in the system absent a failure. If a fan failure occurs, the system will run with “n” fans. In cases where there are fan or power supply failures, power and thermal management code may compensate by increasing fan speed or making other adjustments according to operating conditions per the power management mode and power/thermal management policy.
Spare Definition
A spare component is similar in nature, though when a spare is successfully used the system can continue to operate without the need to replace the failed component.
As an example, for voltage regulator output modules, if five output phases are needed to maintain the
power needed at the given voltage level, seven could be deployed initially. It would take the failure of
three phases to cause an outage.
If on the first phase failure, the system continues to operate, and no call out is made for repair, the first
failing phase would be considered spare. After the failure (spare is said to be used), the VRM could
experience another phase failure with no outage. This maintains the required n+1 redundancy. Should a
second phase fail, a “redundant” phase would then have been said to fail and a call-out for repair would
be made.
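The toy model below restates that phase accounting in code, using the example's numbers (five phases needed, seven deployed). It is only a sketch of the spare-versus-redundant distinction described above, not an actual power-management algorithm.

```python
# Toy model of the spare-versus-redundant phase accounting described above.
# The phase counts (5 needed, 7 deployed) come from the example in the text.

def classify_phase_failure(phases_needed: int, phases_deployed: int,
                           failures_so_far: int) -> str:
    """Return how the next phase failure is treated.

    A failure is absorbed silently as a "spare" while more than n+1 phases
    remain; once only n+1 remain, the next failure consumes the redundant
    phase and a repair is called out; with only n remaining, a further
    failure causes an outage.
    """
    remaining = phases_deployed - failures_so_far
    if remaining > phases_needed + 1:
        return "spare used: keep running, no repair call-out"
    if remaining == phases_needed + 1:
        return "redundant phase lost: keep running, call out a repair"
    return "outage: not enough phases to carry the load"


if __name__ == "__main__":
    for failure_number in range(1, 4):
        # failures_so_far counts the failures that happened before this one
        print(failure_number, classify_phase_failure(5, 7, failure_number - 1))
```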
Build System Level RAS Rather Than Just Processor and Memory RAS
IBM builds Power systems with the understanding that every item that can fail in a system is a potential
source of outage.
While building a strong base of availability for the computational elements such as the processors and
memory is important, it is hardly sufficient to achieve application availability.
The failure of a fan, a power supply, a voltage regulator, or I/O adapter might be more likely than the
failure of a processor module designed and manufactured for reliability.
Scale-out servers will maintain redundancy in the power and cooling subsystems to avoid system outages
due to common failures in those areas. Concurrent repair of these components is also provided.
For the Enterprise systems, a higher investment in redundancy is made. The Power E980 system, for example, is designed from the start with the expectation that the system must be largely shielded from failures of these other components causing persistent system unavailability, incorporating substantial redundancy within the service infrastructure (such as redundant service processors, redundant processor boot images, and so forth). Emphasis is also placed on the reliability of the components themselves, which are intended to be highly reliable and to last.
This level of RAS investment extends beyond what is expected and often what is seen in other server
designs. For example, at the system level such selective sparing may include such elements as a spare
voltage phase within a voltage regulator module.
Figure 25: Handled Errors Classified by Severity and Service Actions Required
The illustration above shows a rough view of the POWER9 scale-up processor design leveraging SMT8
cores (a maximum of 12 cores shown.)
The POWER9 design is certainly more capable. There is a maximum of 12 SMT8 cores compared to 2
SMT2 cores. The core designs architecturally have advanced in function as well. The number of memory
controllers has doubled, and the memory controller design is also different.
The addition of system-wide functions such as the NX accelerators and the CAPI and NVLINK interfaces
provide functions just not present in the hardware of the POWER6 system.
The POWER9 design is also much more integrated. The L3 cache is internal, and the I/O processor host
bridge is integrated into the processor. The thermal management is now conducted internally using the
on-chip controller.
There are reliability advantages to the integration. It should be noted, however, that when a failure does occur it is more likely to be a processor module at fault compared to previous generations with less integration.
Hierarchy of Error Avoidance and Handling
In general, there is a hierarchy of techniques used in POWER™ processors to avoid or mitigate the impact of hardware errors. At the lowest level in the hierarchy are the design for error detection and fault isolation and the technology employed, specifically as it relates to reducing the instances of soft errors, not only through error correction but also in the selection of devices within the processor IC.
Because of the extensive amount of functionality beyond just processor cores and caches, listing all the
RAS capabilities of the various system elements would require considerable detail. In somewhat broader
strokes, the tables below discuss the major capabilities in each area.
Cache Error Handling: The L2 and L3 caches in the processor use an ECC code that allows for …
Figure: POWER9 memory subsystem structures (a P9 scale-out processor with ISDIMMs; a P9 processor attaching memory buffers and ISDIMMs through a riser card; and a P9 processor attaching a memory buffer on a 2-rank CDIMM supporting two DRAM groups with a 128-bit ECC word; the path elements shown are the processor cache, memory controller, and memory bus driver/receiver)
On a DIMM, multiple DRAM modules are accessed to create a memory word. An “Industry Standard” DIMM is commonly used to supply processors fetching 64-byte cache lines.
A rank of an x4 “Industry Standard” DIMM contains enough DRAMs to provide 64 bits of data at a time, with enough check bits to correct the case of a single DRAM module being bad (after the bad DRAM has been detected) and then also to correct at least one additional faulty bit.
The ability to generally deal with a DRAM module being entirely bad is what IBM has traditionally called
Chipkill correction. This capability is essential in protecting against a memory outage and should be
considered as a minimum error correction for any modern server design.
Inherently, DIMMs with x8 DRAMs use half the number of memory modules, and there can be a reliability advantage just in that alone. From an error checking standpoint, however, only 4 bad adjacent bits in an ECC word need to be corrected to have Chipkill correction with an x4 DIMM. When x8 DIMMs are used, a single Chipkill event requires correcting 8 adjacent bits. A single Industry Standard DIMM in a typical design will not have enough check bits to handle a Chipkill event in an x8 DIMM.
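The back-of-the-envelope arithmetic behind that statement can be sketched as follows, assuming a conventional 72-bit-wide industry-standard DIMM rank (64 data bits plus 8 check bits per access); real ECC codes vary by design, so the numbers are illustrative.

```python
# Back-of-the-envelope arithmetic behind the x4 versus x8 Chipkill discussion.
# It assumes a conventional 72-bit-wide industry-standard DIMM rank
# (64 data bits + 8 check bits per access); actual codes vary by design.

def rank_layout(bits_per_dram: int, data_bits: int = 64, check_bits: int = 8):
    data_drams = data_bits // bits_per_dram        # DRAMs needed for data
    check_drams = check_bits // bits_per_dram      # DRAMs holding check bits
    # Bits that must be corrected when one whole DRAM fails:
    chipkill_span = bits_per_dram
    return {
        "data DRAMs": data_drams,
        "check DRAMs": check_drams,
        "bits to correct for one dead DRAM": chipkill_span,
        "check bits available": check_bits,
    }

if __name__ == "__main__":
    print("x4:", rank_layout(4))   # 16 data + 2 check DRAMs; a dead DRAM costs 4 bits
    print("x8:", rank_layout(8))   # 8 data + 1 check DRAM; a dead DRAM costs 8 bits
```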
Rather than directly accessing the DIMMs through the processor, some server designs take advantage of
a separate memory buffer. A memory buffer can improve performance and it can also allow servers to
take advantage of two Industry standard DIMMs by extending the ECC word in some sense across two
DIMMs.
Figure: Memory buffer access across two DIMMs (a 64-byte cache line can be filled from two DIMMs, each accessed in a chop burst yielding half the data of a full 8-beat burst, or each read independently and managed by the memory buffer to fill the 64-byte cache line; combined, the two DIMMs supply 128 bits of data per beat plus up to 16 additional bits for checking; the path elements shown are the memory controller, memory bus driver/receiver, and memory buffer)
Hence across two DIMMs with x4 DRAM modules, a single failed DRAM module can be tolerated and then another, which might loosely be called double-Chipkill, though this is not a precise term. With x8 DRAMs, at least a single Chipkill event across the two DIMMs can be corrected once the fault has been identified and the bad DRAM marked.
In system designs where only 64-byte cache lines are filled, there can be a performance penalty associated with running in 128-byte mode across two DIMMs, since the cache lines are still 64 bytes and the DIMMs are still designed to access 64 bytes in a burst. Hence in newer designs, 128-byte ECC word access may only be switched to after the first Chipkill event.
IBM does not believe that using DIMMs without Chipkill capability is appropriate in this space, so x8
DIMMs are not offered.
Figure: POWER8 DCM memory design, with 8 memory buses supporting 8 DIMMs:
• Memory controller: supports a 128-byte cache line; hardened “stacked” latches for soft error protection; a replay buffer to retry after soft internal faults; special uncorrectable error handling for solid faults.
• Memory bus: CRC protection with recalibration and retry on error.
• 16 GB DIMM: 4 ports of memory, with 10 x8 DRAMs attached to each port – 8 needed for data, 1 needed for error correction coding, and 1 additional spare. Bits used for data and for ECC are spread across 9 DRAMs to maximize error correction capability.
• 2 ports are combined to form a 128-bit ECC word, and 8 reads fill a processor cache line; the second port pair can be used to fill a second cache line (much like having 2 DIMMs under one memory buffer but housed in the same physical DIMM).
• Protection: can handle at least 2 bad x8 DRAM modules in every group of 18 (3 if not all 3 failures are on the same sub-group of 9).
Figure: Active Memory Mirroring for the Hypervisor (hypervisor memory is mirrored in LMBs across two memory controllers; writes go to each side, and reads alternate between sides, or come from one side only when a DIMM fails)
By selectively mirroring only the segments used by the hypervisor, this protection is provided without the
need to mirror large amounts of memory.
It should be noted that the PowerVM design is a distributed one. PowerVM code can execute at times on
each processor in a system and can reside in small amounts in memory anywhere in the system.
Accordingly, the selective mirroring approach is fine grained enough not to require the hypervisor to sit in
any particular memory DIMMs. This provides the function while not compromising the hypervisor
performance as might be the case if the code had to reside remotely from the processors using a
hypervisor service.
Voltage Regulation
There are many different designs that can be used for supplying power to components in a system.
As described above, power supplies may take alternating current (AC) from a data center power source,
and then convert that to a direct current voltage level (DC).
Modern systems are designed using multiple components, not all of which use the same voltage level.
Possibly a power supply can provide multiple different DC voltage levels to supply all the components in a system. Failing that, it may supply a voltage level (e.g., 12V) to voltage regulators, which then convert to the proper voltage levels needed for each system component (e.g., 1.6 V, 3.3 V, etc.). Such voltage regulators work to maintain voltage levels within the tight specifications required for the modules they supply.
Typically, a voltage regulator module (VRM) has some common logic plus a component or set of
components (called converters, channels, or phases). At minimum, a VRM provides one converter (or
phase) that provides the main function of stepped-down voltage, along with some control logic.
Depending on the output load required, however, multiple phases may be used in tandem to provide that
voltage level.
If the number of phases provided is just enough for the load it is driving, the failure of a single phase can
lead to an outage. This can be true even when the 12V power supplies are redundant. Therefore,
additional phases may be supplied to prevent the failure due to a single-phase fault. Additional phases
3 https://www.ibm.com/support/knowledgecenter/en/POWER9/p9ia9/p9ia9_signatures_keys.htm
Figure: Availability maximized using redundant I/O drawers (each DCM connects through I/O attach cards and x8 links to PCIe switches in Gen3 I/O Drawer 1 and Gen3 I/O Drawer 2; redundant LAN and SAN adapters are distributed across the drawers, with additional slots and integrated I/O also available; both the virtual and physical views of the adapters are shown)
Planned Outages
Unplanned outages of systems and applications are typically very disruptive to applications. This is
certainly true of systems running standalone applications, but is also true, perhaps to a somewhat lesser
extent, of systems deployed in a scaled-out environment where the availability of an application does not
entirely depend on the availability of any one server. The impact of unplanned outages on applications in
both such environments is discussed in detail in the next section.
Planned outages, where the end-user picks the time and place where applications must be taken off-line
can also be disruptive. Planned outages can be of a software nature – for patching or upgrading of
applications, operating systems, or other software layers. They can also be for hardware, for
reconfiguring systems, upgrading or adding capacity, and for repair of elements that have failed but have
not caused an outage because of the failure.
If all hardware failures required planned downtime, then the downtime associated with planned outages in an otherwise well-designed system would far outpace outages due to unplanned causes.
While repair of some components cannot be accomplished with workload actively running in a system,
design capabilities to avoid other planned outages are characteristic of systems with advanced RAS
capabilities. These may include:
Concurrent Repair
When redundancy is incorporated into a design, it is often possible to replace a component in a system
without taking the entire system down.
As examples, Enterprise Power Systems support concurrently removeable and replaceable elements
such as power supplies and fans.
In addition, Enterprise Power Systems as well as POWER9 processor-based 1s and 2s systems support
concurrently removing and replacing I/O adapters according to the capabilities of the OS and
applications.
Integrated Sparing
As previously mentioned, to reduce replacements for components that cannot be removed and replaced
without taking down a system, Power Systems strategy includes the use of integrated spare components
that can be substituted for failing ones.
Figure: Live Partition Mobility environment (two servers, each with its own infrastructure and service processor, managed through a Hardware Management Console (HMC))
In simplified terms, LPM typically works in an environment where all the I/O from one partition is
virtualized through PowerVM and VIOS and all partition data is stored in a Storage Area Network (SAN)
accessed by both servers.
To migrate a partition from one server to another, a partition is identified on the new server and
configured to have the same virtual resources as the primary server including access to the same logical
volumes as the primary using the SAN.
When an LPM migration is initiated on a server for a partition, PowerVM begins the process of
dynamically copying the state of the partition on the first server to the server that is the destination of the
migration.
Thinking in terms of using LPM for hardware repairs, if all the workloads on a server are migrated by LPM
to other servers, then after all have been migrated, the first server could be turned off to repair
components.
LPM can also be used for doing firmware upgrades or adding additional hardware to a server when the
hardware cannot be added concurrently in addition to software maintenance within individual partitions.
Minimum Configuration
For detailed information on how LPM can be configured the following references may be useful: An IBM
Redbook titled: IBM PowerVM Virtualization Introduction and Configuration4 as well as the document
Live Partition Mobility 5
In general terms, LPM requires that both the system containing the partition to be migrated and the system it is being migrated to have a local LAN connection using a virtualized LAN adapter. In addition, LPM requires that all systems in the LPM cluster be attached to the same SAN. If a single HMC is used to manage both systems in the cluster, connectivity to the HMC also needs to be provided by an Ethernet connection to each service processor.
The LAN and SAN adapters used by the partition must be assigned to a Virtual I/O Server, and the partition's access to these would be through virtual LAN (vLAN) and virtual SCSI (vSCSI) connections within each partition to the VIOS.
I/O Redundancy Configurations and VIOS
LPM connectivity in the minimum configuration discussion is vulnerable to a number of different hardware
and firmware faults that would lead to the inability to migrate partitions. Multiple paths to networks and
SANs are therefore recommended. To accomplish this, Virtual I/O servers (VIOS) can be used.
VIOS as an offering for PowerVM virtualizes I/O adapters so that multiple partitions will be able to utilize
the same physical adapter. VIOS can be configured with redundant I/O adapters so that the loss of an
adapter does not result in a permanent loss of I/O to the partitions using the VIOS.
Externally to each system, redundant hardware management consoles (HMCs) can be utilized for greater
availability. There can also be options to maintain redundancy in SANs and local network hardware.
Figure: Live Partition Mobility environment with redundant paths (two servers, each with its own infrastructure and service processor, managed through redundant Hardware Management Consoles (HMCs))
Figure generally illustrates multi-path considerations within an environment optimized for LPM.
Within each server, this environment can be supported with a single VIOS. However, if a single VIOS is
used and that VIOS terminates for any reason (hardware or software caused) then all the partitions using
that VIOS will terminate.
4 Mel Cordero, Lúcio Correia, Hai Lin, Vamshikrishna Thatikonda, Rodrigo Xavier, Sixth Edition published June 2013.
5 IBM, 2018, ftp://ftp.software.ibm.com/systems/power/docs/hw/p9/p9hc3.pdf
Figure: Logical partition with redundant LAN and Fibre Channel adapters
Since each VIOS can largely be considered an AIX-based partition, each VIOS also needs the ability to access a boot image, paging space, and so forth under a root volume group, or rootvg. The
rootvg can be accessed through a SAN, the same as the data that partitions use. Alternatively, a VIOS
can use storage locally attached to a server, either DASD devices or SSD drives such as the internal
NVMe drives provided for the Power E980 and Power E950 systems. For best availability, the rootvgs
should use mirrored or other appropriate RAID drives with redundant access to the devices.
PowerVC™ and Simplified Remote Restart
PowerVC is an enterprise virtualization and cloud management offering from IBM that streamlines virtual
machine deployment and operational management across servers. The IBM Cloud PowerVC Manager
edition expands on this to provide self-service capabilities in a private cloud environment; IBM offers a
Redbook that provides a detailed description of these capabilities. As of the time of this writing: IBM
PowerVC Version 1.3.2 Introduction and Configuration6 describes this offering in considerable detail.
Deploying virtual machines on systems with the RAS characteristics previously described will best
leverage the RAS capabilities of the hardware in a PowerVC environment. Of interest in this availability
discussion is that PowerVC provides a virtual machine remote restart capability, which provides a means
of automatically restarting a VM on another server in certain scenarios (described below).
Systems with a Hardware Management Console (HMC) may also choose to leverage a simplified remote
restart capability (SRR) using the HMC.
Error Detection in a Failover Environment
The conditions under which a failover is attempted It is important when talking about any sort of failover
scenario. Some remote restart capabilities, for example, operate only after an error management system,
e.g., an HMC reports that a partition is in an Error or Down State.
6 January 2017, International Technical Support Organization, Javier Bazan Lazcano and Martin Parrella
Introduction
All of the previous sections in this document discussed server specific RAS features and options. This
section looks at the more general concept of RAS as it applies to any system in the data center. The goal
is to briefly define what RAS is and look at how reliability and availability are measured. It will then
discuss how these measurements may be applied to different applications of scale-up and scale-out
servers.
RAS Defined
Mathematically, reliability is defined in terms of how infrequently something fails.
At a system level, availability is about how infrequently failures cause workload interruptions. The longer
the interval between interruptions, the more available a system is.
Serviceability is all about how efficiently failures are identified and dealt with, and how application outages
are minimized during repair.
Broadly speaking, systems can be categorized as “scale-up” or “scale-out” depending on the impact to applications or workload of a system being unavailable.
True scale-out environments typically spread workload among multiple systems so that the impact of a
single system failing, even for a short period of time is minimal.
In scale-up systems, the impact of a server taking a fault, or even a portion of a server (e.g., an individual
partition) is significant. Applications may be deployed in a clustered environment so that extended
outages can in a certain sense be tolerated (e.g., using some sort of fail-over to another system) but even
the amount of time it takes to detect the issue and fail-over to another device is deemed significant in a
scale-up system.
Reliability Modeling
The prediction of system level reliability starts with establishing the failure rates of the individual
components making up the system. Then using the appropriate prediction models, the component level
failure rates are combined to give us the system level reliability prediction in terms of a failure rate.
In literature, however, system level reliability is often discussed in terms of Mean Time Between Failures (MTBF) for repairable systems rather than a failure rate, for example, a 50-year Mean Time Between Failures. A 50-year MTBF may suggest that a system will run 50 years between failures, but it really means that, given 50 identical systems, on average one will fail per year over a large population of systems.
The following illustration explains roughly how to bridge from individual component reliability to system
reliability terms with some rounding and assumptions about secondary effects:
Figure: Bridging from component reliability to system reliability (with some rounding and assumptions about secondary effects). Ideally, systems are composed of multiple parts, each with a very small failure rate that may vary over time: for example, roughly 1 part in 100,000 failing per quarter for some time (the steady-state failure rate), then an increasing rate of failures until the component in use is retired (the wear-out rate); this is typically described as a bathtub curve. A FIT (Failure In Time) is 1 failure per 1 billion hours; roughly 100 FITs corresponds to a failure rate of 1 failure per 1000 systems per year, and Mean Time Between Failures is the inverse of the failure rate (a 1000-year MTBF). The system failure rate is the sum of the component FIT rates; in the example, component rates of 100 + 50 + 70 + 30 + 200 + 50 + 30 + 170 + 50 + 250 FITs total 1,000 FITs for the system, which equates to 10 failures per 1000 systems per year, or a 100-year MTBF.
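Written out as formulas, the bridge from component FIT rates to a system MTBF in the example above looks roughly like this (using the same rounding as the illustration):

```latex
% Bridging component FIT rates to a system MTBF, using the example figures above.
\begin{align*}
\lambda_{\text{system}} &= \sum_i \lambda_i
  = 100 + 50 + 70 + 30 + 200 + 50 + 30 + 170 + 50 + 250 = 1000~\text{FITs},\\
1000~\text{FITs} &= \frac{1000~\text{failures}}{10^{9}~\text{hours}}
  \approx \frac{10~\text{failures}}{1000~\text{system-years}}
  \quad(1~\text{year}\approx 8766~\text{hours}),\\
\text{MTBF} &= \frac{1}{\lambda_{\text{system}}}
  \approx \frac{1000~\text{system-years}}{10~\text{failures}} = 100~\text{years}.
\end{align*}
```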
Service Costs
It is common for software costs in an enterprise to include the cost to acquire or license code and a
recurring cost for software maintenance. There is also a cost associated with acquiring the hardware and
a recurring cost associated with hardware maintenance – primarily fixing systems when components
break.
Individual component failure rates, the cost of the components, and the labor costs associated with repair
can be aggregated at the system level to estimate the direct maintenance costs expected for a large
population of such systems.
The more reliable the components the less the maintenance costs should be.
Design for reliability can not only help with maintenance cost to a certain degree but also with the initial
cost of a system as well – in some cases. For example, designing a processor with extra cache capacity,
data lanes on a bus or so forth can make it easier to yield good processors at the end of a manufacturing
process as an entire module need not be discarded due to a small flaw.
At the other extreme, designing a system with an entire spare processor socket could significantly
decrease maintenance cost by not having to replace anything in a system should a single processor
module fail. However, each system will incur the costs of a spare processor for the return of avoiding a
repair in the small proportion of those that need repair. This is usually not justified from a system cost
perspective. Rather it is better to invest in greater processor reliability.
On scale-out systems redundancy is generally only implemented on items where the cost is relatively low,
and the failure rates expected are relatively high – and in some cases where the redundancy is not
complete. For example, power supplies and fans may be considered redundant in some scale-out
systems because when one fails, operation will continue. However, depending on the design, when a
component fails, fans may have to be run faster, and performance-throttled until repair.
On scale-up systems redundancy that might even add significantly to maintenance costs is considered
worthwhile to avoid indirect costs associated with downtime, as discussed below.
Measuring Availability
Mathematically speaking, availability is often expressed as a percentage of the time something is
available or in use over a given period of time. An availability number for a system can be mathematically
calculated from the expected reliability of the system so long as both the mean time between failures and
the duration of each outage is known.
For example, consider a system that always runs exactly one week between failures and each time it fails, it is down for 10 minutes. Of the 168 hours in a week, the system is down (10/60) of an hour and up 168 hrs minus (10/60) hrs. As a percentage of the hours in the week, it can be said that the system is ((168 - (10/60))/168) * 100% ≈ 99.9% available.
99.999% available means approximately 5.3 minutes down in a year. On average, a system that failed
once a year and was down for 5.3 minutes would be 99.999% available. This is often called 5 9’s of
availability.
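The same relationships can be written compactly as follows, where MTBF is the mean time between failures and MDT is the mean downtime per failure; the two lines reproduce the weekly example and the 5 9's example above:

```latex
% Availability from mean time between failures (MTBF) and mean downtime (MDT).
\begin{align*}
A &= \frac{\text{uptime}}{\text{total time}} = \frac{\text{MTBF}}{\text{MTBF} + \text{MDT}},\\
\text{Weekly example:}\quad A &\approx \frac{168 - \tfrac{10}{60}}{168}\times 100\% \approx 99.9\%,\\
\text{Five nines:}\quad A &\approx \frac{525{,}600 - 5.3}{525{,}600}\times 100\% \approx 99.999\%
  \quad(\text{one failure per year, about 5.3 minutes down}).
\end{align*}
```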
When talking about modern server hardware availability, short weekly failures like in the example above are not the norm. Rather, the failure rates are much lower and the mean time between failures (MTBF) is often measured in terms of years – perhaps more years than a system will be kept in service.
Therefore, when a MTBF of 10 years, for example, is quoted, it is not expected that on average each
system will run 10 years between failures. Rather it is more reasonable to expect that on average, in a
given year, one server out of ten will fail. If a population of ten servers always had exactly one failure a
year, a statement of 99.999% availability across that population of servers would mean that the one
server that failed would be down about 53 minutes when it failed.
In theory, 5 9’s of availability can be achieved by having a system design which fails frequently, multiple times a year, but whose failures are limited to very small periods of time. Conversely, 5 9’s of availability might mean a server design with a very large MTBF, but where a given server takes a fairly long time to recover from the very rare outage.
Figure: 5 9's of availability (average time down per failure, in minutes, plotted against mean time between failures, in years)
The figure above shows that 5 9’s of availability can be achieved with systems that fail frequently for
miniscule amounts of time, or very infrequently with much larger downtime per failure.
The figure is misleading in the sense that servers with low reliability are likely to have many components
that, when they fail, take the system down and keep the system down until repair. Conversely servers
designed for great reliability often are also designed so that the systems, or at least portions of the
system can be recovered without having to keep a system down until repaired.
Hence, on the surface, systems with a low MTBF would be expected to have longer repair times, and a system with 5 9’s of availability would therefore be synonymous with a high level of reliability.
However, in quoting an availability number, there needs to be a good description of what is being quoted.
Is it only concerning unplanned outages that take down an entire system? Is it concerning just hardware
faults, or are firmware, OS and application faults considered?
Are applications even considered? If they are, if multiple applications are running on the server, is each
application outage counted individually? Or does one event causing multiple application outages count as
a single failure?
If there are planned outages to repair components either delayed after an unplanned outage, or
predictively, is that repair time included in the unavailability time? Or are only unplanned outages
considered?
Perhaps most importantly when reading that a certain company achieved 5 9’s of availability for an
application - is knowing if that number counted application availability running in a standalone
environment? Or was that a measure of application availability in systems that might have a failover
capability.
| Outage Reason | Mean Time to Outage (Years) | Recovery Activities Needed | Recovery (minutes/incident) | Total Minutes Down Per Year | Associated Availability |
| Fault limited to application | 3 | x | 7.00 | 2.33 | 99.99956% |
| Fault causing OS crash | 10 | x x | 11.00 | 1.10 | 99.99979% |
| Fault causing hypervisor crash | 80 | x x x | 16.00 | 0.20 | 99.99996% |
| Fault impacting system (crash) but system recovers on reboot with enough resources to restart application | 80 | x x x x | 26.00 | 0.33 | 99.99994% |
| Planned hardware repair for hw fault (where initial fault impact could be any of the above) | 70 | x x x x x | 56.00 | 0.80 | 99.99985% |
This is not intended to represent any given system. Rather, it is intended to illustrate how different
outages have different impacts. An application crash only requires that the crash be discovered and the
application restarted. Hence there is only an x in the column for the 7-minute application restart and
recovery time.
If an application is running under an OS and the OS crashes, then the total recovery time must include
the time it takes to reboot the OS plus the time it takes to detect the fault and recover the application after
the OS reboots. In the example with an x in each of the first two columns the total recovery time is 11
minutes (4 minutes to recover the OS and 7 for the application.)
The worst-case scenario as described in the previous section is a case where the fault causes a system
to go down and stay down until repaired. In the example, with an x in all the recovery activities columns,
that would mean 236 minutes of recovery for each such incident.
In the hypothetical example, the numbers were chosen to illustrate 5 9's of availability across a population
of systems.
This required that the worst-case outage scenarios be extremely rare compared to the application-only
outages.
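The per-row arithmetic of the table can be reproduced with a short sketch (the outage reasons, mean times to outage, and recovery times are simply the hypothetical figures used above):

    # Reproduce the standalone-example arithmetic: minutes down per year and availability.
    MINUTES_PER_YEAR = 365 * 24 * 60    # 525,600

    rows = [
        # (outage reason, mean time to outage in years, total recovery minutes per incident)
        ("Fault limited to application",     3,  7.0),
        ("Fault causing OS crash",           10, 11.0),
        ("Fault causing hypervisor crash",   80, 16.0),
        ("System crash, recovers on reboot", 80, 26.0),
        ("Planned hardware repair",          70, 56.0),
    ]

    total_down_per_year = 0.0
    for reason, mtbf_years, recovery_minutes in rows:
        down_per_year = recovery_minutes / mtbf_years
        total_down_per_year += down_per_year
        pct = (1 - down_per_year / MINUTES_PER_YEAR) * 100
        print(f"{reason}: {down_per_year:.2f} min/year, {pct:.5f}% available")

    # The total stays under the roughly 5.26 minutes per year that 5 9's allows.
    print(f"Total: {total_down_per_year:.2f} minutes down per year")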
In addition, the example presumed that:
1. All the software layers can recover reasonably efficiently even from entire system
crashes.
2. There were no more than a reasonable number of application-driven and operating-
system-driven outages.
3. A very robust hypervisor is used, expecting it to be considerably more robust than the
application hosting OS.
4. Exceptionally reliable hardware is used. (The example presumes an MTBF of about 70
years for hardware faults.)
A similar clustered-system example shows the unavailability associated with various failure types.
However, it presumes that application recovery occurs by failing over from one system to another. Hence
the recovery time for any of the outages is limited to the time it takes to detect the fault and fail-over and
recover on another system. This minimizes the impact of faults that in the standalone case, while rare,
would lead to extended application outages.
The example suggests that fail-over clustering can extend availability beyond what would be achieved in
the standalone example.
The example presumes somewhat longer recovery for the non-enterprise hardware due to the other kinds
of real-world conditions described in terms of parts acquisition, error detection/fault isolation (ED/FI) and
so forth.
Clustering Resources
One of the obvious disadvantages of running in a clustered environment, as opposed to a standalone
system environment, is the need for additional hardware to accomplish the task.
An application running full-throttle on one system, prepared to failover on another, needs to have a
comparable capability (available processor cores, memory and so forth) on that other system.
There does not need to be exactly one backup server for every server in production, however. If multiple
servers are used to run workloads, then only a single backup system with enough capacity to handle the
workload of any one server might be deployed.
Alternatively, if multiple partitions are consolidated on multiple servers, then, presuming that no server is
fully utilized, fail-over might be planned so that the partitions of a failing server restart on multiple
different servers.
When an enterprise has sufficient workload to justify multiple servers, either of these options reduces the
overhead for clustering.
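As a rough sketch of this capacity trade-off (a hypothetical example; the server count and utilization figures are invented for illustration), the following checks whether the surviving servers in a small cluster have enough headroom to absorb the partitions of any one failed server:

    # Hypothetical cluster sizing check: can any single failed server's workload be
    # absorbed by the spare capacity of the surviving servers?
    server_capacity = 100           # arbitrary capacity units per server
    utilization = [70, 65, 60, 55]  # current load on each of four servers

    def can_absorb_any_single_failure(capacity, loads):
        for failed, failed_load in enumerate(loads):
            spare = sum(capacity - load for i, load in enumerate(loads) if i != failed)
            if spare < failed_load:
                return False        # not enough headroom if this server fails
        return True

    print(can_absorb_any_single_failure(server_capacity, utilization))  # True for these numbers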
[Figure: Clustered configurations in which two systems, each with an application layer, hypervisor/virtualization, computational hardware, infrastructure, and virtualized I/O servers providing LAN and SAN adapters, share storage through a SAN.]
(Here, "down" and availability refer to the service period provided, not the customer application.)
In understanding a service level agreement, what the "availability of the service" means is critical to
understanding the SLA.
Presuming the service is a virtual machine consisting of certain resources (processor cores/memory)
these resources would typically be hosted on a server. Should a failure occur which terminates
applications running on the virtual machine, depending on the SLA, the resources could be switched to a
different server.
If switching the resources to a different server takes no more than 4.38 minutes and there is no more than
a single failure in a month, then the SLA of 99.99% would be met for the month.
However, such an SLA might take no account of how disruptive the failure might be to the application.
While the service may be down for only a few minutes, it could take the better part of an hour or longer to
restore the application being hosted. While the SLA may say that the service achieved 99.99% availability
in such a case, application availability could be far less.
Consider the case of an application hosted on a virtual machine (VM) with a 99.99% availability target for
the VM. To achieve the SLA, the VM would need to be restored in no more than about 4.38 minutes. This
typically means being restored to a backup system.
If the application takes 100 minutes to recover after a new VM is made available (for example), the
application availability would be more like 99.76% for that month.
Outage Reasons | Downtime in Minutes | Actual Availability for the Month | Downtime if One Outage in a Year (Minutes) | Actual Availability for That Year | Downtime if 12 Outages in a Year (Minutes) | Actual Availability for the Year
Service outage | 4.38 | - | 4.38 | - | 52.56 | -
Downtime for application to restore on new virtual machine | 100.00 | - | 100.00 | - | 1200.00 | -
Total downtime (sum of service outage + application recovery) | 104.38 | 99.76% | 104.38 | 99.98% | 1252.56 | 99.76%
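A minimal sketch of the arithmetic in the table above (the 4.38-minute service outage and 100-minute application recovery are the hypothetical figures from this example):

    # Service availability versus application availability (hypothetical figures).
    MINUTES_PER_YEAR = 365 * 24 * 60     # 525,600
    MINUTES_PER_MONTH = MINUTES_PER_YEAR / 12

    service_outage = 4.38                # minutes to switch the VM to another server
    app_recovery = 100.0                 # minutes for the application to restore on the new VM
    total_outage = service_outage + app_recovery

    def availability_pct(down_minutes, period_minutes):
        return (1 - down_minutes / period_minutes) * 100

    print(f"{availability_pct(total_outage, MINUTES_PER_MONTH):.2f}%")      # ~99.76% for that month
    print(f"{availability_pct(total_outage, MINUTES_PER_YEAR):.2f}%")       # ~99.98% if the only outage that year
    print(f"{availability_pct(12 * total_outage, MINUTES_PER_YEAR):.2f}%")  # ~99.76% with one outage every month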
If that were the only outage in the year, the availability across the year would be around 99.98%.
The SLA, however, could permit a single outage every month.
In such a case, with the application downtime far greater than the service outage time, even an SLA of
99.99% availability, or 4.38 minutes per month, will prove disruptive to critical applications, even in a
cloud environment.
The less available the server is, the more frequently client applications may need to be restarted. These
more frequent restarts mean clients do not have access to their applications, and the effect is
compounded over time.
In such situations the importance of using enterprise-class servers for application availability can't be
understood just by looking at a monthly service level agreement.
To summarize what was stated previously, it is difficult to compare estimates or claims of availability
without understanding specifically:
1. What kind of failures are included (unplanned hardware only, or entire stack)?
2. What is the expected mean time between failures and how is it computed (monthly SLA,
an average across multiple systems over time, etc.)?
3. What is done to restore compute facilities in the face of a failure, and how is recovery
time computed?
4. What is expected of both hardware and software configuration to achieve the availability
targets?
5. And for actual application availability, what is the recovery time of the application for
each of the failure scenarios?
The serviceability features delivered in IBM Power systems ensure a highly efficient service environment
by incorporating the following attributes:
Service Environment
In the PowerVM environment, the HMC is a dedicated server that provides functions for configuring and
managing servers, for either logically partitioned or full-system partition operation, using a GUI,
command-line interface (CLI), or REST API. An HMC attached to the system enables support personnel
(with client authorization) to log in remotely or locally to review error logs and perform remote maintenance
if required.
• HMC Attached - one or more HMCs or vHMCs are supported by the system with PowerVM. This
is the default configuration for servers supporting logical partitions with dedicated or virtual I/O. In
this case, all servers have at least one logical partition.
• HMC less - There are two service strategies for non-HMC managed systems.
1. Full-system partition with PowerVM: A single partition owns all the server resources
and only one operating system may be installed. The primary service interface is through
the operating system and the service processor.
2. Partitioned system with NovaLink: In this configuration, the system can have more
than one partition and can be running more than one operating system. The primary
service interface is through the service processor.
Service Interface
Support personnel can use the service interface to communicate with the service support applications in a
server using an operator console, a graphical user interface on the management console or service
processor, or an operating system terminal. The service interface helps to deliver a clear, concise view of
available service applications, helping the support team to manage system resources and service
information in an efficient and effective way. Applications available through the service interface are
carefully configured and placed to give service providers access to important service functions.
Different service interfaces are used, depending on the state of the system, hypervisor, and operating
environment. The primary service interfaces are:
In the light path LED implementation, the system can clearly identify components for replacement by
using specific component-level LEDs and can also guide the servicer directly to the component by
signaling (turning on solid) the enclosure fault LED, and component FRU fault LED. The servicer can
also use the identify function to blink the FRU-level LED. When this function is activated, a roll-up to the
blue enclosure identify LED will occur to identify an enclosure in the rack. These enclosure LEDs will turn
on solid and can be used to follow the light path from the enclosure and down to the specific FRU in the
PowerVM environment.
First Failure Data Capture and Error Data Analysis
FFDC information, error data analysis, and fault isolation are necessary to implement the advanced
serviceability techniques that enable efficient service of the systems and to help determine the failing
items.
In the rare absence of FFDC and Error Data Analysis, diagnostics are required to re-create the failure and
determine the failing items.
Diagnostics
General diagnostic objectives are to detect and identify problems so they can be resolved quickly.
Elements of IBM's diagnostics strategy are to:
• Provide a common error code format equivalent to a system reference code with PowerVM,
system reference number, checkpoint, or firmware error code.
• Provide fault detection and problem isolation procedures. Support remote connection capability
that can be used by the IBM Remote Support Center or IBM Designated Service.
• Provide interactive intelligence within the diagnostics, with detailed online failure information,
while connected to IBM's back-end system.
Automated Diagnostics
The processor and memory FFDC technologies are designed to perform without the need for problem re-
creation or user intervention. The firmware runtime diagnostics code leverages these hardware fault
isolation facilities to accurately determine system problems and to take the appropriate actions. Most
solid and intermittent errors can be correctly detected and isolated at the time the failure occurs, whether
during run time or boot time. In the few situations where automated system diagnostics cannot decipher
the root cause of an issue, service support intervention is required.
Stand-alone Diagnostics
As the name implies, stand-alone or user-initiated diagnostics requires user intervention. The user must
perform manual steps, which may include:
Concurrent Maintenance
The determination of whether a firmware release can be updated concurrently is identified in the readme
information file that is released with the firmware. An HMC is required for concurrent firmware updates
with PowerVM. In addition, as discussed in more detail in other sections of this document, concurrent
maintenance of PCIe adapters and NVMe drives is supported with PowerVM. Power supplies, fans, and
the operator panel LCD are hot-pluggable as well.
Service Labels
Service providers use these labels to assist them in performing maintenance actions. Service labels are
found in various formats and positions and are intended to transmit readily available information to the
servicer during the repair process. Following are some of these service labels and their purpose:
• Location diagrams: Location diagrams are located on the system hardware, relating information
regarding the placement of hardware components. Location diagrams may include location
codes, drawings of physical locations, concurrent maintenance status, or other data pertinent to a
repair. Location diagrams are especially useful when multiple components such as DIMMs,
processors, fans, adapter cards, and power supplies are installed.
• Remove/replace procedures: Service labels that contain remove/replace procedures are often
found on a cover of the system or in other spots accessible to the servicer. These labels provide
systematic procedures, including diagrams detailing how to remove or replace certain serviceable
hardware components.
• Arrows: Numbered arrows are used to indicate the order of operation and the serviceability
direction of components. Some serviceable parts such as latches, levers, and touch points need
to be pulled or pushed in a certain direction and in a certain order for the mechanical mechanisms
to engage or disengage. Arrows generally improve the ease of serviceability.
QR Labels
QR labels are placed on the system to provide access to key service functions through a mobile device.
When the QR label is scanned, it directs the user to a landing page for Power10 processor-based systems.
The landing page contains links to the service functions for each machine type and model (MTM) and is
useful to a servicer or operator physically located at the machine. The service functions include things
such as installation and repair instructions, reference code lookup, and so on.
• Color coding (touch points): Blue-colored touch points indicate where a component can be safely
handled for service actions such as removal or installation.
• Tool-less design: Selected IBM systems support tool-less or simple-tool designs. These designs
require no tools, or only simple tools such as flat-head screwdrivers, to service the hardware
components.
When an HMC is attached in the PowerVM environment, an ELA routine analyzes the error, forwards the
event to the Service Focal Point (SFP) application running on the HMC, and notifies the system
administrator that it has isolated a likely cause of the system problem. The service processor event log
also records unrecoverable checkstop conditions, forwards them to the SFP application, and notifies the
system administrator.
The system has the ability to call home through the operating system to report platform-recoverable errors
and errors associated with PCIe adapters/devices.
In the HMC-managed environment, a call home service request will be initiated from the HMC and the
pertinent failure data with service parts information and part locations will be sent to an IBM service
organization. Customer contact information and specific system related data such as the machine type,
model, and serial number, along with error log data related to the failure, are sent to IBM Service.
Call Home
Call home refers to an automatic or manual call from a client location to the IBM support structure with
error log data, server status, or other service-related information. Call home invokes the service
organization in order for the appropriate service action to begin. Call home can be done through the
Electronic Service Agent (ESA) embedded in the HMC, through a version of ESA embedded in the
operating systems for non-HMC-managed systems, or through a version of ESA that runs as a standalone
call home application. While configuring call home is optional, clients are encouraged to implement this
feature in order to obtain service enhancements such as reduced problem determination time and faster,
potentially more accurate transmittal of error information. In general, using the call home feature can
result in increased system availability. See the next section for specific details on this application.
Benefits of ESA
• Increased Uptime: ESA is designed to enhance warranty and maintenance service by
potentially providing faster hardware error reporting and uploading system information to IBM
Support. This can reduce the time spent monitoring symptoms, diagnosing the error, and manually
calling IBM Support to open a problem record. In addition, 24x7 monitoring and reporting means
no dependency on human intervention or off-hours client personnel when errors are encountered
in the middle of the night.
• Security: The ESA tool is designed to help secure the monitoring, reporting, and storing of the
data at IBM. The ESA tool is designed to help securely transmit through the internet (HTTPS) to
provide clients a single point of exit from their site. Initiation of communication is one way.
Activating ESA does not enable IBM to call into a client's system. For additional information, see
the IBM Electronic Service Agent website.
• More Accurate Reporting: Because system information and error logs are automatically
uploaded to the IBM Support Center in conjunction with the service request, clients are not
required to find and send system information, decreasing the risk of misreported or misdiagnosed
errors. Once inside IBM, problem error data is run through a data knowledge management
system, and knowledge articles are appended to the problem record.
The IBM Client Support Portal provides valuable reports of installed hardware and software using
information collected from the systems by IBM Electronic Service Agent. Reports are available for any
system associated with the customer's IBM ID.
For more information on how to utilize the Client Support Portal, visit the Client Support Portal website or
contact an IBM Systems Services Representative (SSR).
Investing in RAS
Systems designed for RAS may be more costly at the “bill of materials” level than systems with little
investment in RAS.
Some examples as to why this could be so:
In terms of error detection and fault isolation: simplified, at a low level, an 8-bit bus takes a certain
number of circuits. Adding an extra parity bit to detect a single-bit fault adds hardware to the bus. In a
classic Hamming code, 5 bits of check data are required for 15 bits of data to allow single-bit error
correction, and an additional bit is typically added to provide double-bit error detection. Then there is the
logic involved in generating the check bits and checking for and correcting errors.
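As a rough sketch of why the check bits add hardware (a generic Hamming-code calculation, not a description of the ECC actually used on any particular Power bus or memory interface):

    # Minimum Hamming check bits r for m data bits: 2**r >= m + r + 1 gives
    # single-bit error correction (SEC); one extra parity bit adds double-bit
    # error detection (SEC-DED).
    def sec_check_bits(data_bits):
        r = 0
        while 2 ** r < data_bits + r + 1:
            r += 1
        return r

    for m in (8, 15, 64):
        r = sec_check_bits(m)
        print(f"{m} data bits: {r} check bits for SEC, {r + 1} for SEC-DED")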
In some cases, better availability is achieved by having fully redundant components which more than
doubles the cost of the components, or by having some amount of n+1 redundancy or sparing which still
adds costs at a somewhat lesser rate.
In terms of reliability, highly reliable components cost more. This may be true of the intrinsic design and
the materials used, including the design of connectors, fans, and power supplies.
Increased reliability in the way components are manufactured can also increase costs. Extensive test time
in manufacturing and a process to "burn in" parts and screen out weak modules increase costs. The
highest levels of part reliability may be achieved by rejecting entire lots, even the good components in
them, when the overall failure rate for a lot is excessive. All of these increase the costs of the components.
Design for serviceability, especially for concurrent maintenance, is typically more involved than a design
where serviceability is not a concern. This is especially true when designing, for example, for concurrent
maintenance of components like I/O adapters.
Beyond the hardware costs, it takes development effort to code software that takes advantage of the
hardware RAS features, and additional time to test the many "bad-path" scenarios that can be
envisioned.
On the other hand, in all systems, scale-up and scale-out, investing in system RAS has a purpose. Just
as there are recurring costs for software licenses in most enterprise applications, there is a recurring cost
associated with maintaining systems. This includes direct costs, such as the cost of replacement
components and the cost of the labor required to diagnose and repair a system.
The somewhat more indirect costs of poor RAS are often the main reasons for investing in systems with
superior RAS characteristics, and over time these have become even more important to customers. The
importance is often directly related to:
• The importance of discovering errors before relying on faulty data or computation, including the
ability to know when to switch over to redundant or alternate resources.
• The costs associated with downtime to do problem determination or error re-creation, if
insufficient fault isolation is provided in the system.
• The cost of downtime when a system fails unexpectedly or needs to fail over, and an application
is disrupted during the failover process.
• The costs associated with planning an outage for the repair of hardware or firmware, especially
when the repair is not concurrent.
• In a cloud environment, the operations cost of server evacuation.
In a well-designed system, investing in RAS minimizes the need to repair components that are failing.
Systems that recover rather than crash and need repair when certain soft errors occur will minimize the
indirect costs associated with such events. The use of selective self-healing, so that, for example, a
processor fault can be handled with spare or redundant capacity rather than an immediate repair, further
reduces these costs.
Final Word
The POWER9 and Power10 processor-based systems discussed here leverage the long heritage of Power
Systems designed for RAS. The different servers aimed at different scale-up and scale-out environments
provide significant choice in selecting servers geared towards the application environments end users will
deploy. The RAS features in each segment differ, but in each case they provide substantial advantages
compared to designs with less of an up-front RAS focus.