
IBM Power® RAS Whitepaper

Introduction to IBM Power® Reliability, Availability, and Serviceability for POWER9® processor-based
systems using IBM PowerVM™
With Updates covering the latest Power10 processor-based systems

IBM Systems Group


Daniel Henderson, Irving Baysah
Trademarks, Copyrights, Notices and Acknowledgements
Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business
Machines Corporation in the United States, other countries, or both. These and other IBM trademarked
terms are marked on their first occurrence in this information with the appropriate symbol (® or ™),
indicating US registered or common law trademarks owned by IBM at the time this information was
published. Such trademarks may also be registered or common law trademarks in other countries. A
current list of IBM trademarks is available on the Web at http://www.ibm.com/legal/copytrade.shtml
The following terms are trademarks of the International Business Machines Corporation in the United
States, other countries, or both:

Active Memory™, AIX®, POWER®, POWER Hypervisor™, Power Systems™, Power Systems Software™, Power®, POWER6®, POWER7®, POWER7+™, POWER8™, PowerHA®, PowerLinux™, PowerVM®, System x®, System z®, PowerVC™, Power Architecture™
Additional Trademarks may be identified in the body of this document.


Other company, product, or service names may be trademarks or service marks of others.
Notices
The last page of this document contains copyright information, important notices, and other information.
Acknowledgements
While this whitepaper has two principal authors/editors, it is the culmination of the work of a number of
different subject matter experts within IBM who contributed ideas, detailed technical information, and the
occasional photograph and section of description.
These include the following:
Kanisha Patel, Julissa Villarreal, Michael Mueller, George Ahrens, Kanwal Bahri, Marc Gollub, Steven
Gold, Jim O’Connor, K Paul Muller, Ravi A. Shankar, Kevin Reick, Peter Heyrman, Dan Hurlimann, Kaveh
Naderi, Nicole Nett, and Hoa Nguyen.

Table of Contents

Trademarks, Copyrights, Notices and Acknowledgements .................. 2


Trademarks .............................................................................................................................. 2
Notices ..................................................................................................................................... 2
Acknowledgements .................................................................................................................. 2
Table of Contents..................................................................................... 3
Introduction ............................................................................................. 8
Section 1: Overview of POWER9 and Power10 Processor-based systems .................................... 9
Comparative Discussion ......................................................................................................... 9
Figure 1: POWER9/Power10 Server RAS Highlights Comparison ............................................ 9
Figure 2: Power10 Servers RAS Highlights Comparison ......................................................... 11
IBM Power E1080 System ..................................................................................................... 13
System Structure ................................................................................................................... 13
Figure 3: Power E1080 System Structure Simplified View ....................................................... 13
Power E1080 Processor RAS................................................................................................ 14
Power E1080 System Memory .............................................................................................. 14
Figure 4: DDIMM Memory Features.......................................................................................... 14
Buffer ................................................................................................................................................. 15
OMI .................................................................................................................................................... 15
Memory ECC ..................................................................................................................................... 15
Dynamic Row Repair ........................................................................................................................ 15
Spare Temperature Sensors ............................................................................................................. 15
Unique to 4U DDIMM: Spare DRAMs ............................................................................................... 15
Power Management .......................................................................................................................... 15
Power E1080 SMP Interconnections .................................................................................... 16
Power E1080 Processor to Processor SMP Fabric Interconnect Within a CEC Drawer ..... 16
Power E1080 CEC Drawer to CEC Drawer SMP Fabric Interconnect Design .................... 16
Figure 5: SMP Fabric Bus Slice ................................................................................................ 16
Figure 6: Time Domain Reflectometry ...................................................................................... 17
Power E1080 SMP Fabric Interconnect Design within a CEC Drawer ................................. 17
Internal I/O ............................................................................................................................... 18
IBM Power E1050 System ..................................................................................................... 19
System Structure ................................................................................................................... 19
Figure 7: Power E1050 System Structure Simplified View ....................................................... 20
Power E1050 Processor RAS ............................................................................................... 20
Power E1050 Memory RAS ................................................................................................... 21
Figure 8: E950 DIMM vs E1050 DDIMM RAS Comparison...................................................... 21
Power E1050 I/O RAS ........................................................................................................... 22
Figure 9: Power E1050 I/O Slot Assignments........................................................................... 22
DASD Options.................................................................................................................................. 22
Power E1050 Service Processor ........................................................................................... 23

IBM Power S10xx System ..................................................................................................... 24
System Structure ................................................................................................................... 24
Figure 10: Power S1024 System Structure Simplified View ..................................................... 24
Figure 11: Power S1024 System Structure Simplified View With eSCM.................................. 25
Power S10xx Processor RAS ................................................................................................ 26
Power S10xx Memory RAS ................................................................................................... 26
Figure 12: S9xx ISDIMM vs S10xx DDIMM RAS Comparison ................................................. 26
Power S10xx I/O RAS ........................................................................................................... 27
Figure 13: Power S1024 I/O Slot Assignments......................................................................... 27
DASD Options.................................................................................................................................. 27
Power S10xx Service Processor ........................................................................................... 27
IBM Power® E980 ................................................................................................................... 28
Introduction ............................................................................................................................ 28
Figure 14: Power E980 System Structure ................................................................................. 29
Overview Compared to POWER8 ......................................................................................... 29
Across Node Fabric Bus RAS ............................................................................................... 29
Figure 15: POWER9 Two Drawer SMP Cable Connections .................................................... 30
Clocking ................................................................................................................................. 30
Internal I/O ............................................................................................................................. 31
Figure 16: I/O Connections ....................................................................................................... 31
NVMe Drives .......................................................................................................................... 31
Processor and Memory RAS ................................................................................................. 32
Infrastructure RAS Enhancements ........................................................................................ 32
Power® E950........................................................................................................................... 33
Introduction ............................................................................................................................ 33
Figure 17: Power E950 System Illustration (Four Socket System) ........................................... 33
Figure 18: Power E950 Memory Riser ...................................................................................... 34
I/O ........................................................................................................................................... 34
Figure 19: Power E950 I/O Slot Assignments........................................................................... 34
PCIe CAPI 2.0 ........................................................................................................................ 35
DASD Options ........................................................................................................................ 35
Infrastructure .......................................................................................................................... 35
IBM Power® S914, IBM Power® S922, IBM Power® S924 ................................................. 36
Introduction ............................................................................................................................ 36
System Features .................................................................................................................... 36
Figure 20: Simplified View of 2 Socket Scale-Out Server......................................................... 37
I/O Configurations .................................................................................................................. 37
Figure 21: Scale-out Systems I/O Slot Assignments for 1 and 2 Socket Systems................... 37
Infrastructure and Concurrent Maintenance .......................................................................... 38
POWER9: PCIe Gen3 I/O Expansion Drawer ...................................................................... 40
Figure 22: PCIe Gen3 I/O Drawer RAS Features ..................................................................... 40
Figure 23: PCIe Gen3 I/O Expansion Drawer RAS Feature Matrix .......................................... 41

Section 2: General RAS Philosophy and Architecture ........................ 42


Philosophy .............................................................................................................................. 42
Integrate System Design ....................................................................................................... 42
Figure 24: IBM Enterprise System Stacks (POWER8 Design Point) ....................................... 43
Incorporate Experience.......................................................................................................... 43
Architect for Error Reporting and Fault Isolation ................................................................... 43
Leverage Technology and Design for Soft Error Management ............................................ 44

Deploy Strategic Spare Capacity to Self-Heal Hardware ..................................................... 44
Redundant Definition ...................................................................................................................... 45
Spare Definition ............................................................................................................................... 45
Focus on OS Independence .................................................................................................. 46
Build System Level RAS Rather Than Just Processor and Memory RAS ........................... 46
Error Reporting and Handling .............................................................................................. 46
First Failure Data Capture Architecture ................................................................................. 46
Figure 25: Handled Errors Classified by Severity and Service Actions Required .................... 47
Processor Runtime Diagnostics .................................................................................................... 47
Service Diagnostics in the POWER9 Design ............................................................................... 48
PowerVM Partitioning and Outages ...................................................................................... 48
Section 3: POWER9 Subsystems RAS Details ..................................... 50
Processor RAS Details. ......................................................................................................... 50
Figure 26: POWER6 Processor Design Compared to POWER9 ............................................. 50
Hierarchy of Error Avoidance and Handling...................................................................................... 50
Figure 27: Error Handling Methods Highlights .......................................................................... 50
Processor Module Design and Test ...................................................................................... 55
POWER 9 Memory RAS ......................................................................................................... 56
Memory RAS Fundamentals ................................................................................................. 56
Model Comparative Memory RAS .................................................................................................... 56
Figure 28 : Memory Design Points ............................................................................................ 56
Memory Design Basics ..................................................................................................................... 56
Figure 29: Simplified General Memory Subsystem Layout for 64 Byte Processor Cache Line 57
Figure 30: One Method of Filling a Cache Line Using 2 x8 Industry Standard DIMMs ............ 58
POWER Traditional Design............................................................................................................... 58
1s and 2s Scale-out Design .............................................................................................................. 58
Figure 31 : 1s and 2s Memory Illustrated .................................................................................. 59
Power E950 Design Point ................................................................................................................. 59
Power E980 Design Point ................................................................................................................. 60
Figure 32: Memory Subsystem using CDIMMs ....................................................................... 60
Memory RAS Beyond ECC.................................................................................................... 60
Hypervisor Memory Mirroring ............................................................................................................ 60
Figure 33: Active Memory Mirroring for the Hypervisor ............................................................ 61
Dynamic Deallocation/Memory Substitution ..................................................................................... 61
RAS beyond Processors and Memory ................................................................................ 62
Introduction ............................................................................................................................ 62
Serial Failures, Load Capacity and Wear-out ................................................................................... 62
Common Mode Failures .................................................................................................................... 63
Fault Detection/Isolation and Firmware and Other Limitations ......................................................... 63
Power and Cooling Redundancy Details ............................................................................... 64
Power Supply Redundancy ............................................................................................................... 64
Voltage Regulation ............................................................................................................................ 64
Redundant Clocks.................................................................................................................. 65
Internal Cache Coherent Accelerators .................................................................................. 65
Service Processor and Boot Infrastructure ........................................................................... 65
Trusted Platform Module (TPM) ............................................................................................ 66
I/O Subsystem and VIOS™ ................................................................................................... 66
Figure 34: End-to-End I/O Redundancy .................................................................................... 67
PCIe Gen3 Expansion Drawer Redundancy .................................................................................... 67
Figure 35: Maximum Availability with Attached I/O Drawers .................................................... 68
Planned Outages ................................................................................................................... 68
Updating Software Layers ................................................................................................................. 68

Concurrent Repair ............................................................................................................................. 69
Integrated Sparing ............................................................................................................................. 69
Clustering and Cloud Support .............................................................................................. 70
PowerHA SystemMirror ......................................................................................................... 70
Live Partition Mobility ........................................................................................................................ 70
Figure 36: LPM Minimum Configuration ................................................................................... 70
Minimum Configuration ..................................................................................................................... 71
Figure 37: I/O Infrastructure Redundancy ................................................................................. 71
Figure 38: Use of Redundant VIOS .......................................................................................... 72
PowerVC™ and Simplified Remote Restart.......................................................................... 72
Error Detection in a Failover Environment ............................................................................ 72
Section 4: Reliability and Availability in the Data Center ................... 74
The R, A and S of RAS .......................................................................................................... 74
Introduction ........................................................................................................................................ 74
RAS Defined...................................................................................................................................... 74
Reliability Modeling ................................................................................................................ 74
Figure 39: Rough Reliability Cheat Sheet* ............................................................................... 75
Different Levels of Reliability ............................................................................................................. 75
Costs and Reliability .............................................................................................................. 76
Service Costs .................................................................................................................................... 76
End User Costs ................................................................................................................................. 76
Measuring Availability ............................................................................................................ 77
Measuring Availability ....................................................................................................................... 77
Figure 40: Different MTBFs but same 5 9’s of availability ........................................................ 78
Contributions of Each Element in the Application Stack ................................................................... 78
Figure 41: Hypothetical Standalone System Availability Considerations ................................. 80
Critical Application Simplification ...................................................................................................... 81
Measuring Application Availability in a Clustered Environment ............................................ 81
Figure 42: Ideal Clustering with Enterprise-Class Hardware Example ..................................... 82
Recovery Time Caution ..................................................................................................................... 82
Clustering Infrastructure Impact on Availability ................................................................................. 82
Real World Fail-over Effectiveness Calculations .............................................................................. 83
Figure 43: More Realistic Model of Clustering with Enterprise-Class Hardware ...................... 84
Figure 44: More Realistic Clustering with Non-Enterprise-Class Hardware ............................. 84
Reducing the Impact of Planned Downtime in a Clustered Environment ......................................... 85
HA Solutions Cost and Hardware Suitability ......................................................................... 85
Clustering Resources ........................................................................................................................ 85
Figure 45: Multi-system Clustering Option ................................................................................ 86
Using High Performance Systems .................................................................................................... 86
Cloud Service Level Agreements and Availability ................................................................. 87
Figure 46: Hypothetical Service Level Agreement .................................................................... 87
Figure 47: Hypothetical Application Downtime meeting a 99.99% SLA ................................... 88

Section 5: Serviceability....................................................................... 89
Service Environment .............................................................................................................. 89
Service Interface .................................................................................................................... 89
First Failure Data Capture and Error Data Analysis .............................................................. 90
Diagnostics............................................................................................................................. 90
Automated Diagnostics ..................................................................................................................... 90
Stand-alone Diagnostics ................................................................................................................... 90
Concurrent Maintenance ....................................................................................................... 91
Service Labels ....................................................................................................................... 91
QR Labels .............................................................................................................................. 91

Packaging for Service ............................................................................................................ 91
Error Handling and Reporting ................................................................................................ 92
Call Home .............................................................................................................................. 92
IBM Electronic Services ......................................................................................................... 92
Benefits of ESA ................................................................................................................................. 93
Remote Code Load (RCL) ..................................................................................................... 93
Client Support Portal .............................................................................................................. 93
Summary ................................................................................................ 94
Investing in RAS ................................................................................................................................ 94
Final Word ......................................................................................................................................... 95
About the principal authors/editors: ....................................................................................... 95
Notices: .................................................................................................................................. 96

Introduction
Reliability generally refers to the infrequency of system and component failures experienced by a server.
Availability, broadly speaking, is how the hardware, firmware, operating systems and application designs
handle failures to minimize application outages.
Serviceability generally refers to the ability to efficiently and effectively install and upgrade systems
firmware and applications, as well as to diagnose problems and efficiently repair faulty components when
required.
These interrelated concepts of reliability, availability and serviceability are often spoken of as "RAS".
Within a server environment all of RAS, but especially application availability, is really an end-to-end
proposition. Attention to RAS needs to permeate all the aspects of application deployment. However, a
good foundation for server reliability whether in a scale-out or scale-up environment is clearly beneficial.
Systems based on the Power processors are generally known for their design emphasizing Reliability,
Availability and Serviceability capabilities. Previous versions of a RAS whitepaper have been published to
discuss general aspects of the hardware reliability and the hardware and firmware aspects of availability
and serviceability.
The focus of this whitepaper is to introduce the new Power10 processor-based systems and discuss the
POWER9 Processor-based systems using the PowerVM hypervisor. Systems not using PowerVM will not
be discussed specifically in this whitepaper.
This whitepaper is organized into five sections:
Section 1: RAS Overview of key POWER9 and Power10 processor-based systems
An overview of the RAS capabilities of various POWER9 processor-based servers and an overview of the
latest Power10 processor-based systems.
Section 2: General Design Philosophy
A general discussion of Power systems RAS design philosophy, priorities, and advantages.
Section 3: POWER9 Subsystems RAS
A more detailed discussion of each sub-system within a Power server concentrating on the RAS features
of processors, memory, and other components of each system.
Section 4: Reliability and Availability in the Data Center
Discussion of RAS measurements and expectations, including various ways in which RAS may be
described for systems: Mean Time Between Failures, ‘9’s of availability and so forth.
Section 5: Serviceability
Provides descriptions of the error log analysis, call-home capabilities, service environment and service
interfaces of IBM Power servers.

Section 1: Overview of POWER9 and Power10 Processor-
based systems
Comparative Discussion
In September 2021, IBM introduced the first Power system using the Power10 processor: the IBM Power
E1080, a scalable server using multiple four-socket Central Electronics Complex (CEC) drawers.
The Power E1080 system design was inspired by the Power E980 but has enhancements in key areas to
complement the performance capabilities of the Power10 processor.
One key enhancement is an all-new memory subsystem with Differential DIMMs (DDIMMs) that use a
memory buffer connected to the processors through the Open Memory Interface (OMI), a serial interface
capable of higher speeds with fewer lanes than a traditional parallel approach.
Another enhancement is the use of passive external and internal cables for the fabric busses that connect
processors between drawers, eliminating the routing of those signals through the CEC backplane. This
contrasts with the POWER9 approach, where signals were routed through a backplane and the external
cables were active. This design point significantly reduces the likelihood that the labor-intensive and
costly replacement of the main system backplane will be needed.
Another change of note from a reliability standpoint is that the processor clock design, while still
redundant in the Power E1080 system, has been simplified because it is no longer required that each
processor module within a CEC drawer be synchronized with the others.

Figure 1: POWER9/Power10 Server RAS Highlights Comparison

Columns: POWER9 1s and 2s Systems^ | POWER9 IBM Power System E950 | POWER9 IBM Power System E980 | Power10 IBM Power E1080

Base POWER9™ Processor RAS features, including:
• First Failure Data Capture
• Processor Instruction Retry
• L2/L3 Cache ECC protection with cache line-delete
• Power/cooling monitor function integrated into on-chip controllers of processors
Yes* | Yes | Yes | Yes

POWER9 Enterprise RAS Features:
• Core Checkstops
No | Yes | Yes | Yes

Multi-node SMP Fabric RAS:
• CRC checked processor fabric bus retry with spare data lane and/or bandwidth reduction
N/A | N/A | Yes | Yes – Power10 design removes active components on cable and introduces internal cables to reduce backplane replacements

PCIe hot-plug with processor integrated PCIe controller$
Yes | Yes | Yes | Yes

Memory DIMM ECC supporting x4 Chipkill*
Yes | Yes | Yes | Yes

Uses IBM memory buffer and has spare DRAM module capability with x4 DIMMs*
No | Yes | Yes | Yes – New memory buffer

x8 DIMM support with Chipkill correction for marked DRAM*
N/A | N/A | Yes | N/A

Custom DIMM support with additional spare DRAM capability*
No | No | Yes | Yes – New custom DIMM

Active Memory Mirroring for the Hypervisor
No | Yes – Feature | Yes – Base | Yes – Base

Redundant/spare voltage phases on voltage converters for levels feeding processor and custom memory DIMMs or memory risers
No | Redundant | Both redundant and spare | Yes for processors; DDIMMs use on-board Power Management Integrated Circuits (PMICs)

Redundant processor clocks
No | No | Yes | Yes – New design

Redundant service processor and related boot facilities
No | No | Yes | Yes

Redundant TPM capability
No | No | Yes | Yes

Transparent Memory Encryption
No | No | No | Yes

Multi-node support
No | No | Yes | Yes

* In scale-out systems Chipkill capability is per rank of a single Industry Standard DIMM (ISDIMM); in the IBM Power E950, Chipkill and spare capability is per rank spanning across an ISDIMM pair; and in the IBM Power E980, per rank spanning across two ports on a Custom DIMM. The Power E950 system also supports DRAM row repair.
^ IBM Power® S914, IBM Power® S922, IBM Power® S924, IBM Power® H922, IBM Power® H924
$ Note: I/O adapter and device concurrent maintenance in this document refers to the hardware capability to allow the adapter or device to be removed and replaced concurrent with system operation. Support from the operating system is required and will depend on the adapter or device and configuration deployed.

Figure 2: Power10 Servers RAS Highlights Comparison

Columns: Power10 1s and 2s IBM Power System S10xx | Power10 IBM Power System E1050 | Power10 IBM Power System E1080

Base Power10™ Processor RAS features, including:
• First Failure Data Capture
• Processor Instruction Retry
• L2/L3 Cache ECC protection with cache line-delete
• Power/cooling monitor function integrated into on-chip controllers of processors
Yes | Yes | Yes

POWER10 Enterprise RAS Features:
• Core Checkstops
Yes | Yes | Yes

Multi-node SMP Fabric RAS:
• CRC checked processor fabric bus retry with spare data lane and/or bandwidth reduction
N/A | N/A | Yes – Power10 design removes active components on cable and introduces internal cables to reduce backplane replacements

PCIe hot-plug with processor integrated PCIe controller
Yes – no cassette | Yes – with blindswap cassette | Yes – with blindswap cassette

Memory DIMM ECC supporting x4 Chipkill*
Yes | Yes | Yes

Dynamic Memory Row Repair and spare DRAM capability
2U DDIMM – no spare DRAM; 4U DDIMM (post GA) – 2 spare DRAM per rank; Yes – dynamic row repair | Yes – base; 4U DDIMM – 2 spare DRAM per rank; Yes – dynamic row repair | Yes – base; 4U DDIMM – 2 spare DRAM per rank; Yes – dynamic row repair

Active Memory Mirroring for the Hypervisor
Yes – Base (new to scale-out) | Yes – Base | Yes – Base

Redundant/spare voltage phases on voltage converters for levels feeding processor
No | Yes, N+1 | Yes, N+2

Redundant on-board Power Management Integrated Circuits (PMIC) on memory DDIMMs
No with 2U DDIMM; Yes – optional 4U DDIMM (post GA) | Yes – Base, 4U DDIMM | Yes – Base, 4U DDIMM

Service Processor Type
Enterprise Baseboard Controller (eBMC) – open standard with Redfish support | Enterprise Baseboard Controller (eBMC) – open standard with Redfish support | Flexible Service Processor (FSP) – IBM proprietary

Processor Clocks Redundancy/Sparing
No | Integrated spare (post GA) | Redundant

Redundant service processor and related boot facilities
No | No | Yes

Redundant TPM capability
No | No | Yes

Transparent Memory Encryption
Yes | Yes | Yes

Multi-node support
N/A | N/A | Yes

Concurrent Op-Panel Repair
No – Op Panel base; Yes – LCD (post GA) | Yes – Op Panel base; Yes – LCD | Yes – Op Panel base; Yes – LCD

TOD Battery Concurrent Maintenance
No | Yes | Yes

Internal Cables
Lots of internal cables | Few internal cables | Very few internal cables

IBM Power E1080 System
As announced, the Power E1080 system is designed to be capable of supporting multiple drawers, each
containing four processor sockets. The initial product offering will be limited to two drawers.
In addition to these CEC drawers, the design supports a system control drawer, which contains the global
service processor modules. Each CEC drawer supports 8 PCIe slots, and I/O expansion drawers are also
supported.
System Structure
A rough view of the Power E1080 system design is represented in the figure below.
Compared to the Power E980, the most visible changes are in the memory subsystem and the fabric
busses that connect the processors.

Figure 3: Power E1080 System Structure Simplified View

Not illustrated in the figure above are the internal connections for the SMP fabric busses, which will be
discussed in detail in another section.
In comparing the Power E1080 design to the POWER9-based Power E980 system, it is also interesting to
note that the processor clock function has been separated from the local service functions and now
resides on a pair of separate clock cards.
The POWER8 multi-drawer system design required that all processor modules be synchronized across all
CEC drawers. Hence a redundant clock card was present in the system control drawer and used for all
the processors in the system.

In POWER9, only the processor modules within each CEC drawer were required to be synchronized to the
same reference clock source. In the Power E1080, each processor can run asynchronously to all the other
processors and use a separate reference clock source. While a clock and a redundant clock are provided
in each node (rather than for each processor), there is no need for logic to keep them coordinated. This
allows for a significantly simpler clock design.

Power E1080 Processor RAS


While there are many differences internally in the Power10 processor compared to POWER9 that relate
to performance, number of cores and so forth, the general RAS philosophy for how errors are handled
has remained largely the same.
The section on POWER9 subsystem RAS, therefore, can still be referenced for understanding the design.
Two items which have been changed are the memory controller function and the SMP connections
between the processors. These will be discussed in the next two sub-sections.
The processor clocking has also changed as previously discussed.

Power E1080 System Memory


Figure 4: DDIMM Memory Features

The memory subsystem of the Power E1080 system has been completely redesigned to support
Differential DIMMs (DDIMMs) with DDR4 memory that leverage a serial interface to communicate
between processors and the memory.
A memory DIMM is generally considered to consist of one or more "ranks" of memory modules (DRAMs).
A standard DDIMM module may consist of 1 or 2 ranks of memory and is approximately 2 standard
"rack units" high (called a 2U DDIMM). The Power E1080 system exclusively uses a larger DDIMM with up
to 4 ranks per DDIMM (called a 4U DDIMM). This allows not only additional system capacity but also
room for additional RAS features to better handle failures on a DDIMM without needing to take a repair
action (additional self-healing features).

The components of the memory interface include the memory controller on the processors, the open
memory interface (OMI) that a memory controller uses to communicate with the memory buffer on each
DIMM, the DRAM modules that support a robust error detection and correction capability (ECC), the on-
DIMM power management integrated circuits (PMICs) and thermal monitoring capabilities.
These components are considered below, looking first at what is standard for both 2U and 4U DDIMMs
and then at what is unique to the 4U DDIMMs that the Power E1080 systems use.
Buffer
The DDIMM incorporates a new Microchip memory buffer designed to IBM's RAS specifications. Key
RAS features of the buffer include protection of critical data/address flows using CRC, ECC and parity, a
maintenance engine for background memory scrubbing and memory diagnostics, and a Fault Isolation
Register (FIR) structure which enables firmware attention-based fault isolation and diagnostics.
Unlike POWER9 systems, this memory buffer does not contain an L4 cache.
OMI
The OMI interface between the memory buffer and processor memory controller is protected by dynamic
lane calibration, as well as a CRC retry/recovery facility to re-transmit lost frames to survive intermittent
bit flips. A complete lane fail can also be survived by triggering a dynamic lane reduction from 8 to 4,
independently for both up and downstream directions. A key advantage of the OMI interface is that it
simplifies the number of critical signals that must cross connectors from processor to memory compared
to a typical industry standard DIMM design.
Memory ECC
The DDIMM includes a robust 64-byte Memory ECC, with 8-bit symbols, capable of correcting up to five
symbol errors (one x4 chip and one additional symbol), as well as retry for data and address
uncorrectable errors.
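As a worked illustration of the arithmetic behind that claim (our own back-of-the-envelope sketch using only figures quoted above and in the ECC note later in this document, not IBM design data): with 8-bit symbols and a burst of 8 DDR beats, each DQ of an x4 DRAM contributes one symbol, so a whole x4 chip maps to four symbols and "one x4 chip and one additional symbol" equals five correctable symbols.

```python
# Illustrative arithmetic only (not IBM code). Figures come from the text:
# 8-bit ECC symbols, x4 DRAMs, and a burst of 8 DDR beats.

DATA_BYTES = 64          # the ECC word protects a 64-byte block of data
SYMBOL_BITS = 8          # each ECC symbol is 8 bits (one DRAM DQ over 8 beats)
DRAM_WIDTH_BITS = 4      # an x4 DRAM drives 4 data bits (DQs) per beat
BEATS = 8                # DDR burst length

data_symbols = DATA_BYTES * 8 // SYMBOL_BITS                    # 64 symbols of data
symbols_per_x4_dram = DRAM_WIDTH_BITS * BEATS // SYMBOL_BITS    # 4 symbols per x4 chip

# "One x4 chip and one additional symbol" from the text:
worst_correctable = symbols_per_x4_dram + 1                     # 5 symbol errors

print(data_symbols, symbols_per_x4_dram, worst_correctable)     # 64 4 5
```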
Dynamic Row Repair
To further extend the life of the DDIMM, the dynamic row repair feature can restore full use of a DRAM for
a fault contained to a DRAM row, while the system continues to operate.
Spare Temperature Sensors
Each DDIMM provides spare temperature sensors, such that the failure of one does not require a DDIMM
replacement.
Unique to 4U DDIMM: Spare DRAMs
4U DDIMMs used in the Power E1080 system include two spare x4 memory modules (DRAMs) per rank.
These can be substituted for failed DRAMs during runtime operation. Combined with ECC correction, the
2 spares allow the 4U DDIMM to continue to function with 3 bad DRAMs per rank, compared to 1 (single
device data correct) or 2 (double device data correct) bad DRAMs in a typical industry standard DIMM
design.
This extends self-healing capabilities beyond what is provided with dynamic row repair capability.
Power Management
In the Power E980, voltage regulator modules (VRMs) in the CEC drawers were used to provide the
different voltage levels to the CDIMMs, where the levels were used directly or further divided on the
CDIMM.
The DDIMMs used in the Power E1080 instead use power management ICs (PMICs) to divide voltages and
provide power management for the DDIMMs. Separate memory VRMs are no longer used.
The 4U DDIMMs also include spare PMICs, such that the failure of one PMIC does not require a DDIMM
replacement.

Power E1080 SMP Interconnections
Power E1080 Processor to Processor SMP Fabric Interconnect Within a CEC
Drawer
To communicate between processors within a CEC drawer, the Power E1080 uses a fabric bus
composed of eight bi-directional lanes of data. The physical busses are run from processor to processor
through the CEC drawer main backplane.
The data transferred is CRC checked. Intermittent errors can result in a retry of the operation. During run
time, should a persistent error occur on a bus, the system can reduce from using eight lanes to four lanes.
This capability is called "½ bandwidth mode".
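The following is a minimal conceptual sketch (ours, not IBM firmware or hardware logic; the names and retry threshold are invented) of the recovery hierarchy just described: retry a CRC-failed transfer, and fall back to ½ bandwidth mode only if the error proves persistent.

```python
# Conceptual sketch only -- not IBM firmware or hardware logic. It mirrors the
# recovery hierarchy described above: CRC-checked transfers are retried for
# intermittent errors; a persistent problem drops the bus from eight lanes to
# four ("1/2 bandwidth mode") so the system keeps running until serviced.

RETRY_LIMIT = 3  # invented threshold for treating an error as persistent

def transfer(send_frame, frame, bus):
    """Send one fabric-bus frame, retrying and degrading bandwidth as needed."""
    for lanes in (8, 4):                        # full bandwidth, then 1/2 bandwidth
        if bus["active_lanes"] < lanes:
            continue                            # bus already degraded below this width
        for _ in range(RETRY_LIMIT):
            if send_frame(frame, lanes=lanes):  # send_frame returns True if CRC passes
                bus["active_lanes"] = lanes
                return True                     # intermittent errors absorbed by retry
        bus["active_lanes"] = 4                 # persistent error: enter 1/2 bandwidth mode
    return False                                # still failing: report for service
```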

Power E1080 CEC Drawer to CEC Drawer SMP Fabric Interconnect Design
The SMP fabric busses used to connect processors across CEC drawers are similar in RAS function to the
fabric busses used between processors within a CEC drawer. Each bus is functionally composed of eight
bi-directional lanes of data. CRC checking with retry is also used, and ½ bandwidth mode is supported.
Unlike the processor-to-processor design within a drawer, the lanes of data are carried from each
processor module through internal cables to external cables and then back through internal cables to the
other processor.
Physically, each processor module has eight pads (four on each side of the module). Each pad side has
an internal SMP cable bundle which connects from the processor pads to a bulkhead in each CEC drawer,
which allows the external and internal SMP cables to be connected to each other.
Figure 5: SMP Fabric Bus Slice

The illustration above shows just the connections on one side of the processor.
In addition to connecting with the bulkhead, each cable bundle also has connections to an SMP cable
validation card, which has logic used to verify the presence and type of a cable to help guide the
installation of cables in a system.

Since physical cables are used, each bus also employs a spare data lane. This can be used to substitute
for a failed lane without the need to enter ½ bandwidth mode or the need to schedule a repair.
The ability to concurrently repair an external cable that failed during run-time before a system is rebooted
is also supported.
One key difference in this design compared to the POWER9 SMP design is that the external SMP cable is
no longer an active cable. This can be an advantage in reducing the number of components that can fail
on the cable, but it does make it harder to isolate the root cause of a cable fault.
To maintain error isolation with the new cost-effective design, an advanced diagnostic capability called
Time Domain Reflectometry (TDR) is built into the Power10 processor, and a test is performed whenever a
bus goes into half-bandwidth mode.
Figure 6: Time Domain Reflectometry

Though it is beyond the scope of this whitepaper to delve into the exact details of how TDR works, as a
very rough analogy it can be likened to a form of sonar: when desired, a processor module that drives a
signal on a lane can generate an electrical pulse along the path to a receiving processor in another CEC
drawer. If there is a fault along the path, the driving processor can detect a kind of echo or reflection of
the pulse. The time it takes for the reflection to be received is indicative of where the fault is within the
cable path.
For faults that occur mid-cable, the timing is such that TDR should be able to determine exactly which
field replaceable unit to replace to fix the problem. If the echo is very close to a connection, two FRUs
might be called out, but in any case the use of TDR allows for good fault isolation for such errors while
allowing the Power10 system to take advantage of a fully passive path between processors.
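As a purely illustrative example of the timing principle (the propagation velocity and reflection time below are invented round numbers, not Power10 cable specifications), the distance to a fault follows from half the round-trip time of the reflected pulse:

```python
# Illustration of the TDR principle described above. The velocity factor and
# the example timing are invented round numbers, not Power10 specifications.

C = 3.0e8                 # speed of light in m/s
VELOCITY_FACTOR = 0.7     # assumed fraction of c for signal propagation in the cable
v = VELOCITY_FACTOR * C   # ~2.1e8 m/s

def fault_distance(reflection_time_s):
    """Estimate distance to a fault from the round-trip reflection time."""
    # The pulse travels to the fault and back, so divide the path length by 2.
    return v * reflection_time_s / 2.0

# Example: a reflection observed 20 nanoseconds after the pulse is driven
# would place the fault roughly 2.1 meters down the path.
print(f"{fault_distance(20e-9):.2f} m")
```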

Power E1080 SMP Fabric Interconnect Design within a CEC Drawer


Unlike the SMP fabric busses between processor drawers, the connections between processors within a
drawer are still routed processor-to-processor through the drawer "planar" or motherboard card instead of
cables. The design maintains CRC checking with retry and the ability to go into ½ bandwidth mode if
needed.

Internal I/O
The Power E1080 supports 8 PCIe Gen4/Gen5 slots in each CEC drawer. Up to 4 optional NVMe drives are
supported. USB support for the System Control Unit, if desired, can be provided by using a USB adapter in
one of the 8 PCIe slots and a USB cable connected to the System Control Drawer.
To determine the exact processor-to-slot assignments, refer to the system documentation or "redbook" for
the Power E1080.
Note, however, that location codes have changed for the CEC drawers.
POWER9 used location codes that began with 1; e.g., P1-C1 would refer to the first planar in a unit and the
first connector. In Power10 systems, location codes used to identify elements within the system begin with
zero; e.g., P0-C0 would be the first planar, first connector.

IBM Power E1050 System
The Power E1050 system is designed to support 2-socket, 3-socket and 4-socket processor
configurations. The 2-socket system can be upgraded to 3-socket or 4-socket in the field or at the
customer site. Each processor socket is a Dual Chip Module (DCM), which consists of 2 Power10
processor chips. A DCM connects to 16 memory channels, which equates to 64 DDIMMs per fully
populated E1050 system.
As with the E1080, E1050 customers will benefit from the highly reliable 4U DDIMM. With memory
RAS features such as redundant PMICs, the ability to withstand multiple Chipkill events, dynamic DRAM
row repair, and OMI channel half-bandwidth mode, the 4U DDIMM offers in-memory databases very high
application availability.
The E1050 introduces the Enterprise Baseboard Controller (eBMC) as the service processor. This is a
departure from the IBM proprietary FSP, which is used in the E1080 and previous IBM Power enterprise
servers. The E1050 eBMC design essentially brings the unmatched enterprise RAS features of the FSP
architecture to the industry standard BMC. The eBMC supports the Redfish API, an open standard
designed for simple and secure management of hybrid cloud infrastructure. This allows the E1050 to be
easily integrated into any hybrid cloud environment with simplified system serviceability.
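Because Redfish is an open DMTF standard, a generic REST client can interrogate an eBMC-based system. The sketch below is a minimal illustration only; the host name, credentials, and certificate handling are placeholders rather than E1050-specific values, and the exact resources exposed depend on the firmware level.

```python
# A minimal sketch of querying an eBMC through the DMTF Redfish REST API.
# Host name and credentials are placeholders; authentication policy and the
# resources exposed depend on the specific system and firmware level.
import requests

EBMC_HOST = "https://ebmc.example.com"   # hypothetical eBMC address
AUTH = ("admin", "password")             # placeholder credentials

# The Redfish service root is always /redfish/v1/ per the DMTF standard.
root = requests.get(f"{EBMC_HOST}/redfish/v1/", auth=AUTH, verify=False).json()

# Walk the Systems collection and print the health roll-up for each system.
systems = requests.get(EBMC_HOST + root["Systems"]["@odata.id"],
                       auth=AUTH, verify=False).json()
for member in systems.get("Members", []):
    system = requests.get(EBMC_HOST + member["@odata.id"],
                          auth=AUTH, verify=False).json()
    print(system.get("Id"), system.get("Status", {}).get("Health"))
```

The service root at /redfish/v1/ is the standard entry point defined by DMTF, so the same resources can be reached with any HTTP tool.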
The E1050 brings a significantly higher performing machine to the same form factor as the E950 while
maintaining best-in-class RAS. There are more processor cores per die, a faster processor interconnect,
higher memory bandwidth, and more PCIe Gen4/Gen5 slots, to name but a few improvements. These
components are protected by CRC and/or ECC with retry capability where appropriate. Infrastructure
redundancy and concurrent maintenance or hot-plug capability are designed into critical components.

System Structure
A simplified view of the Power E1050 system design is represented in the figure below:
The E1050 maintains the same system form factor and infrastructure redundancy as the E950. As
depicted in the E1050 system diagram below, there are 4 power supplies and 4 fan field replaceable
units (FRUs) to provide at least N+1 redundancy. These components can be concurrently maintained or
hot added/removed. There is also N+1 Voltage Regulation Module (VRM) phase redundancy for the
processors, and redundant Power Management Integrated Circuits (PMICs) supply voltage to the 4U
DDIMMs that the E1050 offers.
The E1050 Op Panel base and LCD are connected to the same planar as the internal NVMe drives. The
Op Panel base and LCD are separate FRUs and are concurrently maintainable. The NVMe backplane
also has 2 USB 3.0 ports, accessible through the front of the system, for OS use. Not shown in the
diagram are 2 additional OS USB 3.0 ports at the rear of the system, connected through the eBMC card.

Figure 7: Power E1050 System Structure Simplified View

[Figure: simplified block diagram of the Power E1050. It shows four fan FRUs and four power supplies (concurrently replaceable, collectively providing at least N+1 redundancy); the DASD backplane with ten NVMe bays, two front USB ports, and the concurrently replaceable Op Panel base and LCD; four processor sockets, each a dual-chip module with its VRMs and memory DDIMMs; the eBMC service processor card along with the TPM, clock circuitry, RTC battery, and system VPD; and eleven PCIe slots (C1-C11). All slots connect directly to a processor with no PCIe switches, and NVMe (3,4) and NVMe (8,9) are the recommended pairs for OS mirroring.]

Power E1050 Processor RAS


The E1050 processor module is a DCM, which differs from that of the E950, which uses a Single Chip
Module (SCM). Each DCM has 30 processor cores, which is 120 cores for a 4-socket E1050 system. In
comparison, a 4-socket E950 system supports 48 cores. The internal processor buses are twice as fast,
with the E1050 running at 32 Gbps. Despite the increased cores and the faster high-speed processor bus
interfaces, the RAS capabilities are essentially equivalent, with features like Processor Instruction Retry
(PIR), L2/L3 cache ECC protection with cache line delete, and CRC fabric bus retry that are characteristic
of P9 and P10 processors. As in the E950, when an internal or external fabric bus lane encounters a hard
failure, the lane can be dynamically spared out with no impact to system availability.

Power E1050 Memory RAS
Unlike the processor RAS characteristics, the E1050 memory RAS varies significantly from that of the
E950. The E1050 supports the same 4U DDIMMs as the E1080.
The memory DIMM comparison table below highlights the differences between the E950 DIMM and the
E1050 DDIMM. It also provides the RAS impacts of the DDIMMs, which are applicable to the E1080
servers as well. For more description of the DDIMM, refer to the E1080 System Memory section of this
document.
Figure 8: E950 DIMM vs E1050 DDIMM RAS Comparison

DIMM Form Factor
POWER9 E950 memory: riser card plus ISDIMMs
Power10 E1050 memory: 4U DDIMM
RAS impact:
• P10 4U DDIMM: single FRU, or fewer components to replace
• E950 DIMM: separate FRUs used for the memory buffer on the riser card and the ISDIMMs

Symbol Correction
POWER9 E950 memory: single symbol correction
Power10 E1050 memory: dual symbol correction
RAS impact:
• P10 4U DDIMM: a data pin fail (1 symbol) lining up with a single cell fail on another DRAM is still correctable
• E950 DIMM: a data pin fail (1 symbol) lining up with a single cell fail on another DRAM is uncorrectable

x4 Chipkill
POWER9 E950 memory: one spare DRAM per port or across a DIMM pair
Power10 E1050 memory: two spare DRAMs per port
RAS impact:
• P10 4U DDIMM: 1st Chipkill fixed with spare, 2nd Chipkill fixed with spare, 3rd Chipkill fixed with ECC, 4th Chipkill is uncorrectable
• E950 DIMM: 1st Chipkill fixed with spare, 2nd Chipkill fixed with ECC, 3rd Chipkill is uncorrectable

DRAM Row Repair
POWER9 E950 memory: static
Power10 E1050 memory: dynamic
RAS impact:
• P10 4U DDIMM: detect, fix, and restore at runtime without a system outage
• E950 DIMM: detect at runtime, but fix and restore requires a system reboot

L4 Cache
POWER9 E950 memory: yes
Power10 E1050 memory: no
RAS impact:
• P10 4U DDIMM: avoids L4 cache failure modes
• E950 DIMM: L4 cache fails contribute to DIMM replacements

Voltage Regulation Redundancy
POWER9 E950 memory: no
Power10 E1050 memory: yes
RAS impact:
• P10 4U DDIMM: can survive a voltage regulation component failure
• E950 DIMM: voltage regulation and associated components are a single point of failure

NOTE: A memory ECC code is defined by how many bits or symbols (groups of bits) it can correct. The P10 DDIMM memory buffer ECC code organizes the data into 8-bit symbols, and each symbol contains the data from one DRAM DQ over 8 DDR beats.

Power E1050 I/O RAS
The E1050 provides 11 general purpose PCIe slots that allow hot plugging of I/O adapters. These PCIe slots operate at Gen4 and Gen5 speeds. As shown in the table below, some of the PCIe slots support I/O expansion drawer cable cards.
Unlike the E950, the E1050 location codes start from index 0, as with all Power10 systems. However, slot C0 is not a general purpose PCIe slot; it is reserved for the eBMC service processor card. Another difference between the E950 and the E1050 is that all the E1050 slots are directly connected to a P10 processor, whereas in the E950 some slots are connected to the P9 processor by way of I/O switches.
All 11 general purpose PCIe slots are available in the 3S or 4S DCM configurations. In the 2S DCM configuration, only 7 PCIe slots are functional.

Figure 9: Power E1050 I/O Slot Assignments

Slot Type From Supports


C1 x8 G4 CP0 = DCM0/C0 PCIe Adapters, Cable card for I/O expansion
C2 x8 G5/x16 G4 CP2 = DCM2/C1 PCIe Adapters, Cable card for I/O expansion
C3 x8 G5/x16 G4 CP2 = DCM2/C1 PCIe Adapters, Cable card for I/O expansion
C4 x8 G5/x16 G4 CP2 = DCM2/C0 PCIe Adapters, Cable card for I/O expansion
C5 x8 G5/x16 G4 CP2 = DCM2/C0 PCIe Adapters, Cable card for I/O expansion
C6 x8 G4 CP1 = DCM1/C1 PCIe Adapters
C7 x8 G5 CP1 = DCM1/C1 PCIe Adapters, Cable card for I/O expansion
C8 x8 G5/x16 G4 CP1 = DCM1/C1 PCIe Adapters, Cable card for I/O expansion
C9 x8 G4 CP1 = DCM1/C0 PCIe Adapters
C10 x8 G5 CP1 = DCM1/C0 PCIe Adapters, Cable card for I/O expansion
C11 x8 G5/x16 G4 CP1 = DCM1/C0 PCIe Adapters, Cable card for I/O expansion

DASD Options
The E1050 provides 10 internal NVMe drives at Gen4 speeds. The NVMe drives are connected to DCM0 and DCM3. In a 2S DCM configuration, only 6 of the drives are available; a 4S DCM configuration is required to have access to all 10 internal NVMe drives. Unlike the E950, the E1050 has no internal SAS drives. An external drawer can be used to provide SAS drives.

The internal NVMe drives support OS-controlled RAID0 and RAID1 arrays, but no hardware RAID. For the best redundancy, OS mirroring and mirrored dual VIOS can be employed. To ensure as much separation as possible in the hardware path between mirror pairs, the following NVMe configuration is recommended:
a.) Mirrored OS: NVMe3,4 or NVMe8,9 pairs
b.) Mirrored dual VIOS:
   I. Dual VIOS: NVMe3 for VIOS1, NVMe4 for VIOS2
   II. Mirrored dual VIOS: NVMe9 mirrors NVMe3, NVMe8 mirrors NVMe4

Power E1050 Service Processor

The IBM Power10 E1050 system comes with a redesigned service processor based on a Baseboard
Management Controller (BMC) design with firmware that is accessible through open-source industry
standard APIs, such as Redfish. An upgraded Advanced System Management Interface (ASMI) web
browser user interface preserves the required enterprise RAS functions while allowing the user to perform
tasks in a more intuitive way.

Equipping the industry standard BMC with enterprise service processor functions that are characteristic of
FSP based systems, like the E1080, has led to the name Enterprise BMC (eBMC). As with the FSP, the
eBMC runs on its own power boundary and does not require resources from a system processor to be
operational to perform its tasks.

The service processor supports surveillance of the connection to the Hardware Management Console
(HMC) and to the system firmware (hypervisor). It also provides several remote power control options,
environmental monitoring, reset, restart, remote maintenance, and diagnostic functions, including console
mirroring. The BMC service processor's menus (ASMI) can be accessed concurrently during system operation, allowing system default parameters to be changed, error logs to be viewed and downloaded, and system health to be checked, all nondisruptively.

Redfish, an industry standard for server management, enables Power servers to be managed individually or as part of a large data center. Standard functions such as inventory, event logs, sensors, dumps, and certificate management are all supported with Redfish. In addition, new user management features support multiple users and privileges on the BMC via Redfish or ASMI. User management via Lightweight Directory Access Protocol (LDAP) is also supported. The Redfish event service provides a means of notification for specific critical events so that actions can be taken to correct issues. The Redfish telemetry service provides access to a wide variety of data (for example, power consumption and ambient, core, DDIMM, and I/O temperatures) that can be streamed at periodic intervals.
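To make the Redfish interface concrete, the short Python sketch below walks the service root to read basic system information and locate chassis sensor collections. It is illustrative only: the eBMC hostname, credentials, and certificate handling are placeholders, and the resource paths simply follow the DMTF Redfish conventions referenced above rather than documenting a specific firmware level.

import requests

EBMC = "https://ebmc.example.com"      # placeholder eBMC address
AUTH = ("admin", "password")           # placeholder credentials

def redfish_get(path):
    # Issue a Redfish GET and return the decoded JSON payload.
    # verify=False is for illustration only; production use should validate certificates.
    response = requests.get(EBMC + path, auth=AUTH, verify=False, timeout=30)
    response.raise_for_status()
    return response.json()

# Walk the service root to the first ComputerSystem resource and report its state.
systems = redfish_get("/redfish/v1/Systems")
first_system = redfish_get(systems["Members"][0]["@odata.id"])
print(first_system.get("Model"), first_system.get("PowerState"))

# List chassis-level sensor collections (temperatures, power, and so forth) where exposed.
chassis = redfish_get("/redfish/v1/Chassis")
for member in chassis["Members"]:
    sensors = redfish_get(member["@odata.id"]).get("Sensors")
    if sensors:
        print("Sensor collection:", sensors["@odata.id"])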

IBM Power S10xx System
The Power S10xx servers are the Power10 scale-out servers. These systems are offered in 1S and 2S DCM configurations with 2U and 4U CEC drawer options.
The S10xx servers more than double the performance of the S9xx servers in the same form factor. As with the E1050, the S10xx DCM provides up to 30 processor cores and 16 OMI memory channels. Despite the significantly increased performance, the S10xx design improves on or maintains RAS equivalent to that of the S9xx servers.

System Structure
There are multiple scale-out system models (MTMs) supported. For brevity, this document focuses on the largest configuration of the scale-out servers.
The simplified illustration in Figure 10 depicts the 2S DCM configuration with a 4U CEC drawer. Similar to the S9xx, there is infrastructure redundancy in the power supplies and fans. In addition, these components can be concurrently maintained, along with the op panel base, op panel LCD, internal NVMe drives, and I/O adapters.

Figure 10: Power S1024 System Structure Simplified View

[Figure content: simplified 4U, two-socket S1024 layout. It shows USB ports, the op panel base and LCD, a DASD backplane with sixteen internal NVMe bays, six dual-rotor fan FRUs, four power supplies, two processor modules (each a dual chip module containing two Power10 chips), memory DDIMMs, the TPM/USB controller, clock circuitry, the eBMC service processor with battery and real-time clock, system VPD, the main planar board, and PCIe slots C0-C4 and C6-C11 (a mix of Gen4 x8, Gen5 x8, Gen5 x8 / Gen4 x16, and an OpenCAPI-only connector). The legend distinguishes concurrently replaceable components, components that collectively provide at least n+1 redundancy, and subcomponents with at least n+1 redundancy. Key callouts: there are no PCIe switches, so all I/O slots have a direct connection to a processor; an NVMe internal storage (JBOF) card is supported in only three slots (C8, C10, and C11), and JBOF cards cannot be plugged into C8 and C10 at the same time.]

As depicted in the simplified illustration in Figure 11, there is another variation of the S10xx DCM. This DCM option, which is unique to the scale-out servers, contains only one P10 chip with processor cores and OMI memory channels; the primary purpose of the other P10 chip within the DCM is to provide PCIe connections to I/O devices. This P10 entry single chip module (eSCM) processor configuration gives customers the option to purchase a cost-reduced solution without losing any I/O adapter slots. Note that if the primary processor chip of an eSCM is nonfunctional and garded by the firmware, the associated I/O-only chip is deconfigured as well and all the attached I/O slots become unavailable.

Figure 11: Power S1024 System Structure Simplified View With eSCM

[Figure content: the same simplified S1024 layout as Figure 10, except that each processor module is an eSCM: the second Power10 chip in each module has 0 cores and serves only to provide PCIe connections. All other elements (op panel, NVMe backplane, fans, power supplies, DDIMMs, eBMC, system VPD, and PCIe slots C0-C4 and C6-C11) and the callouts about direct slot-to-processor connections and NVMe JBOF card placement are unchanged.]

Power S10xx Processor RAS
The S10xx systems use the same DCM processor module as the E1050. Refer to the E1050 processor RAS section and the general Power10 RAS comparison tables for more details.

Power S10xx Memory RAS


The S10xx memory is very different from that of the S9xx systems. The S10xx scale-out systems now support Active Memory Mirroring for the Hypervisor, which was not available in the S9xx servers. While the S9xx uses ISDIMMs, the S10xx supports the custom 2U DDIMM. With post-GA firmware, the 4U models of the S10xx will also offer the 4U DDIMM. The 2U DDIMM is a reduced-cost, reduced-RAS version of the 4U DDIMM offered with the E1080 and E1050 systems. The 2U DDIMM RAS characteristics were discussed in some detail in the E1080 section of this document.

Figure 12: S9xx ISDIMM vs S10xx DDIMM RAS Comparison

DIMM Form Factor
  POWER9 S9xx memory: Direct-attached ISDIMMs
  Power10 S10xx memory: 2U DDIMM
  RAS impact:
  • P10 2U DDIMM: Single FRU; fewer components to replace
  • S9xx DIMM: Separate FRUs used for the ISDIMMs

Symbol Correction
  POWER9 S9xx memory: Single symbol correction
  Power10 S10xx memory: Dual symbol correction
  RAS impact:
  • P10 2U DDIMM: A data pin fail (1 symbol) lining up with a single cell fail on another DRAM is still correctable
  • S9xx DIMM: A data pin fail (1 symbol) lining up with a single cell fail on another DRAM is uncorrectable

X4 Chip Kill
  POWER9 S9xx memory: Single DRAM chipkill correction, but no spare DRAM
  Power10 S10xx memory: Single DRAM chipkill correction, but no spare DRAM
  RAS impact:
  • P10 2U DDIMM: 1st chip kill fixed with ECC; 2nd chip kill is uncorrectable
  • S9xx DIMM: 1st chip kill fixed with ECC; 2nd chip kill is uncorrectable

DRAM Row Repair
  POWER9 S9xx memory: No
  Power10 S10xx memory: Dynamic
  RAS impact:
  • P10 2U DDIMM: Detect, fix, and restore at runtime without a system outage
  • S9xx ISDIMM: Neither static nor dynamic row repair supported

Voltage Regulation Redundancy
  POWER9 S9xx memory: No
  Power10 S10xx memory: No
  RAS impact:
  • P10 2U DDIMM: Voltage regulation and associated components are a single point of failure
  • S9xx DIMM: Voltage regulation and associated components are a single point of failure

NOTE: A memory ECC code is defined by how many bits or symbols (group of bits) it can correct. The
P10 DDIMM memory buffer ECC code organizes the data into 8-bit symbols and each symbol contains
the data from one DRAM DQ over 8 DDR beats.

Power S10xx I/O RAS
The S1024 provides 10 general purpose PCIe slots that allow hot plugging of I/O adapters. Some of these PCIe slots operate at Gen5 speeds, while a few are limited to Gen4 speeds. As shown in Figure 13, some of the PCIe slots support the NVMe JBOF (Just a Bunch Of Flash) cable card and the I/O expansion drawer cable card.
Both DCMs must be installed to have connections to all 10 PCIe I/O slots. If only one DCM is installed, the 1S processor module drives 5 general purpose PCIe slots (C7 to C11).

Figure 13: Power S1024 I/O Slot Assignments

Slot Type From Supports


C0 x8 G5/x16 G4 CP1 = DCM1/C1 PCIe Adapters, Cable card for I/O expansion
C1 x8 G4 CP1 = DCM1/C1 PCIe Adapters
C2 x8 G5 CP1 = DCM1/C1 PCIe Adapters, Cable card for I/O expansion
C3 x8 G5/x16 G4 CP1 = DCM1/C0 PCIe Adapters, Cable card for I/O expansion
C4 x8 G5/x16 G4 CP1 = DCM1/C0 PCIe Adapters, Cable card for I/O expansion
C7 x8 G5 CP0 = DCM0/C1 PCIe Adapters, Cable card for I/O expansion
C8 x8 G4 CP0 = DCM0/C1 PCIe Adapters, NVMe JBOF card
C9 x8 G5 CP0 = DCM0/C1 PCIe Adapters, Cable card for I/O expansion
C10 x8 G5/x16 G4 CP0 = DCM0/C0 PCIe Adapters, Cable card for I/O expansion, NVMe JBOF card
C11 x8 G5 CP0 = DCM0/C0 PCIe Adapters, Cable card for I/O expansion, NVMe JBOF card

DASD Options
The S10xx scale-out servers provide up to 16 internal NVMe drives at Gen4 speeds. The NVMe drives are connected to the processor via a plug-in PCIe NVMe JBOF (Just a Bunch Of Flash) card. Up to 2 JBOF cards can be populated in the S1024 and S1014 systems, with each JBOF card attached to an 8-pack NVMe backplane. The S1022 NVMe backplane only supports the 4-pack, which provides up to 8 NVMe drives per system. The JBOF cards are operational in PCIe slots C8, C10, and C11 only; however, the C8 and C10 slots cannot both be populated with JBOF cards simultaneously. As depicted in Figure 13 above, all 3 slots are connected to DCM0, which means a 1S system can have all the internal NVMe drives available. While the NVMe drives are concurrently maintainable, a JBOF card is not. Unlike the S9xx, the S10xx has no internal SAS drives. An external drawer can be used to provide SAS drives.

Power S10xx Service Processor


The S10xx class of systems uses the same eBMC service processor as the E1050 systems. Refer to the Power E1050 Service Processor section of this document for more information.

IBM Power® E980
Introduction
The Power E980 class systems are designed to maximize availability in support of systems with 4, 8, 12
and 16 sockets.
The figure below, in a rough fashion only, gives a logical abstraction of the design and highlights some of
the changes from POWER8.

Figure 14: Power E980 System Structure

[Figure content: the System Control Unit drawer (dual system VPD, op panel base with thermal sensors, LCD, USB, fans, batteries and real-time clocks, and dual global service processor cards) connected by cables to the system node drawers for FSP/clock and power interface signals. Each system node drawer contains four scale-up processors, banks of CDIMMs, local clock/control and power interface cards, VRMs, power supplies fed from two AC input sources, fans, four NVMe bays (NVMe-C1 to NVMe-C4), and eight PCIe slots (C1-C8). The legend distinguishes concurrently replaceable components, components collectively providing at least n+1 redundancy, spare sub-components within a FRU, and I/O slots supporting concurrent repair.]

Overview Compared to POWER8


The main building block of these systems is a 4-processor-module CEC drawer supporting connection to up to 32 custom memory DIMMs. Each CEC drawer contains 8 PCIe slots connected to the processors. These slots support PCIe Gen4, as opposed to PCIe Gen3 in the POWER8 systems.
New for POWER9, internal NVMe drives have been added as storage for limited-use situations. These are sold in pairs; each CEC drawer may have zero, two, or four drives.
Each system also contains a System Control Unit drawer. This drawer, powered from the first CEC drawer (and the second if present), has dual global flexible service processor cards as well as system VPD and an operator panel. Instead of containing an optional DVD drive, the System Control Unit drawer has a USB port for connecting an external USB drive.
Aside from the addition of the NVMe drives, the most obvious change is that the system control drawer no longer has system clock cards. This is due to a fundamental design change in POWER9: the POWER9 processor no longer requires clock synchronization across CEC drawers. Therefore, each system node drawer incorporates two clock cards, eliminating the separate global clock cards in the system control drawer.
The rest of this section discusses the structural differences between these systems and POWER8, and then goes into the RAS details of the rest of the design.
Across Node Fabric Bus RAS

The SMP interconnect fabric of a POWER processor is the logic used to communicate between processors and other components that share in the SMP cache coherency of the system.
Each fabric bus connecting a processor across CEC drawers is composed of two links, each corresponding to an SMP cable. Data is CRC checked, and a retry mechanism handles the occasional soft error. Each link includes enough lanes that it can also be considered to have a spare data lane.
Because each bus is composed of two links, for some link failures bus traffic can continue using only one link (in a lane-reduction or half-bandwidth mode). This mechanism can prevent certain outages that would require a system reboot on a POWER8 system.1
Figure 15: POWER9 Two Drawer SMP Cable Connections

The design supports concurrent repair of the SMP cable for faults discovered during runtime that are contained to the cable, provided the repair is performed before the system is rebooted.2
While the new fabric bus design for across-node busses has a number of RAS advantages as described, it does mean more cables to install for each bus compared to the POWER8 design. The POWER9 design addresses this with a cable installation process that uses LEDs at the pull-tab cable connections to guide the servicer through the cable plugging sequence.
Clocking
The POWER8 multi-node system required that each processor in the system use a single processor
reference clock source. The POWER8 processor could take in two different sources, providing for
redundancy, but each source had to be the same for each processor across all 4 CEC drawers.
This meant that the system control drawer contained a pair of “global” clock cards and the outputs of each
card had to be cabled to each CEC drawer in the system through a pair of local clock and control cards.

1 Running in half-bandwidth mode reduces performance not only for customer data but also for hypervisor and other firmware traffic across the bus. Because of this, there are limitations in the mid-November 2018 and earlier firmware as to which connections and how many busses can be at half-bandwidth without causing a system impact.
2 Supported by at least firmware level 930. Concurrent repair is only for faults that are discovered during run-time and repaired before system restart.

In POWER9 each Power E980 CEC drawer does not require a separate clock source. Consequently,
there are no global system clock cards in the system control drawer. Instead, each node has processor
clock logic added to the local clock and control cards. In this fashion the clocking continues to be
redundantly supplied to each processor while eliminating separate global clock cards and cables.
It should be noted that the local clock and control card also supplies redundant inputs for the processor
“multi-function” clock input used for such items as PCIe clocking. Failover to the redundant clock is
provided dynamically. Failover will be seen by adapter device drivers as PCIe EEH recoverable events.
System response to EEH recoverable events will depend on device driver and/or application-level
handling. Refer to the specifics of each adapter for details concerning EEH error handling/recovery
capabilities.
Internal I/O
As in the POWER8 design, each processor internally uses PHBs to provide x16 lanes for each of two PCIe busses, for a total of 8 PCIe slots per drawer. These slots, however, support PCIe Gen4 instead of Gen3.
In addition to supporting internal adapters, each of the internal PCIe slots can also accept a card that provides optical connectivity to an external I/O drawer: currently the same Gen3 I/O drawer offered with POWER8.
In addition, as previously described, processors 2 and 3 use another host bridge to provide connectivity to the NVMe drives. Processor 1 uses an additional PHB to connect to a new USB card. This card provides 3 USB connections; one of these is routed through a cable to the System Control Unit drawer and re-driven to a connector on the operator panel to allow for attachment of an external DVD drive.
Figure 16: I/O Connections

Processor | Slots (Gen4 x16) | Other
Proc0 | P1-C1, P1-C2 |
Proc1 | P1-C7, P1-C8 | USB 3.0 Card
Proc2 | P1-C3, P1-C4 | NVMe Drive 1 and Drive 2
Proc3 | P1-C5, P1-C6 | NVMe Drive 3 and Drive 4

NVMe Drives
POWER8 Enterprise system CEC drawers had no internal storage. This was primarily due to the expectation that Enterprise customers would make use of a Storage Area Network (SAN) both for user data availability and for the storage used by VIOS, which provides virtualized I/O redundancy.
As an alternative, external DASD drawers were also available with POWER8. However, feedback from a number of customers expressed a preference for internal storage for the VIOS root volume group even when a SAN was deployed for everything else.
To accommodate this need, POWER9 Enterprise systems have the option of internal NVMe drives for such purposes as VIOS rootvg storage. Each CEC drawer in the Power E980 system has 4 NVMe drive slots. The drives are sold in pairs, with the expectation that the data on each drive will be mirrored. The drives are connected to processors two and three using PCIe busses from each, as shown earlier.

To maximize availability, a drive and its redundant pair should be controlled by separate processors in single-drawer systems. In multi-drawer systems, different configurations might be used to maintain availability across multiple VIOS. One such method of maximizing availability would be to have redundant VIOS partitions, with each VIOS partition having mirrored drives that are not controlled by the same processor.
The NVMe drives support concurrent maintenance where applicable from a device driver and OS standpoint. They are connected to the CEC drawer backplane through a separate backplane and riser card that are not concurrently maintained.
Processor and Memory RAS
Section three of this paper will describe the processor and memory RAS characteristics of this system.
Infrastructure RAS Enhancements
As illustrated earlier, like the POWER8 system, the POWER9 CEC drawers and system control drawers
have many infrastructure design redundancy features.
Regardless of the number of CEC drawers, service processor redundancy is provided. Each CEC drawer
(or node) now has redundant processor clocks; additional redundant service infrastructure support; an
alternate boot processor module and boot firmware module; and dual trusted platform module (TPM)
cards. The Power E980 is the only POWER9 system that has these additional components.
In addition, in the CEC drawers, the power supplies provide at least n+1 power supply redundancy and support line-cord redundancy. Fans provide at least n+1 redundancy as well. The power supplies and fans also support concurrent maintenance.
I/O adapters support concurrent maintenance as well and can be made redundant using facilities such as
VIOS as previously discussed. The same concurrent maintenance capabilities for the I/O drawer and I/O
cables are unchanged from POWER8.
The system control drawer gets power from up to two CEC nodes, providing redundancy. The fans in the
system control drawer also support concurrent maintenance.
The Operator Panel design is new compared to POWER8. The Operator Panel is composed of two parts:
a base part, and a separate LED display. It also supports a USB connection. The full operator panel
assembly is still concurrently maintainable.
The FSP card has been somewhat re-designed; the most visible change being that the cables used to
connect the FSP to the CEC drawer are no longer on the FSP cards themselves.
This was done for serviceability reasons. An FSP card installation now incorporates LEDs as part of a
visual validation step. Batteries for the real-time clock used by each FSP are also still concurrently maintainable.
Code changes have also been made in these systems that provide additional RAS benefits. In POWER8
the FSP was used for processor run-time diagnostics. These diagnostics have been moved to run on
system processors under hypervisor control.

Power® E950
Introduction
The Power E950 base system design is flexible to support two or four processors with multiple options for
internal storage.
From a RAS standpoint the system design is somewhat different from the POWER8 comparable system.
Two items stand out in that regard: greater integration eliminating a separate I/O planar and a new design
for supporting industry standard DIMMs while maintaining the advantages of the IBM memory buffer.
The figure below gives a rough idea of the system configuration and capabilities regarding redundancy
and concurrent maintenance.

Figure 17: Power E950 System Illustration (Four Socket System)

[Figure content: simplified four-socket Power E950 layout. It shows USB ports, the op panel base and LCD, a DASD backplane with four NVMe and eight SAS bays, four power supplies, VRMs, four scale-up processors (CP0-CP3) each with two memory risers, the system planar with clock circuitry, TPM/service functions, PNOR, battery, real-time clock, and dual system VPD, PCIe slots C2-C12 (Gen4 x16 and x8 connections directly from the processors), two PCIe switch modules providing Gen3 connections to the four internal NVMe bays, slot C6, and the USB controller, and four dual-rotor fan FRUs. The legend distinguishes concurrently replaceable components, components collectively providing at least n+1 redundancy, and I/O slots supporting concurrent repair.]

The memory risers in the illustration above are shown in more detail in the following figure:

Figure 18: Power E950 Memory Riser

[Figure content: a Power E950 memory riser with four memory buffers, each buffer connected to a group of ISDIMM slots (16 ISDIMM slots per riser). One memory buffer and its ISDIMM slots are circled; this circled portion is referenced in the text below.]

The circled portion of the memory riser and DIMMs would be replaced by a single custom enterprise
DIMM (CDIMM) in the Power E980 design.
In addition to not being as densely packaged as the CDIMM design, there are some differences in RAS
capabilities with the E850 memory which will be discussed in the next section on processor and memory
RAS.
I/O
The figure above indicates that multiple I/O adapter slots are provided. Some are directly connected to the processors and others are connected through a PCIe switch. There is a connection for a USB 3.0 adapter. Generally, I/O adapters are packaged in cassettes and are concurrently maintainable (given the right software configuration).
Like the Power E980, two processors provide connections to the NVMe drives (as illustrated). In addition, the system provides connectivity for SAS DASD devices; these are connected using SAS adapters that can be plugged into slots C12 and C9.

Figure 19: Power E950 I/O Slot Assignments

Slot (if any) | Type | From | Supports
C2 | GEN4 x16 | Proc 3 | PCIe Adapters, Cable card for I/O expansion, CAPI 2.0
C3 | GEN4 x16 | Proc 3 | PCIe Adapters, Cable card for I/O expansion, CAPI 2.0
C4 | GEN4 x16 | Proc 2 | PCIe Adapters, Cable card for I/O expansion, CAPI 2.0
C5 | GEN4 x16 | Proc 2 | PCIe Adapters, Cable card for I/O expansion, CAPI 2.0

C6 | GEN3 x8 | Proc 1 through a switch | Reserved for Ethernet Adapter
(none) | GEN3 x4 | Proc 1 through a switch | NVMe Drive 2
(none) | GEN3 x4 | Proc 1 through a switch | NVMe Drive 4
C7 | GEN4 x16 | Proc 1 | PCIe Adapters, Cable card for I/O expansion, CAPI 2.0
C8 | GEN4 x16 | Proc 1 | PCIe Adapters, Cable card for I/O expansion, CAPI 2.0
C9 | GEN4 x8 | Proc 1 | PCIe Adapters, SAS Adapter for internal SAS
C10 | GEN4 x16 | Proc 0 | PCIe Adapters, Cable card for I/O expansion, CAPI 2.0
C11 | GEN4 x16 | Proc 0 | PCIe Adapters, Cable card for I/O expansion, CAPI 2.0
(none) | GEN3 x4 | Proc 0 through a switch | NVMe Drive 1
(none) | GEN3 x4 | Proc 0 through a switch | NVMe Drive 3
C13 | GEN3 x1 | Proc 0 through a switch | Internal USB 3.0 Card (dedicated)
C12 | GEN4 x8 | Proc 0 | PCIe Adapters, SAS Adapter for internal SAS
PCIe CAPI 2.0
The main illustration for the Power E950 does not show any PCIe CAPI 2.0 capabilities. The figure above
does indicate which slots support PCIe CAPI 2.0 adapters. It should be noted that CAPI adapters may not
be concurrently maintainable.
DASD Options
Like the Power E980, internal NVMe drives are provided for VIOS rootvg purposes. The Internal SAS
drives can be used for partition and other data.
There are two backplane options available for the SAS drives. The base option allows for connection by
one SAS adapter, or use of two SAS adapters with a split-backplane function.
Infrastructure
N+1 power supply redundancy and n+1 fan rotor redundancy are supported. Voltage regulator outputs for the voltages going to the processors and to the memory risers are provided at the n+1 level, meaning there is one more output phase provided than is needed.
New for POWER9, the operator panel has a split design with a separate hot-pluggable LCD display.

IBM Power® S914, IBM Power® S922, IBM Power® S924
Introduction
The systems described in this section are meant for scale-out applications. The POWER9 processor used in these systems has features specifically for the 1- and 2-socket scale-out servers, with some changes from the processor used in the scale-up systems described previously. In addition to not supporting multiple nodes, the most visible change is support for direct attachment of industry standard DIMMs rather than interfacing to memory through a buffer module.
The basic building block of the processor contains 2 cores, each supporting 4 SMT threads, that share an L2 and an L3 cache. On these scale-out servers the two cores may be "fused" together to form a single 8-SMT-thread fused core, or left unfused to provide up to 24 cores with 4 SMT threads each.
System Features
The IBM Power System S914, IBM Power System S922, IBM Power System H922, IBM Power System S924, and IBM Power System H924 servers are designed to run with the IBM POWER Hypervisor™ and support the AIX®, Linux™, and IBM i operating systems.
System feature highlights include:
• Power supply and fan cooling redundancy supporting concurrent maintenance
• Concurrent Maintenance for PCI adapters and internal disk-drives on the DASD
backplane
• Most System service can be performed without the need to take the server off the sliding
rails
• Direct Attach Industry Standard DIMMs as previously mentioned
Systems contain either one or two processor sockets with different tower and packaging options as well
as I/O and DASD capabilities.

Figure 20: Simplified View of 2 Socket Scale-Out Server

[Figure content: simplified two-socket scale-out server layout. It shows the op panel base and LCD, a DASD backplane with SAS drive bays, four power supplies, banks of direct-attach IS-DIMM slots for each of the two scale-out processors, VRMs, clock circuitry, TPM/service functions, PNOR, system VPD, USB, battery and real-time clock, processor control and other service functions, two PCIe switch modules, the main planar board, PCIe slots C2-C12, two slots for NVMe M.2 cards or SAS adapters, and three dual-rotor fan FRUs. The legend distinguishes concurrently replaceable components, components collectively providing at least n+1 redundancy, spare sub-components within a FRU, and I/O slots supporting concurrent repair.]

Note: There are differences in I/O connectivity between different models and there are multiple DASD
backplane options.
I/O Configurations
Like the Power E950, these systems have a mixture of internal I/O slots directly attached to processors and slots connected through a switch.

Figure 21: Scale-out Systems I/O Slot Assignments for 1 and 2 Socket Systems

Slot (if any) | Type | In 1 Processor Socket Systems | In 2 Processor Socket Systems | Notes
C2 | Gen4 x8 with x16 connector | Not present | Proc 1 |
C3 | Gen4 x16 | Not present | Proc 1 |
C4 | Gen4 x16 | Not present | Proc 1 |
C5 | Gen3 x8 or USB connector | From Proc 0 through switch 0 | From Proc 0 through switch 0 | On 1S is a PCIe adapter connector; on 2S provides a USB function (integrated)
C6 | Gen3 x8 with x16 connector | From Proc 0 through switch 0 | Proc 0 through a switch |
C7 | Gen3 x8 | From Proc 0 through switch 0 | Proc 0 through a switch |
C8 | Gen4 x8 with x16 connector | From Proc 0 | Proc 0 |
C9 | Gen4 x16 | From Proc 0 | Proc 0 |
C10 | Gen3 x8 (1S) or SAS connector (2S) | From Proc 0 through switch 1 | From Proc 1 through switch 1 | PCIe Adapters for 1S; provides a SAS function (integrated) on 2S
C11 | Gen3 x8 | From Proc 0 through switch 1 | From Proc 1 through switch 1 | PCIe Adapters; default slot for LAN on 2S systems
C12 | Gen3 x8 with x16 connector | From Proc 0 through switch 1 | From Proc 1 through switch 1 | PCIe Adapters
C49 | Gen3 x8 | From Proc 0 through switch 1 | Proc 0 through a switch | M.2 NVMe Drive Carrier 2 or 2nd SAS Adapter Card
C50 | Gen3 x8 | From Proc 0 through switch 0 | Proc 0 through a switch | M.2 NVMe Drive Carrier 1 or 1st SAS Adapter Card
Integrated | Gen4 x1 | Proc 0 | Proc 0 | USB 3.0 4 DIMM (Integrated)
Other documentation should be referred to for details on I/O options, including card profiles and power requirements.
In considering redundancy, it is important to understand that for certain faults within a processor, all of the I/O underneath that processor may become inaccessible. Likewise, when a PCIe switch is also incorporated into the path, certain switch faults can impact all the I/O underneath the switch.
In the illustration above, specific slots are dedicated for SAS adapters. The C49 and C50 slots can be used either for an NVMe M.2 adapter or for SAS adapters. When used for SAS adapters, these slots have connections between them for communication in a dual SAS adapter environment.
For the internal SAS adapters there are options to use two SAS adapters with a single backplane or with a split DASD backplane. Using two I/O adapters provides maximum redundancy.
Note also that certain slots support a cable card used to connect the system to an external I/O drawer. The I/O drawer RAS is discussed later.
Infrastructure and Concurrent Maintenance
Different models and configurations have different fan options. In general, power supply and fan rotor redundancy is at least n+1, meaning that the system continues to operate when one element fails. The capability of concurrently replacing a power supply or fan element is also provided.
Additionally, generally speaking, I/O adapters can be concurrently maintained from a system standpoint. The need to halt the I/O on the component being maintained and/or make use of I/O redundancy still applies. The SAS drives can also be maintained in a similar way.
Due to system requirements, the SAS adapters, GPU cards, CAPI cards, NVMe M.2 carrier cards, and the M.2 flash modules on the carrier cards are not concurrently maintainable.
In these systems the base component of the operator panel is standard on all configurations and is not concurrently maintainable. Use of the hot-pluggable LCD display component is optional for some configurations. Service requires there to be at least one LCD display per rack for rack-mounted systems.

POWER9: PCIe Gen3 I/O Expansion Drawer
PCIe Gen3 I/O Expansion Drawers can be used in systems to increase I/O capacity.

Figure 22: PCIe Gen3 I/O Drawer RAS Features

[Figure content: a Gen3 I/O drawer with two I/O modules. Each I/O module contains an optics interface and a PCIe switch feeding six I/O slots, and connects to the CEC drawer over x8 optical links. The drawer midplane carries the power/temperature control logic, fans, and power supplies.]

These I/O drawers are attached using a connecting card called a PCIe3 cable adapter that plugs into a PCIe slot of the main server. In POWER9 these cable cards have been redesigned in certain areas to improve error handling. These improvements include new routing for the clock logic within the cable card as well as additional recovery for faults during IPL. The Power10 cable card in the CEC module is further improved by using a switch instead of a re-timer, which enables better fault isolation.
Each I/O drawer contains up to two I/O drawer modules. An I/O module uses 16 PCIe lanes controlled from a processor in the system. Currently supported is an I/O module that uses a PCIe switch to supply six PCIe slots.
Two active optical cables are used to connect a PCIe3 cable adapter to the equivalent card in the I/O drawer module. While these cables are not redundant, as of firmware level FW830 or later the loss of one cable simply reduces the I/O bandwidth (the number of lanes available to the I/O module) by 50%.
Infrastructure RAS features for the I/O drawer include redundant power supplies, fans, and DC outputs of voltage regulators (phases).
The impact of the failure of an I/O drawer component can be summarized for most cases by the table below.

Figure 23: PCIe Gen3 I/O Expansion Drawer RAS Feature Matrix

Faulty Component | Impact of Failure | Impact of Repair | Prerequisites
I/O adapter in an I/O slot | Loss of function of the I/O adapter | I/O adapter can be repaired while the rest of the system continues to operate | Multipathing/redundancy, where implemented, can be used to prevent application outages
First fault on a data lane (in the optics between the PCIe3 cable adapter card in the system and the I/O module) | None: spare used | No repair needed: integrated sparing feature |
A second data lane fault or other failure of one active optics cable | System continues to run, but the number of active lanes available to the I/O module will be reduced | Associated I/O module must be taken down for repair; the rest of the system can remain active |
Other failure of the PCIe3 cable adapter card in the system or of the I/O module | Loss of access to all the I/O of the connected I/O module | Associated I/O module must be taken down for repair; the rest of the system can remain active | Systems with an HMC
One fan | System continues to run with the remaining fans | Concurrently repairable |
One power supply | System continues to run with the remaining power supplies | Concurrently repairable |
Voltage regulator module associated with an I/O module | System continues to run for a phase failure, transitioning to n mode; other faults will impact all the I/O in the module | Associated I/O module cannot be active during repair; the rest of the system can remain active | Systems with an HMC
Chassis Management Card (CMC) | No impact to the running system, but once powered off, the I/O drawer cannot be re-integrated until the CMC is repaired | I/O drawer must be powered off to repair (loss of use of all I/O in the drawer) | Systems with an HMC
Midplane | Depends on the source of the failure; may take down the entire I/O drawer | I/O drawer must be powered off to repair (loss of use of all I/O in the drawer) | Systems with an HMC

Section 2: General RAS Philosophy and Architecture
Philosophy
In the previous section three different classes of servers were described with different levels of RAS
capabilities for POWER9 processor-based systems. While each server had specific attributes, the section
highlighted several common attributes.
This section will outline some of the design philosophies and characteristics that influenced the design.
The following section will detail more specifically how the RAS design was carried out in each sub-system
of the server.
Integrate System Design
The systems discussed use a processor architected and designed by IBM. IBM systems contain other
hardware content also designed by IBM including memory buffer components, service processors and so
forth.
Additional components not designed or manufactured by IBM are chosen and specified by IBM to meet
system requirements. These are procured for use by IBM using a rigorous procurement process intended
to deliver reliability and design quality expectations.
The systems that IBM designs are manufactured by IBM to IBM’s quality standards.
The systems incorporate software layers (firmware) for error detection/fault isolation and support as well
as virtualization in a multi-partitioned environment.
These include IBM designed and developed service firmware. IBM’s PowerVM hypervisor is also IBM
designed and supported.
In addition, IBM offers two operating systems developed by IBM: AIX and IBM i. Both operating systems
come from a code base with a rich history of design for reliable operation.
IBM also provides widely used middleware and application software, such as IBM WebSphere® and DB2® pureScale™, as well as software used for multi-system clustering, such as various IBM PowerHA™ SystemMirror™ offerings.
These components are designed with application availability in mind, including the software layers, which
are also capable of taking advantage of hardware features such as storage keys that enhance software
reliability.

Figure 24: IBM Enterprise System Stacks (POWER8 Design Point)

The figure above indicates how IBM design and influence may flow through the different layers of a
representative enterprise system as compared to other designs that might not have the same level of
control. Where IBM provides the primary design and manufacturing test criteria, IBM can be responsible
for integrating all the components into a coherently performing system and verifying the stack during
design verification testing.
In the end-user environment, IBM likewise becomes responsible for resolving problems that may occur
relative to design, performance, failing components and so forth, regardless of which elements are
involved.
Incorporate Experience
Being responsible for much of the system, IBM puts in place a rigorous structure to identify issues that
may occur in deployed systems and identify solutions for any pervasive issue. Having support for the
design and manufacture of many of these components, IBM is best positioned to fix the root cause of
problems, whether changes in design, manufacturing, service strategy, firmware, or other code are needed.
The detailed knowledge of previous system performance has a major influence on future systems design.
This knowledge lets IBM invest in improving the discovered limitations of previous generations. Beyond
that, it also shows the value of existing RAS features. This knowledge justifies investing in what is
important and allows for adjustment to the design when certain techniques are shown to be no longer of
much importance in later technologies or where other mechanisms can be used to achieve the same
ends with less hardware overhead.

Architect for Error Reporting and Fault Isolation

It is not feasible to detect or isolate every possible fault or combination of faults that a server might experience; it is nonetheless important to invest in error detection and to build a coherent architecture for how errors are reported and faults are isolated. The sub-section on processor and memory error detection and fault isolation details the IBM Power approach for these system elements.
It should be pointed out that error detection may seem like a well understood and universal hardware
design goal. However, it is not always the goal of every computer sub-system design. Hypothetically, for instance, graphics processing units (GPUs), whose primary purpose is rendering graphics in non-critical applications, may have options for turning off certain error checking (such as ECC in memory) to allow for better performance. The expectation in such cases is that there are applications where a single dropped pixel on a screen is of no real importance, and a solid fault is only an issue if it is noticed.

In general, I/O adapters may also have less hardware error detection capability where they can rely on a software protocol to detect and recover from faults when such protocols are used.
Leverage Technology and Design for Soft Error Management


In a very real sense errors detected can take several forms. The most obvious is a functional fault in the
hardware – a silicon defect, or a worn component that over time has failed.
Another kind of failure is what is broadly classified as a soft error. Soft errors are faults that occur in a
system and are either occasional events inherent in the design or temporary faults that are due to an
external cause.
Data cells in caches and memory, for example, may have a bit value temporarily upset by an external event such as a cosmic-ray-generated particle. Logic in processor cores can also be subject to soft errors, where a latch may flip due to a particle strike or similar event. Busses transmitting data may experience soft errors due to clock drift or electronic noise.
The susceptibility to soft errors in a processor or memory subsystem is very much dependent on the design and technology used in these devices, which should be the first line of defense. Choosing latches that are less susceptible to upsets from cosmic ray events, for example, was discussed extensively in previous whitepapers. Interleaving data so that flips of two adjacent bits in an array will not cause undetected multi-bit errors in a data word is another design technique that might be used.
Ultimately, when data is critical, soft error events need to be detected immediately and in-line, to avoid relying on bad data; periodic diagnostics are insufficient to catch an intermittent problem before damage is done.
The simplest approach to detecting many soft error events may simply be having parity protection on data, which can detect a single bit flip. When such simple single-bit error detection is deployed, however, the impact of a single bit upset is bad data, and discovering bad data without being able to correct it will result in termination of an application, or even a system, so long as data correctness is important.
To prevent such a soft error from having a system impact, it is necessary not simply to detect a bit flip but also to correct it. This requires more hardware than simple parity. It has become common to deploy a bit-correcting error correction code (ECC) in caches that can contain modified data. Because such flips can occur in more than just caches, however, such ECC codes are widely deployed in POWER9 processors in critical areas on busses, caches, and so forth.
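As a purely illustrative aside, the difference between parity and a correcting ECC can be seen in a few lines of Python. The sketch below uses a textbook Hamming(7,4) code; it is not the symbol-based ECC actually used on POWER caches and DDIMMs, only a small model of why extra check bits allow a single flipped bit to be located and repaired rather than merely detected.

def parity(bits):
    # Even parity detects an odd number of flips but cannot locate or correct them.
    return sum(bits) % 2

def hamming74_encode(d):
    # Encode 4 data bits into a 7-bit codeword with parity bits at positions 1, 2, and 4.
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(codeword):
    # Recompute the three parity checks; a nonzero syndrome is the position of the flipped bit.
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1
    return c, syndrome

data = [1, 0, 1, 1]
codeword = hamming74_encode(data)
codeword[4] ^= 1                           # simulate a soft error: one upset bit
repaired, syndrome = hamming74_correct(codeword)
assert repaired == hamming74_encode(data)  # the single-bit upset is located and corrected

Parity alone would have flagged the upset but left nothing to do except discard the data; the additional check bits are what turn detection into correction.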
Protecting a processor from more than just data errors requires more than just ECC checking and correction. CRC checking with a retry capability is used on a number of busses, for example. Maximum protection is achieved when a fault is not only noticed but noticed quickly enough to allow processor operations to be retried. Where retry is successful, as would be expected for temporary events, system operation continues without application outages.

Deploy Strategic Spare Capacity to Self-Heal Hardware


Techniques that protect against soft errors offer limited protection against solid faults due to a real hardware failure. A single bit error in a cache, for example, can be continually corrected by most ECC codes that allow double-bit detection and single-bit correction.

However, if a solid fault is continually being corrected, a second fault that occurs will typically produce data that is not correctable. This would result in the need to terminate, at least, whatever is using the data.
In many system designs, when a solid fault occurs in something like a processor cache, the management software on the system (the hypervisor or OS) may be signaled to migrate work off the failing hardware and deconfigure it.
This is called predictive deallocation. Successful predictive deallocation may allow the system to continue to operate without an outage. To restore full capacity to the system, however, the failed component still needs to be replaced, resulting in a service action.
Within Power, the general philosophy is to go beyond simple predictive deallocation by incorporating strategic sparing or micro-level deallocation of components, so that when a hard failure occurs that impacts only a portion of the sub-system, full error detection capability can be restored without the need to replace the part that failed.
Examples include a spare data lane on a bus, a spare bit-line in a cache, caches split into multiple small sections that can be deallocated, and a spare DRAM module on a DIMM.
Within the general category of self-healing is the use of spares. Redundancy can also be deployed to avoid outages. The concepts are related, but there are differences between redundancy and the use of spares in the Power approach.

Redundant Definition
Redundancy is generally a means of continuing operation in the presence of certain faults by providing more components/capacity than is needed to avoid outages, but where a service action will be taken to replace the failed component after a fault.

Sometimes redundant components are not actively in use unless a failure occurs. For example, a processor may only actively use one clock source at a time even when redundant clock sources are provided.

In contrast, fans and power supplies are typically all active in a system. If a system is said to have "n+1" fan redundancy, for example, all "n+1" fans will normally be active in the system absent a failure. If a fan fails, the system will run with "n" fans. In cases where there are fan or power supply failures, power and thermal management code may compensate by increasing fan speed or making other adjustments according to operating conditions, per the power management mode and power/thermal management policy.

Spare Definition
A spare component is similar in nature, though when a spare is successfully used the system can continue to operate without the need to replace the failed component.
As an example, for voltage regulator output modules, if five output phases are needed to maintain the power required at a given voltage level, seven could be deployed initially. It would take the failure of three phases to cause an outage.
If on the first phase failure the system continues to operate and no call-out is made for repair, the first failing phase would be considered a spare. After that failure (the spare is said to be used), the VRM could experience another phase failure with no outage; this maintains the required n+1 redundancy. Should a second phase fail, a "redundant" phase would then be said to have failed, and a call-out for repair would be made.
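The distinction can be restated with the example numbers above (five phases needed, seven installed). The small Python sketch below is illustrative bookkeeping only, not actual power management firmware:

PHASES_NEEDED = 5
PHASES_INSTALLED = 7

def phase_status(failed_phases):
    # Report how the voltage regulator example above behaves as phases fail.
    working = PHASES_INSTALLED - failed_phases
    if working >= PHASES_NEEDED + 1:
        return "running; n+1 still maintained, no repair call-out (spare absorbed the fault)"
    if working == PHASES_NEEDED:
        return "running at n; redundancy has been used, so a repair call-out is made"
    return "outage; fewer phases remain than are required"

for failed in range(4):
    print(failed, "failed phase(s):", phase_status(failed))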

Focus on OS Independence
Because Power Systems have long been designed to support multiple operating systems, the hardware RAS design is intended to allow the hardware to take care of the hardware, largely independent of any operating system involvement in error detection or fault isolation (excluding I/O adapters and devices for the moment).
To a significant degree this error handling is contained within the processor hardware itself. However, service diagnostic firmware, depending on the error, may aid in the recovery. When fully virtualized, specific OS involvement in such tasks as migrating off a predictively failed component can also be performed transparently to the OS.
The PowerVM hypervisor is capable of creating logical partitions with virtualized processor and memory resources. When these resources are virtualized by the hypervisor, the hypervisor has the capability of deallocating fractional resources from each partition when necessary to remove a component such as a processor core or logical memory block (LMB).
When an I/O device is directly under the control of the OS, the error handling of the device is the device driver's responsibility. However, I/O can be virtualized through the VIOS offering, meaning that I/O redundancy can be achieved independent of the OS.

Build System Level RAS Rather Than Just Processor and Memory RAS
IBM builds Power systems with the understanding that every item that can fail in a system is a potential source of outage.
While building a strong base of availability for computational elements such as the processors and memory is important, it is hardly sufficient to achieve application availability. The failure of a fan, a power supply, a voltage regulator, or an I/O adapter might be more likely than the failure of a processor module designed and manufactured for reliability.
Scale-out servers maintain redundancy in the power and cooling subsystems to avoid system outages due to common failures in those areas. Concurrent repair of these components is also provided.
For the Enterprise systems, a higher investment in redundancy is made. The Power E980 system, for example, is designed from the start with the expectation that the system must be largely shielded from failures of these other components causing persistent system unavailability, incorporating substantial redundancy within the service infrastructure (such as redundant service processors, redundant processor boot images, and so forth). Emphasis is also placed on using components that are themselves highly reliable and meant to last.
This level of RAS investment extends beyond what is expected and often beyond what is seen in other server designs. For example, at the system level such selective sparing may include elements such as a spare voltage phase within a voltage regulator module.

Error Reporting and Handling


First Failure Data Capture Architecture
POWER processor-based systems are designed to handle multiple software environments including a
variety of operating systems. This motivates a design where the reliability and response to faults is not
relegated to an operating system.
Further, the error detection and fault isolation capabilities are intended to enable retry and other
mechanisms to avoid outages due to soft errors and to allow for use of self-healing features. This requires
a very detailed approach to error detection.
This approach is beneficial to systems as they are deployed by end-users, but also has benefits in the
design, simulation, and manufacturing test of systems as well.

Putting this level of RAS into the hardware cannot be an after-thought. It must be integral to the design
from the beginning, as part of an overall system architecture for managing errors. Therefore, during the
architecture and design of a processor, IBM places a considerable emphasis on developing structures
within it specifically for error detection and fault isolation.
Each subsystem in the processor hardware has registers devoted to collecting and reporting fault
information as faults occur.
The exact number of checkers and type of mechanisms isn’t as important as is the point that the
processor is designed for very detailed error checking; much more than is required simply to report during
run-time that a fault has occurred.
All these errors feed a data reporting structure within the processor. There are registers that collect the
error information. When an error occurs, that event typically results in the generation of an interrupt.
The error detection and fault isolation capabilities maximize the ability to categorize errors by severity and
handle faults with the minimum impact possible. Such a structure for error handling can be abstractly
illustrated by the figure below and is discussed throughout the rest of this section.

Figure 25: Handled Errors Classified by Severity and Service Actions Required

No Action Needed
- Fault is masked from reporting; no error log is generated.

Information Only
- An informational log is generated for development use only, typically only after a threshold has been reached.

Recoverable By Hardware
- Examples: cache CE events, CRC errors, etc.
- Processor Runtime Diagnostics is signaled for error reporting.

Recoverable By Hypervisor
- Examples: certain data lookup array errors; the hypervisor manages soft errors.
- Single incidents are informational logs; repeated errors may trigger a threshold callout of more severe actions.

Predictive Callout Recoverable Over Threshold
- A serviceable event is logged; self-healing or predictive deallocation is invoked.
- Call home on first instance, and the component is guarded if possible.

Uncorrectable Predictive (e.g., UE in a cache)
- Predictive callout immediately, with self-healing actions if possible.
- If the data is used, whatever is using the data must terminate (limited impact uncorrectable).

Limited Impact Uncorrectable
- Whatever is using the resource with the faulty data must terminate.
- Can be limited to a partition for some processor errors, or even to an application for a data UE.
- For I/O errors this may just mean ending use of the I/O and/or the applications using it (I/O redundancy may limit the impact).
- If the error impacts the hypervisor, the hypervisor must terminate (see below).

Hypervisor Impacting Events
- If the hypervisor sees an error that can't be corrected on a resource it owns, the hypervisor terminates.
- The resource is garded and, where possible, the system reboots.

System Checkstop Events
- The hardware terminates the system immediately when the error is such that it is known that the hypervisor can't recover or contain it, based on the class of error.

Note: Within a POWER9 system a single hypervisor manages all logical partitions. There is no physical partitioning used within a single system.

Processor Runtime Diagnostics


In previous system designs, the dedicated service processor ran code, referred to here as Processor
Runtime Diagnostics (PRD), which would access this information and direct the error management.
Ideally this code primarily handles recoverable errors including orchestrating the implementation of
certain “self-healing” features such as use of spare DRAM modules in memory, purging and deleting
cache lines, using spare processor fabric bus lanes, and so forth.

Code within a hypervisor does have control over certain system virtualized functions, particularly as it
relates to I/O including the PCIe controller and certain shared processor accelerators. Generally, errors in
these areas are signaled to the hypervisor.
In addition, there is still a reporting mechanism for what amounts to the more traditional machine-check or
checkstop handling.
In a POWER7 generation system the PRD was said to run and manage most errors whether the fault
occurred at run-time, or at system IPL time, or after a system-checkstop – which is the descriptive term
for entire system termination by the hardware due to a detected error.
In POWER8 the processor module included a Self-Boot-Engine (SBE) which loaded code on the
processors intended to bring the system up to the point where the hypervisor could be initiated. Certain
faults in early steps of that IPL process were managed by this code and PRD would run as host code as
part of the boot process.
In POWER9 processor-based systems, during normal operation the PRD code is run in a special service
partition in the system, on the POWER9 processors, using the hypervisor to manage the partition. This has
the advantage, in systems with a single service processor, of allowing the PRD code to run during normal
system operation even if the service processor is faulty.

Service Diagnostics in the POWER9 Design


In a POWER7 generation server, PRD and other service code was all run within the dedicated service
processor used to manage these systems. The dedicated service processor was in charge of the IPL
process used to initialize the hardware and bring the servers up to the state where the hypervisor could
begin to run. The dedicated service processor was also responsible, as previously described, for running the
PRD code during normal operation.
In the rare event that a system outage resulted from a problem, the service processor had access not
only to the basic error information stating what kind of fault occurred, but also access to considerable
information about the state of the system hardware – the arrays and data structures that represent the
state of each processing unit in the system, and additional debug and trace arrays that could be used to
further understand the root cause of faults.
Even if a severe fault caused system termination, this access provided the means for the service
processor to determine the root cause of the problem, deallocate the failed components, and allow the
system to restart with failed components removed from the configuration.
POWER8 gained a Self-Boot-Engine which allowed processors to run code and boot using the POWER8
processors themselves to speed up the process and provide for parallelization across multiple nodes in
the high-end system. During the initial stages of the IPL process, the boot engine code itself handled
certain errors, and the PRD code ran as an application in later stages if necessary.
In POWER9 the design has changed further so that during normal operation the PRD code itself runs in a
special hypervisor-partition under the management of the hypervisor. This has the advantage of
continuing to allow the PRD code to run even if the service processor is non-functional (important in non-
redundant environments.)
Should a fault cause the running code to fail, the hypervisor can restart the partition (reloading and restarting the
PRD.)
The system service processors are also still monitored at run-time by the hypervisor code and can report
errors if the service processors are not communicating.
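
The run-time monitoring mentioned above is essentially a heartbeat check between hypervisor code and the service processor. The loop below is a generic, hypothetical illustration of such a watchdog; the polling interval, miss threshold, and function names are assumptions and do not describe PowerVM internals.

```python
import time

HEARTBEAT_INTERVAL_S = 5   # assumed polling interval
MISSED_LIMIT = 3           # consecutive misses before an error is reported


def service_processor_responding():
    """Placeholder for a real liveness query to the service processor."""
    return True


def monitor_service_processor(poll=service_processor_responding, cycles=10):
    missed = 0
    for _ in range(cycles):                # bounded here so the sketch terminates
        if poll():
            missed = 0
        else:
            missed += 1
            if missed >= MISSED_LIMIT:
                # A real system would log a serviceable event indicating the
                # service processor is not communicating.
                print("service processor not communicating - report error")
                missed = 0
        time.sleep(HEARTBEAT_INTERVAL_S)


monitor_service_processor(cycles=1)
```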

PowerVM Partitioning and Outages


The PowerVM hypervisor provides logical partitioning allowing multiple instances of an operating system
to run in a server. At a high level, a server with PowerVM runs with a single copy of the PowerVM
hypervisor regardless of the number of CEC nodes or partitions.

The PowerVM hypervisor uses a distributed model across the server’s processor and memory resources.
In this approach some individual hypervisor code threads may be started and terminated as needed when
a hypervisor resource is required. Ideally when a partition needs to access a hypervisor resource, a core
that was running the partition will then run a hypervisor code thread.
Certain faults that might impact a PowerVM thread will result in a system outage if they should occur. This
can be by PowerVM termination or by the hardware determining that, for PowerVM integrity, the system
will need to checkstop.
The design cannot be viewed as a physical partitioning approach. There are not multiple independent
PowerVM hypervisors running in a system. If for fault isolation purposes, it is desired to have multiple
instances of PowerVM and hence multiple physical partitions, separate systems can be used.
Not designing a single system to have multiple physical partitions reflects the belief that the best
availability can be achieved if each physical partition runs in completely separate hardware. Otherwise,
there is a concern that when resources for separate physical partitions come together in a system, even
with redundancy, there can be some common access point and the possibility of a “common mode” fault
that impacts the entire system.

Section 3: POWER9 Subsystems RAS Details
Processor RAS Details.
It is worth noting that the functions integrated within a processor have changed much over time. One point
of comparison is the POWER6 processor compared to the current POWER9.

Figure 26: POWER6 Processor Design Compared to POWER9


(Figure: side-by-side block diagrams. The POWER6 processor contains two SMT2 cores, each with a 4MB L2 cache and directory, an external L3 cache controller, a memory controller with ports to buffered DIMMs, a GX+ bus controller, and fabric bus connections to other processors and nodes. The POWER9 scale-up processor contains up to twelve SMT8 cores, each with its own L2 and L3 cache, two memory controller/DMI interfaces, the TP/NMMU/Int/MCD/VAS and NX units, an on-chip controller, a self-boot engine, a PCIe interface with three PHBs, CAPI and NVLink interfaces, and X- and O-fabric bus links.)

The illustration above shows a rough view of the POWER9 scale-up processor design leveraging SMT8
cores (a maximum of 12 cores shown.)
The POWER9 design is certainly more capable. There is a maximum of 12 SMT8 cores compared to 2
SMT2 cores. The core designs architecturally have advanced in function as well. The number of memory
controllers has doubled, and the memory controller design is also different.
The addition of system-wide functions such as the NX accelerators and the CAPI and NVLINK interfaces
provide functions just not present in the hardware of the POWER6 system.
The POWER9 design is also much more integrated. The L3 cache is internal, and the I/O processor host
bridge is integrated into the processor. The thermal management is now conducted internally using the
on-chip controller.
There are reliability advantages to the integration. It should be noted, however, that when a failure does occur it
is more likely to be a processor module at fault compared to previous generations with less integration.
Hierarchy of Error Avoidance and Handling
In general, there is a hierarchy of techniques used in POWER™ processors to avoid or mitigate the
impact of hardware errors. At the lowest level in the hierarchy are the design for error detection and fault
isolation and the technology employed, specifically as it relates to reducing the instances of soft errors, not only
through error correction but also in the selection of devices within the processor IC.
Because of the extensive amount of functionality beyond just processor cores and caches, listing all the
RAS capabilities of the various system elements would require considerable detail. In somewhat broader
strokes, the tables below discuss the major capabilities in each area.

Figure 27: Error Handling Methods Highlights

Cache Error Handling: The L2 and L3 caches in the processor use an ECC code that allows for single bit correction. During operation, when a persistent correctable error occurs in these caches, the system has the capability of purging the data in the cache (writing it to another level of the hierarchy) and deleting the cache line (meaning subsequent cache operations won't use that cache line during operation.) This is an example of "self-healing" that avoids taking a planned outage for a correctable error. The L1 cache (usually thought of as part of the processor core element) checks for single bit errors, but instead of using an error correcting code, intermittent L1 cache errors can be corrected using data from elsewhere in the cache hierarchy. A portion of an L1 cache can be disabled (set delete) to avoid outages due to persistent hard errors. If too many errors are observed across multiple sets, the core using the L1 can be predictively deallocated. Where the IBM memory buffer is used, the L4 cache also has an ECC code that allows for single bit error correction. In addition to the system caches as described above, there are the cache directories which provide indexing to the caches. These also have single bit error correction. Uncorrectable directory errors will typically result in system checkstops as discussed below.

Other ECC Checking: Beyond these elements, single bit correcting ECC is used in multiple areas of the processor as the standard means of protecting data against single bit upsets (beyond the reliability design features previously mentioned.) This includes a number of the internal busses where data is passed between units.

Special Uncorrectable Error Handling: Where there is ECC in the path for data and an uncorrectable error is encountered, the desire to prevent reliance on bad data means that whatever is using the data will need to be terminated. In simpler system designs, that would mean termination of something, at least the owner of the data, as soon as the uncorrectable error is encountered. POWER has long had the concept of marking the data with a special ECC code and watching for when and if the data is going to be "consumed" by the owner (if the data is ever used.) At that point whatever the data owner is can be terminated. This can be a single application if a processor running under AIX is consuming user data. For kernel data, the OS kernel may be terminated. If PowerVM attempts to use the data in a critical operation, then PowerVM will terminate. One additional advantage of the special error correction code is that the hardware is able to distinguish between a fresh ECC error when data is transferred and one that has been passed along. This allows the correct component, the one originating the fault, to be called out as the component to be replaced.

Bus CRC: As previously mentioned, ECC is used internally in various data-paths as data is transmitted between units. Externally to the processor, however, high speed data busses can be susceptible to occasional multiple bit errors due to the nature of the bus design. A cyclic redundancy check (CRC) code can be used to determine if there are errors within an entire packet of data. If the fault is due to natural changing of bus characteristics over time ("drift"), the bus can re-train, retry the operation and continue. CRC checking is done for the memory bus. New in POWER9: CRC checking is now done for the processor fabric bus interfaces sending data between processors.

Lane Repair: By itself, CRC checking has the advantage of handling multiple bit errors. For persistent problems it is in certain ways inferior to bit-correcting ECC if a single persistent error cannot be corrected by retry. The memory bus between processors and the memory uses CRC with retry. The design also includes a spare data lane so that if a persistent single data error exists the faulty bit can be "self-healed." The POWER9 busses between processors also have a spare data lane that can be substituted for a failing one to "self-heal" single bit errors.

Split Data Bus (½ bandwidth mode): In busses going between processor nodes (between CEC drawers in a Power E980 system), if there is a persistent error that is confined to the data on a single cable (and the bus is split between cables), POWER9 can reduce the bandwidth of the bus and send data across just the remaining cable. This allows for correction of a more systematic fault across the bus. In addition, the bus between processors within a node or single CEC drawer system is also capable of a split-bus or ½ bandwidth mode. Full support of SMP bus features may depend on firmware level.

Processor Instruction Retry: Within the logic and storage elements commonly known as the "core" there can be faults other than errors in the L1 mentioned above. Some of these faults may also be transient in nature. The error detection logic within the core elements is designed extensively enough to determine the root cause of a number of such errors and catch the fault before an instruction using the facility is completed. In such cases, processor instruction retry capabilities within the processor core may simply retry the failed operation and continue. Since this feature was introduced in POWER6, processor field data has shown many instances where this feature alone has prevented serious outages associated with intermittent faults. In POWER9, faults that are solid in nature and where retry does not solve the problem will be handled as described further down in this table.

Predictive Deallocation: Because of the amount of self-healing incorporated in these systems as well as the extensive error recovery, it is rare that an entire processor core needs to be predictively deallocated due to a persistent recoverable error. If such cases do occur, PowerVM can invoke a process for deallocating the failing processor dynamically at run-time (save in the rare case that the OS application doesn't allow for it.)

Core Checkstops: On scale-up systems where the use of many partitions may be common, if a fault cannot be contained by any of the previous features defined in the hierarchy, it may still be beneficial to contain an outage to just the partitions running threads on the core when the uncorrectable fault occurred. The Core Checkstop feature (sometimes called a core-contained checkstop) may do this in the systems supporting the feature, for faults that can be contained that way. This allows the outage associated with the fault to be contained to just the partition(s) using the core at the time of the failure, if the core is only being used at the time by partition(s) other than in hypervisor mode. It should be noted that in such an environment system performance may be impacted for the remaining cores after the core-contained checkstop. Note: A core checkstop signaled for a fault occurring on a core with any thread running a hypervisor instruction will typically result in hypervisor termination as described below. The possibility of encountering the hypervisor running on a thread may have a correlation to the number of threads in use.

PowerVM Handled Errors: There are other faults that are handled by the hypervisor that relate to errors in architected features – for example, handling an SLB multi-hit error. Since the required error handling is documented for any hypervisor, details can be found in the POWER9 Processor User's Manual maintained by the OpenPOWER™ Foundation: https://openpowerfoundation.org/?resource_lib=power9-processor-users-manual

PCIe Hub Behavior and Enhanced Error Handling: Each processor has three elements called PCIe hubs which generate the various PCIe Gen4 busses used in the system. The hub is capable of "freezing" operations when certain faults occur, and in certain cases can retry and recover from the fault condition. This hub freeze behavior prevents faulty data from being written out through the I/O hub system and prevents reliance on faulty data within the processor complex when certain errors are detected. Along with this hub freeze behavior is what has long been termed Enhanced Error Handling (EEH) for I/O. This capability signals device drivers when various PCIe bus related faults occur. Device drivers may also attempt to restart the adapter after such faults (EEH recovery) depending on the adapter, device driver and application. A clock error in the PCIe clocking can be signaled and managed using EEH in any system that incorporates redundant PCIe clocks with dynamic clock failover enabled.

Hypervisor Termination and System Checkstops: If a fault is contained to a core, but the core is running PowerVM code, PowerVM may terminate to maintain the integrity of the computation of the partitions running under it. In addition, each processor fault checked is reviewed during design. If it is known in advance that a particular fault can impact the hypervisor, or if there is a fault in a processor facility whose impact cannot be localized by other techniques, the hardware will generate a platform checkstop to cause system termination. This by design allows for the most efficient recovery from such errors and invokes the full error determination capabilities of the service processor. In scale-out systems where core-contained checkstop is not applicable, hypervisor termination or system checkstops would be expected for events managed as core-contained checkstops in scale-up systems.

Memory Channel Checkstops and Memory Mirroring: The memory controller communicating between the processor and the memory buffer has its own set of methods for containing errors or retrying operations. Some severe faults require that memory under a portion of the controller become inaccessible to prevent reliance on incorrect data. There are cases where the fault can be limited to just one memory channel. In these cases, the memory controller asserts what is known as a channel checkstop. In systems without hypervisor memory mirroring, a channel checkstop will usually result in a system outage. However, with hypervisor memory mirroring the hypervisor may continue to operate in the face of a memory channel checkstop.

Processor Safe-mode: Failures can occur in certain processor-wide facilities, such as the power and thermal management code running on the on-chip-sequencer and the self-boot-engine used during run-time for out-of-band service processor functions. To protect system function, such faults can result in the system running in a "safe-mode" which allows processing to continue with reduced performance when the ability to access power and thermal performance or error data is not available.

Persistent Guarding of Failed Elements: Should a fault in a processor core become serious enough that the component needs to be repaired (a persistent correctable error with all self-healing capabilities exhausted, or an unrecoverable error), the system will remember that a repair has been called for and not re-use the processor core on subsequent system reboots, until repair. This functionality may be extended to other processor elements and to entire processor modules when relying on that element in the future means risking another outage. In a multi-node system, deconfiguration of a processor for fabric bus consistency reasons can result in deconfiguration of the entire node.
As systems design continues to mature the RAS features may be adjusted based on the needs of the
newer designs as well as field experience; therefore, this list differs from the POWER8 design.
For example, in POWER7+, an L3 cache column repair mechanism was implemented to be used in
addition to the ability to delete rows of a cache line.
This feature was carried forward into the POWER8 design, but the field experience in the POWER8
technology did not show the benefit, based on how faults that might impact multiple rows surfaced. For
POWER9 enterprise systems, given the size of the caches, the number of lines that can be deleted was
extended instead of continuing to carry column repair as a separate procedure.
Going forward, the number of rows that need to be deleted is adjusted for each generation based on
technology.
Likewise, in POWER6 a feature known as alternate processor recovery was introduced. The feature had
the purpose of handling certain faults in the POWER6 core. The faults handled were limited to certain
faults where the fault was discovered before an instruction was committed and the state of the failing core
could be extracted. The feature, in those cases, allowed the failing workload to be dynamically retried on
an alternate processor core. In cases where no alternate processor core was available, some number of
partitions would need to be terminated, but the user was allowed to prioritize which partitions would be
terminated to keep the most important partition running.
The mixture of which parts could be handled by this kind of design, and the method of delivering the
design itself, changed from generation to generation. In POWER8, field observations showed that
essentially, and in part due to advances in other error handling mechanisms, the faults where alternate
processor recovery was required and viable had become exceedingly rare. Therefore, the function is not
being carried over in POWER9.

Processor Module Design and Test
While Power Systems are equipped to deal with certain soft errors as well as random occasional
hardware failures, manufacturing weaknesses and defects should be discovered and dealt with before
systems are shipped. So before discussing error handling in deployed systems, how manufacturing
defects are avoided in the first place, and how error detection and fault isolation is validated will be
discussed.
Again, IBM places a considerable emphasis on developing structures within the processor design
specifically for error detection and fault isolation.
The design anticipates that not only should errors be checked, but also that the detection and reporting
methods associated with each error type also need to be verified. When there is an error class that can
be checked, and some sort of recovery or repair action initiated, there may be a hardware method to
“inject” an error that will test the functioning of the hardware detection and firmware capabilities. Such
error injecting can include different patterns of errors (solid faults, single events, and intermittent but
repeatable faults.) Where direct injection is not provided, there should be a way to at least simulate the
report that an error has been detected and test response to such error reports.
Under IBM control, assembled systems are also tested and a certain amount of system “burn-in” may also
be performed, doing accelerated testing of the whole system to weed-out weak parts that otherwise might
fail during early system life, and using the error reporting structure to identify and eliminate the faults.
In that mode the criteria may be more severe as to what constitutes a failure. A single fault, even if it’s
recoverable, might be enough to fail a part during this system manufacturing process.

POWER9 Memory RAS
Memory RAS Fundamentals

Model Comparative Memory RAS


Unlike the previous generation, the different PowerVM based POWER9 systems have different
approaches to the memory subsystem.
These are summarized roughly in the figure below:

Figure 28 : Memory Design Points

POWER9 Memory Options

1-2 Socket Scale-out Systems (direct-attach ISDIMMs):
- Leverages a new direct attach memory approach with no buffer chip.
- Allows for Chipkill correction using x4 DRAMs.
- Improved over POWER7 with ISDIMMs by correcting 2 instead of 1 random symbols in the absence of Chipkill.
- No sparing.

E950 4-Socket System (memory buffers on riser cards driving ISDIMMs):
- With x4 DIMMs, across a DIMM pair: x4 Chipkill correction, plus 1 spare DRAM module, and row repair.*
- Will only ship x4 DIMMs.
- A separate FRU (the riser card) is used for the memory buffers.

E980 Systems (custom DIMMs with a 128-bit ECC word; 2-rank DIMMs supporting 2 DRAM groups):
- Supports x4 and x8 DIMMs.
- Enterprise level of error correction for x4 and x8 DIMMs.
- Spare DRAM modules on the custom DIMM even when x8 DIMMs are used.
- Highest DIMM integration means fewer failures compared to the E950 ISDIMM design (details in the reliability estimates).

* For some DIMMs with FW released on 11/16/20 or later.

Memory Design Basics


To fully understand the reasons for the differences some background is required.
In many servers the basic building block of main system memory is Dynamic Random Access Memory
integrated circuit modules (DRAMs).
In a very simplified sense, a DRAM module can be thought of as an array of bits many rows deep but
typically either 4 or 8 columns wide. These can be referred to as x4 or x8 DRAMs.
Processors read main memory in units called a memory cache line. Many non-POWER systems have a
memory cache line that is 64 bytes. These cache lines are typically read in bursts of activity, a memory
“word” at a time, where a memory word would be 64 bits of data plus additional check bits allowing for some level
of error correction.
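
As a quick arithmetic check of these numbers (a minimal sketch; the 18-DRAM x4 rank matches the typical industry-standard DIMM layout shown in the figure below):

```python
# Filling a 64-byte cache line from one rank of an x4 DIMM.
cache_line_bits = 64 * 8                          # 512 bits per cache line
data_bits_per_beat = 64                           # data portion of one ECC word
print(cache_line_bits // data_bits_per_beat)      # 8 beats in the burst

drams_per_rank, bits_per_dram = 18, 4             # typical x4 rank (see figure)
bits_per_beat = drams_per_rank * bits_per_dram    # 72 bits transferred per beat
print(bits_per_beat - data_bits_per_beat)         # 8 check bits accompany each word
```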

Figure 29: Simplified General Memory Subsystem Layout for 64 Byte Processor Cache Line

(Figure: a processor with its cache and memory controller connects through a memory bus driver/receiver to a memory buffer, which drives a 1-rank x4 DIMM with 18 DRAMs. A 64-byte cache line (512 bits) requires 8 ECC words from the DIMM, accessed in a single 8-beat burst, with data read at the speed of the DIMM. Each ECC word carries 64 bits of data plus additional bits for the checksum.)

On a DIMM multiple DRAM modules are accessed to create a memory word. An “Industry Standard”
DIMM is commonly used to supply processors fetching 64 byte cache lines.
A rank of an x4 “Industry Standard” DIMM contains enough DRAMs to provide 64 bits of data at a time
with enough check bits to correct the case of a single DRAM module being bad after the bad DRAM has
been detected and then also at least correct an additional faulty bit.
The ability to generally deal with a DRAM module being entirely bad is what IBM has traditionally called
Chipkill correction. This capability is essential in protecting against a memory outage and should be
considered as a minimum error correction for any modern server design.
Inherently DIMMs with x8 DRAMs use half the number of memory modules and there can be a reliability
advantage just in that alone. From an error checking standpoint, only 4 bad adjacent bits in an ECC word
need to be corrected to have Chipkill correction with a x4 DIMM. When x8 DIMMs are used, a single
Chipkill event requires correcting 8 adjacent bits. A single Industry Standard DIMM in a typical design will
not have enough check bits to handle a Chipkill event in a x8 DIMM.
Rather than directly accessing the DIMMs through the processor, some server designs take advantage of
a separate memory buffer. A memory buffer can improve performance and it can also allow servers to
take advantage of two Industry standard DIMMs by extending the ECC word in some sense across two
DIMMs.

Figure 30: One Method of Filling a Cache Line Using 2 x8 Industry Standard DIMMs

(Figure: a processor with its cache and memory controller connects through a memory bus driver/receiver to a memory buffer driving two 1-rank x8 DIMMs of 9 DRAM modules each. A 128-byte cache line (1024 bits) requires 8 ECC words from the 2 DIMMs, each accessed in a single 8-beat burst; combined, 128 bits of data are delivered across the two DIMMs per beat, plus up to 16 additional bits for the checksum. A 64-byte cache line could instead be filled from the 2 DIMMs, each accessed in a chop-burst yielding half the data of a full 8-beat burst, or read independently and managed by the memory buffer to fill the 64-byte cache line.)

Hence across two DIMMs with x4 DRAM modules, the failure of a single DRAM module can be tolerated
and then another, or what might be called double-Chipkill, though this is not a precise term. With x8
DRAMs, at least across the two DIMMs, a single Chipkill event can be corrected once the fault has been
identified and the bad DRAM marked.
In system designs where only 64 byte cache lines are filled, there can be a performance penalty
associated with running in 128 byte mode across two DIMMs, since the cache lines are still 64 bytes and
the DIMMs are still designed to access 64 bytes in a burst. Hence in newer designs, 128 byte ECC word
access may only be switched to after the first Chipkill event.
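
The following sketch restates that check-bit arithmetic using a common rule of thumb for symbol-correcting (Reed-Solomon style) codes: correcting one unknown bad symbol requires at least two check symbols, where the symbol width matches the DRAM width. The exact codes used in real designs differ; this is only an illustration of why the wider 128-bit word helps with x8 DRAMs.

```python
def chipkill_feasible(check_bits, dram_width):
    """Rule of thumb: need >= 2 check symbols to correct one unknown bad DRAM."""
    check_symbols = check_bits // dram_width
    return check_symbols >= 2

print(chipkill_feasible(check_bits=8, dram_width=4))    # True:  x4 on a single DIMM
print(chipkill_feasible(check_bits=8, dram_width=8))    # False: x8 on a single DIMM
print(chipkill_feasible(check_bits=16, dram_width=8))   # True:  x8 across a DIMM pair
```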

POWER Traditional Design


IBM Power systems traditionally don’t use a 64 byte-cache line. Instead, a 128 byte cache line is used.
Traditionally an external IBM designed buffer module is used to allow the equivalent of two Industry
Standard DIMMs to be accessed to fill in the 128 byte cache line.
In POWER8, instead of using Industry Standard DIMMs, the servers used a custom DIMM design where
the IBM memory buffer was packaged on the DIMM with the equivalent of the DRAMs of four industry
standard DIMMs. In addition, the custom DIMMs carried additional spare DRAMs not found in industry
standard DIMMs.
With custom DIMM design, IBM could use x8 DRAMs where it was advantageous to do so, still mark out
a failed memory module, and on top of that offer spare DRAM modules.
For x4 DIMMs, Chipkill correction is provided with even more spare capacity.

1s and 2s Scale-out Design


The POWER9 processor used in 1 and 2 socket scale-out servers is designed internally for Industry
Standard DIMMs without an external buffer chip. The ECC checking is at the 64 bit level, so Chipkill
protection is provided with x4 DIMMs, plus some additional sub-Chipkill level error checking after a
Chipkill event. Because it is a 64 bit ECC word, however, there is no spare capability.

Figure 31 : 1s and 2s Memory Illustrated

IBM does not believe that using DIMMs without Chipkill capability is appropriate in this space, so x8
DIMMs are not offered.

Power E950 Design Point


As previously illustrated, the Power E950 makes use of IBM’s memory buffer and Industry standard
DIMMs.
Generally, with x4 DIMMs, across a pair of DIMMs, per rank, there is sufficient capability to provide
Chipkill capability; and after detection of a fault, to reconfigure to employ a spare DRAM module.
This provides what is termed Chipkill plus spare DRAM protection across the pair of DIMMs.
If x8 DIMMs were offered, there would be Chipkill correction but no spare. Believing that avoiding both the
unplanned and planned outages in this space is important, x8 DIMMs are not offered by IBM.
DRAM Row Repair
To further extend the ability to avoid planned outages for the Power E950, firmware released on
11/16/2018 or later supports a concept known as DRAM row repair for certain DIMMs.
DRAMs on the 32GB, 64GB and 128GB DIMMs are manufactured with at least an extra, spare “row” of
data. If a Spare DRAM is used during operation due to a fault that is contained to a single row, on system
reboot, row repair can substitute the spare row for the bad row and bring the DRAM back into use. This
allows the recovered DRAM to be used again as a spare.
This can improve the self-healing capabilities of the DIMMs beyond simply the use of the spare DRAM
module alone.
In future firmware releases, DRAM row repair might be extended to the 8GB and 16GB DIMMs as well.
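
A conceptual sketch of the reboot-time decision described above is shown below; the data structure and function names are invented for illustration and do not reflect actual firmware interfaces.

```python
class DramState:
    def __init__(self):
        self.is_spare_in_use = False      # spare DRAM currently substituted in
        self.fault_single_row = False     # fault confined to one DRAM row
        self.spare_row_available = True   # these DRAMs ship with at least one spare row


def reboot_row_repair(dram):
    """Decide whether the spare DRAM can be reclaimed at IPL via row repair."""
    if dram.is_spare_in_use and dram.fault_single_row and dram.spare_row_available:
        # Substitute the spare row for the bad row inside the DRAM itself,
        # bring the DRAM back into use, and free the spare DRAM again.
        dram.spare_row_available = False
        dram.is_spare_in_use = False
        return "row repaired: DRAM returned to use, spare DRAM available again"
    return "no row repair: spare DRAM remains in use"


d = DramState()
d.is_spare_in_use = True
d.fault_single_row = True
print(reboot_row_repair(d))
```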

Power E980 Design Point
DDR4 Custom DIMMs as described previously are used in the Power E980 design. This allows for both
x4 and x8 DIMMs to be used and still offer more spare capability than the Power E950 that uses standard
DIMMs.

Figure 32: Memory Subsystem using CDIMMs

Memory Controller
§ Supports a 128-byte cache line
§ Hardened “stacked” latches for soft error protection
§ Replay buffer to retry after soft internal faults
§ Special uncorrectable error handling for solid faults

Memory Bus
§ CRC protection with recalibration and retry on error
§ Spare data lane can be dynamically substituted for a failed one

Memory Buffer
§ Same technology as the POWER8 processor chips
– Hardened “stacked” latches for soft error protection
§ Can retry after internal soft errors
§ L4 cache implemented in eDRAM technology
– DED/SEC ECC code
– Persistent correctable error handling

16 GB DIMM
§ 4 ports of memory
– 10 x8 DRAMs attached to each port
– 8 needed for data
– 1 needed for error correction coding
– 1 additional spare
§ 2 ports are combined to form a 128-bit ECC word
– 8 reads fill a processor cache line
§ The second port pair can be used to fill a second cache line
– (Much like having 2 DIMMs under one memory buffer, but housed in the same physical DIMM)

(Diagram: a POWER8 DCM with 8 memory buses supporting 8 buffered DIMMs; each 2-rank DIMM supports two 128-bit-ECC-word DRAM groups.)
Note: Bits used for data and for ECC are spread across 9 DRAMs to maximize error correction capability.
DRAM protection: can handle at least 2 bad x8 DRAM modules in every group of 18 (3 if not all 3 failures are on the same sub-group of 9).

Memory RAS Beyond ECC


A truly robust memory design incorporates more RAS features than just the ECC protection of course.
The ability to use hardware accelerated scrubbing to refresh memory that may have experienced soft
errors is a given. The memory bus interface is also important. The direct bus attach memory used in the
P9 scale-out servers supports RAS features expected in that design including register clock driver (RCD)
parity error detection and retry.
IBM’s CDIMM memory buffer supports bus CRC with a spare data lane and retry as previously described,
plus soft error handling features inside the buffer. The buffer chip also contains an L4 cache which has its
own performance and RAS attributes separate from the memory discussion here.

Hypervisor Memory Mirroring


As an additional capability to memory sub-systems RAS, the ability to mirror memory can provide an
additional protection against memory related outages.
In general, mirroring memory has a down-side: there is the additional cost of the mirrored memory, mirroring
reduces usable memory capacity, and it may also have an impact on performance, for example due to the
need to write to two different memory locations whenever data is stored.

POWER9 Scale-up processors provide a means to mirror just the memory used in critical areas of the
PowerVM hypervisor. This provides protection to the hypervisor so that it does not need to terminate due
to faults on a DIMM that cause uncorrectable errors.

Figure 33: Active Memory Mirroring for the Hypervisor

(Figure: hypervisor memory is mirrored across two memory controllers. For mirrored LMBs, writes go to each side and reads alternate between sides, or come from one side only when a DIMM fails. Partition logical memory blocks are not mirrored: if uncorrectable memory errors are seen, the affected LMBs are deallocated on partition termination and the partitions are restarted using memory from other DIMMs.)

By selectively mirroring only the segments used by the hypervisor, this protection is provided without the
need to mirror large amounts of memory.
It should be noted that the PowerVM design is a distributed one. PowerVM code can execute at times on
each processor in a system and can reside in small amounts in memory anywhere in the system.
Accordingly, the selective mirroring approach is fine grained enough not to require the hypervisor to sit in
any particular memory DIMMs. This provides the function while not compromising the hypervisor
performance as might be the case if the code had to reside remotely from the processors using a
hypervisor service.
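
The read/write behavior described for mirrored hypervisor memory can be sketched conceptually as follows; the class and its methods are invented for illustration and are not PowerVM code.

```python
class MirroredBlock:
    """Conceptual model of a mirrored logical memory block (LMB)."""

    def __init__(self, size):
        self.copies = [bytearray(size), bytearray(size)]  # two physical copies
        self.failed = [False, False]                      # per-copy failure state
        self._next = 0                                    # alternates the read side

    def write(self, offset, data):
        # Writes always go to both surviving sides so the copies stay identical.
        for side, copy in enumerate(self.copies):
            if not self.failed[side]:
                copy[offset:offset + len(data)] = data

    def read(self, offset, length):
        # Reads alternate between sides, or use the surviving side only
        # when one copy has experienced an uncorrectable error.
        candidates = [s for s in (0, 1) if not self.failed[s]]
        side = candidates[self._next % len(candidates)]
        self._next += 1
        return bytes(self.copies[side][offset:offset + length])


blk = MirroredBlock(size=16)
blk.write(0, b"hypervisor data!")
blk.failed[0] = True              # simulate an uncorrectable DIMM fault on one side
print(blk.read(0, 16))            # data still available from the surviving copy
```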

Dynamic Deallocation/Memory Substitution


For Enterprise Systems with custom DIMMs, the primary means of deallocating predictively bad memory
in the system is leveraging the extensive set of spare DRAMs provided.
For errors impacting only a small amount of memory, PowerVM can deallocate a single page of memory
through the OS for a single cell fault in memory.
For more severe errors, after spare memory is used, PowerVM can deallocate memory at logical memory
block level. That may mean terminating partitions using memory after uncorrectable errors are
encountered. For predictive errors, deallocation can be accompanied by substituting unallocated logical
memory blocks for the deallocated blocks where there is enough unallocated memory. Where there is not enough

unallocated memory, PowerVM may deallocate additional blocks once workload using the predictively
failing memory frees up the memory.
It should be understood that typically, memory is interleaved across multiple DIMMs for maximum
performance and when deallocating memory, the entire set of memory in an interleave group has to be
deallocated and, where available, substituted for the equivalent amount of spare memory.
Interleaving amounts may vary but it is possible that all of the DIMMs controlled by a processor module
may be interleaved together. Providing fully spare memory would necessitate having another processor’s
worth of memory unallocated in such a case.
Where there is insufficient spare memory available, as memory is released within partitions, up to an
entire processor’s worth of memory may eventually become unavailable to the system pending repair of
the failed DIMM.
On a reboot, the memory interleaving will be changed to just deconfigure the minimum number of DIMMs
required to account for the failed DIMM while allowing the rest of the memory controlled by the processor
to be used.
For instance, if eight 128GB DIMMs are attached to a processor socket, they can be grouped as a single
interleave unit. An unrecoverable error (UE) on one of the 8 DIMMs will result in the other seven 128GB
DIMMs being garded at system runtime. All LPARs that are allocated memory in the 1024GB memory
group that contains the UE will experience an outage. Upon a system reboot, the remaining 7 good
memory DIMMs, which had been garded, can be recovered and reallocated to LPARs. A system reboot is
required because the size of the interleave group cannot be adjusted dynamically at runtime.
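
The interleave-group example can be checked with simple arithmetic (a sketch using only the values from the example above):

```python
# Eight 128 GB DIMMs interleaved together form one 1024 GB interleave group.
dimm_gb, dimms_per_group = 128, 8
group_gb = dimm_gb * dimms_per_group
print(group_gb)                          # 1024 GB affected by a UE at runtime

# At runtime the whole group is garded, so seven good DIMMs are unavailable.
unavailable_good_gb = (dimms_per_group - 1) * dimm_gb
print(unavailable_good_gb)               # 896 GB of good memory pending reboot

# After a reboot the interleave is rebuilt around the failed DIMM,
# so only the single bad DIMM remains deconfigured.
print(group_gb - dimm_gb)                # 896 GB recovered and reusable
```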

RAS beyond Processors and Memory


Introduction
Processor and memory differences aside, there are many design considerations and features that
distinguish between scale-out and enterprise systems and between different enterprise systems as well.
One of these is the amount of infrastructure redundancy provided.
In theory, providing redundant components in a system may be thought of as a highly effective way to
avoid outages due to failures of a single component. The care with which redundancy is designed and
implemented plays an important factor on how effective redundancy really is in eliminating failures in a
subsystem.
There are a number of considerations as described below.

Serial Failures, Load Capacity and Wear-out


Nearly any system running enterprise applications will supply redundant power supplies. This power
supply redundancy is typically meant to allow that if one power supply fails, the other can take up the
power load and continue to run. Likewise, most systems are designed to tolerate a failure in fans needed
to cool the system. More sophisticated systems may increase fan speed to compensate for a failing fan,
but a single fan failure itself should not cause a significant cooling issue.
In such a design, it would be expected that a system would not experience an outage due to a single
power supply or fan fault. However, if a power supply fails in a redundant design and the second power
supply should happen to fail before it is repaired, then the system will obviously be down until one or the
other of the supplies is fixed. The expectation is that this would be a rare event.
If the system were incapable of determining that one of a pair of redundant parts had failed, however, then such
an event could be more common. The ability to constantly monitor the health of the secondary component is
therefore essential in a redundant design, but not always easy.
For example, two power supplies may share the load in a system by supplying power to components.
When no power is supplied to a component, that condition is fairly easy to detect. If one power supply
were able to supply some current, but an insufficient amount to carry the load by itself, depending on the
design, the fact that the supply is “weak” may not be detected until the good supply fails.

In a lightly loaded system, it may not even be possible to distinguish between a “weak” supply and one
that is providing no current at all. In some redundant designs for light loads, only one supply may even be
configured to carry the load.
Even when both supplies are operating optimally, if a power supply is not well tested, designed, and
specified to run a system indefinitely on a single power source, it may happen that when the first power
supply fails, the second carries a load that stresses it to the point where it soon also fails.
This kind of failure mode can perhaps be exacerbated by environmental conditions: say the cooling in the
system is not very well designed, so that a power supply runs hotter than it should. If this causes a
failure of the first supply over time, then the back-up supply might not be able to last much longer under
the best of circumstances, and when taking over the load would soon be expected to fail.
As another example, fans can also be impacted if they are placed in systems that provide well for cooling
of the electronic components, but where the fans themselves receive excessively heated air that is a
detriment to the fans' long-term reliability.
Therefore, understanding that excessive heat is one of the primary contributors to component “wear-out”,
IBM requires that even components providing cooling to other components should be protected from
excessive heat.

Common Mode Failures


Still even with well-designed redundancy and elimination of serial failures, and attention to component
cooling, some faults can occur within a sub-system where redundancy is insufficient to protect against
outages.
One such category includes faults that are not detected in the primary source, code issues, and intrinsic
limitations on fail-over capabilities.
Another kind of limitation may be called a common mode failure: for example, when two power supplies
are given the same AC power source. If that source fails, then the system will go down despite the
redundancy. But failures include events besides simple power loss. They can include issues with surges
due to electrical storm activity, short dips in power due to brown-out conditions, or perhaps the transition
to backup power generation.
These incidents will be seen by both supplies in a redundant configuration and all the power supplies
need to be able to withstand transient faults.
As another example, suppose that two supplies end up providing voltage to a common component, such
as a processor module. If any power input to the module were to be shorted to ground, it would be
expected that both supplies would see this fault. Both would have to shut down to prevent an over-current
condition.
In an installed and running system, one would rarely expect to see wires themselves suddenly shorting to
ground. However, if a component such as a cable or card were allowed to be removed or replaced in a
system, without a good design to protect against it, even an experienced person doing that service
activity could cause a short condition.
Also, the failure of certain components may essentially manifest as a short from power to ground.
Something as simple as a capacitor used for electrical noise decoupling could fail in that fashion.
Proper system design would mitigate the impact of these events by having soft-switches, or effectively
circuit breakers isolating key components, especially those that can be hot-plugged.
Ultimately there will be someplace electrically where redundant components come together to provide a
function, and failures there can cause outages.

Fault Detection/Isolation and Firmware and Other Limitations


The effectiveness of redundancy, especially when a failover is required, can depend also on when and
how faults are detected and whether any firmware is required to properly implement a failover. It may not
be possible to cover all events. Further, even when a fault is detected and failover is initiated, defects in

the hardware or firmware design, including subtle timing windows, could impact whether the failover was
successful.
The ability of an organization to inject errors and test failing modes, therefore, is essential in validating a
redundant design strategy.
Power and Cooling Redundancy Details

Power Supply Redundancy


Power supplies in a system broadly refer to components that take utility power from the data center
(typically alternating current, or AC power) and convert it to DC power that is distributed throughout the
system. Power supplies generally supply one DC voltage level (e.g., 12 volts), and voltage regulators may
be used to supply the different voltage levels required by various system components.
Power supply redundancy has two main goals. The first is to make sure that a system can continue to
operate when a power supply fails. This is usually known as power supply redundancy. Conceptually if a
system had four power supplies and could continue to run with one supply failed, that would be
considered n+1 redundancy with 3 supplies needed for operation and 1 redundant supply.
The second goal is to allow for the data center to supply two sources of power into the system (typically
using two power distribution units or PDUs). Should one of these sources fail the system should continue
to operate.
In theory this could be achieved by feeding each power supply with two separate line cords, one from
each PDU. Depending on the design and how one power source failed, however, there could be
scenarios of power supply failure where operation cannot be maintained.
Alternatively, multiple power supplies might be provided. For example, a system may be designed with four
power supplies, E1, E2, E3 and E4 with E1 and E2 designed to connect to one power distribution unit
(PDU) and E3 and E4 to another PDU.
If the system can run with one of the PDUs failing, even if it causes some performance degradation for some
loads, it may be said to also have “line cord redundancy.”
If 4 power supplies are used in this fashion then when a PDU fails, the system will be running with two
power supplies. It may be expected, therefore that this would provide 2+2 or n+2 redundancy. However,
this need not always be the case. The system may be designed to operate when certain power supplies
fail that are supplied to a single PDU, but not when one power supply under each PDU fails. Or it may be
designed to run with two supplies but have performance degradation for certain configurations/loads.
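
A minimal sketch of how such a redundancy policy could be evaluated follows; the supply-to-PDU mapping and the "supplies needed" threshold are assumptions chosen to match the four-supply example, not a statement of how any particular Power system is wired.

```python
# Hypothetical 2+2 configuration: E1/E2 fed from PDU "A", E3/E4 from PDU "B".
SUPPLY_TO_PDU = {"E1": "A", "E2": "A", "E3": "B", "E4": "B"}
SUPPLIES_NEEDED = 2   # assumed minimum number of supplies to carry the full load


def survives(failed_supplies=frozenset(), failed_pdus=frozenset()):
    working = [s for s, pdu in SUPPLY_TO_PDU.items()
               if s not in failed_supplies and pdu not in failed_pdus]
    return len(working) >= SUPPLIES_NEEDED


print(survives(failed_supplies={"E1"}))                       # True: supply redundancy
print(survives(failed_pdus={"A"}))                            # True: line cord redundancy
print(survives(failed_supplies={"E3"}, failed_pdus={"A"}))    # False: one supply lost under each source
```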

Voltage Regulation
There are many different designs that can be used for supplying power to components in a system.
As described above, power supplies may take alternating current (AC) from a data center power source,
and then convert that to a direct current voltage level (DC).
Modern systems are designed using multiple components, not all of which use the same voltage level.
Possibly a power supply can provide multiple different DC voltage levels to supply all the components in a
system. Failing that, it may supply a voltage level (e.g., 12v) to voltage regulators which then convert to
the proper voltage levels needed for each system component (e.g., 1.6 V, 3.3 V, etc.) Such voltage
regulators work to maintain voltage levels within the tight specifications required for the modules they
supply.
Typically, a voltage regulator module (VRM) has some common logic plus a component or set of
components (called converters, channels, or phases). At minimum, a VRM provides one converter (or
phase) that provides the main function of stepped-down voltage, along with some control logic.
Depending on the output load required, however, multiple phases may be used in tandem to provide that
voltage level.
If the number of phases provided is just enough for the load it is driving, the failure of a single phase can
lead to an outage. This can be true even when the 12V power supplies are redundant. Therefore,
additional phases may be supplied to prevent the failure due to a single-phase fault. Additional phases

may also be provided for sparing purposes. The distinction between spare and redundant is that when a
spare phase fails, the system continues to operate without the need to repair. After any spare phases fail,
the failure of a redundant phase will require a service action.
While it is important to provide redundancy to critical components, redundancy within a device such as an adapter or
DIMM may not be necessary if the devices themselves are redundant within the system.
There are other applications for voltage conversion or division that are not as power-demanding as the
applications above: A regulator only used during IPL for initialization of a component for example, or a
memory DIMM or riser which takes supplied voltage and further divides on card for purposes such as
reference voltage or signal termination. Such uses are not included in the discussion of voltage regulation
or voltage regulator modules discussed in the rest of this paper.
POWER9 1 and 2 socket systems as scale-out models do not provide redundant voltage phases. This is
very typical of systems in the scale-out space and often in systems that are considered Enterprise.
Generally speaking, the Power E950 provides an n+1 redundant phase for VRMs feeding the critical
components previously mentioned: the processor, the VRMs going out to the memory riser and VRMs for
certain other components.
The Power E980 systems provides this n+1 redundancy to the equivalent elements (such as to the
processors and to the custom DIMMs) but goes a step further and provides an additional spare phase for
these elements.
Redundant Clocks
Vendors looking to design enterprise class servers may recognize the desirability of maintaining
redundant processor clocks so that any hard failure of a single clock oscillator doesn’t require a system to
go down and stay down until repaired.
The POWER9 scale-up processor is designed with redundant clock inputs. In POWER8 systems a global
clock source for all the processor components was required. Hence two global processor clock cards
were provided in the system control drawer of the systems and a dynamic failover was provided.
In the Power E980 the design has been changed so that the main (run-time) processor core clock needs
to stay synchronized within each CEC drawer (node) rather than the entire system. Hence the processor
clock logic is now duplicated for each drawer in the system.
In this design the clock logic for the processor is now separate from a multi-function clock used for such
components as the PCIe bus. A fault on this multi-function clock, where redundancy is provided, can be
handled through PCIe recovery.
Internal Cache Coherent Accelerators
To accelerate certain computational workloads beyond the traditional computational units in each core,
POWER8 had a facility called the NX unit which provided a cache coherent interface to certain processor
level resources such as a random number generator.
In addition, the Coherent Accelerator Processor Proxy (CAPP) module provided a way to interface to
computational accelerators external to the processor. The CAPP communicated through PCIe adapter
slots through a Coherent Accelerator Processor Interface (CAPI) which allowed the connected adapters,
unlike regular I/O devices, to participate in the system's cache coherency domain through the CAPP.
POWER9 still has an NX unit for certain system resources and support for CAPI (at a 2.0 level to be
consistent with Gen4 PCIe) though the internal design for these has somewhat changed.
It should be noted, however, that as with any I/O adapter design, the RAS characteristics of whatever is attached through the link depend on the design and purpose of the specific accelerator.
Service Processor and Boot Infrastructure
There is more function than just the service processor itself that can be considered part of the service
processor infrastructure in a system.

Systems with a single service processor still have some service processor infrastructure that may be
redundant (for example, two system VPD modules on a VPD card). Still, the Power E980 system provides
the highest degree of service processor infrastructure redundancy among the systems being discussed.
In particular, it should be noted that for a system to boot or IPL, a system needs to have a healthy service
processor, a functioning processor to boot from (using the internal self-boot engine) as well as functioning
firmware.
In systems with a single service processor there is a single processor module and self-boot-engine that
can be used for booting, and a single module used to store the associated firmware images.
In the Power E980, each system has two service processors. Each service processor can use a different
processor module’s self-boot-engine on each node and each of these two processor modules has access
to a different firmware module.
Trusted Platform Module (TPM)
Each of the systems discussed in this paper also incorporates a Trusted Platform Module (TPM) for
support of Secure Boot and related functions. 3 For redundancy purposes, the Power E980 system ships
with two TPMs per node. Other systems have a single TPM.
Within the active service processor ASMI interface, a “TPM required” policy can be enabled. A system or
node will not boot if the system is in secure mode, TPM is required, and no functional TPM is found. Note
that in the Power E1080 systems the same concept of two TPM modules per drawer is still used.
Concerning TPM redundancy, when a TPM is found to be faulty during initial IPL test, the redundant TPM
module can be used. Once a TPM module is chosen for IPL, the other TPM module in the redundant pair
is disabled and cannot be further used. During run-time, systems with multiple drawers may be able to take advantage of the multiple TPMs should one TPM fail. To understand the latest function, the IBM documentation concerning TPM use should be referenced.

I/O Subsystem and VIOS™


The descriptions for each system describe how internal I/O slots and external I/O drawers can be used for
PCIe adapters.
In a PCIe environment, not all I/O adapters require full use of the bandwidth of a bus; therefore, lane
reduction can be used to handle certain faults. For example, in an x16 environment, loss of a single lane,
depending on location, could cause a bus to revert to a x8, x4, x2 or x1 lane configuration in some cases.
This, however, can impact performance.
Not all faults that impact the PCIe I/O subsystem can be handled by lane reduction alone. When evaluating I/O availability, it is important to consider all faults along the I/O path that can cause loss of I/O adapter function, not just faults within the adapter itself.
Power Systems are designed with the intention that traditional I/O adapters will be used in a redundant fashion. In the case discussed earlier where two systems are clustered together, the LAN adapters used to communicate between the systems and the SAN adapters would all be redundant.
In a single 1s or 2s system, that redundancy would be achieved by having one of each type of adapter physically plugged into a PCIe slot controlled by one processor, and the other into a slot controlled by another.
The software communicating with the SAN would take care of the situation that one logical SAN device
might be addressed by one of two different I/O adapters.
For LAN communicating heart-beat messages, both LAN adapters might be used, but messages coming
from either one would be acceptable, and so forth.

3 https://www.ibm.com/support/knowledgecenter/en/POWER9/p9ia9/p9ia9_signatures_keys.htm

With this configuration, when there is a fault impacting an adapter, the redundant adapter can take over. If
there is a fault impacting the communication to a slot from a processor, the other processor would be
used to communicate to the other I/O adapter.
The error handling throughout the I/O subsystem from processor PCIe controller to I/O adapter is
intended so that when a fault occurs anywhere on the I/O path, the fault can be contained to the
partition(s) using that I/O path.
Furthermore, PowerVM supports the concept of I/O virtualization with VIOS™ so that I/O adapters are
owned by I/O serving partitions. A user partition can access redundant I/O servers so that if one fails
because of an I/O subsystem issue, or even a software problem impacting the server partition, the user
partition with redundancy capabilities as described should continue to operate.
This End-to-End approach to I/O redundancy is a key contributor to keeping applications operating in the
face of practically any I/O adapter problem. This concept is illustrated below using a figure first published
in the POWER8 RAS whitepaper.

Figure 34: End-to-End I/O Redundancy
[Diagram: physical redundancy throughout the hardware (DCMs, processor chips, PCIe controllers, PHBs, PCIe switches, and LAN/SAN adapter pairs) combined with I/O virtualization through two Virtual I/O Servers provides protection against outages at each point in the I/O path.]

PCIe Gen3 Expansion Drawer Redundancy


As described elsewhere, the optically connected PCIe Gen3 I/O expansion drawer provides significant RAS features, including redundant fans/power supplies and independently operating I/O modules. Certain components, such as the mid-plane, will require that the drawer be powered off during repair, which could potentially impact the operation of both I/O modules.
For the highest level of redundancy, it is recommended that redundant adapter pairs be connected to
separate I/O drawers, and these separate I/O drawers be connected to different processor modules
where possible.

Figure 35: Maximum Availability with Attached I/O Drawers
[Diagram: redundant LAN and SAN adapter pairs placed in I/O modules of two separate PCIe Gen3 I/O expansion drawers, with each drawer attached to a different processor module and each adapter pair served by a different Virtual I/O Server. The I/O drawer is built with considerable ability to isolate failures across I/O modules and with significant infrastructure redundancy; availability is still maximized by using redundant I/O drawers.]

Planned Outages
Unplanned outages of systems and applications are typically very disruptive to applications. This is
certainly true of systems running standalone applications, but is also true, perhaps to a somewhat lesser
extent, of systems deployed in a scaled-out environment where the availability of an application does not
entirely depend on the availability of any one server. The impact of unplanned outages on applications in
both such environments is discussed in detail in the next section.
Planned outages, where the end-user picks the time and place where applications must be taken off-line
can also be disruptive. Planned outages can be of a software nature – for patching or upgrading of
applications, operating systems, or other software layers. They can also be for hardware, for
reconfiguring systems, upgrading or adding capacity, and for repair of elements that have failed but have
not caused an outage because of the failure.
If all hardware failures required planned downtime, then the downtime associated with planned outages in
an otherwise well-designed system would far-outpace outages due to unplanned causes.
While repair of some components cannot be accomplished with workload actively running in a system,
design capabilities to avoid other planned outages are characteristic of systems with advanced RAS
capabilities. These may include:

Updating Software Layers


Maintaining updated code levels up and down the software stack may avoid risks of unplanned outages
due to code bugs. However, updating code can require planned outages of applications, partitions, or
entire systems.
Generally, systems are designed to allow a given level of “firmware” to be updated in the code used by
service processors, the PowerVM hypervisor and other areas of system hardware, without needing an
outage, though exceptions can occur.

Migrating from one firmware level to another, where a level provides new function, is not supported
dynamically.
Dynamic updating of hypervisors other than the PowerVM hypervisor and of operating systems and
applications depend on the capabilities of each such software layer.

Concurrent Repair
When redundancy is incorporated into a design, it is often possible to replace a component in a system
without taking the entire system down.
As examples, Enterprise Power Systems support concurrently removeable and replaceable elements
such as power supplies and fans.
In addition, Enterprise Power Systems as well as POWER9 processor-based 1s and 2s systems support
concurrently removing and replacing I/O adapters according to the capabilities of the OS and
applications.

Integrated Sparing
As previously mentioned, to reduce replacements for components that cannot be removed and replaced
without taking down a system, Power Systems strategy includes the use of integrated spare components
that can be substituted for failing ones.

Clustering and Cloud Support
PowerHA SystemMirror
IBM Power Systems running under PowerVM and AIX™ and Linux support a spectrum of clustering solutions. These solutions are designed to meet requirements not only for application availability with regard to server outages, but also for data center disaster management, reliable data backups, and so forth.
These offerings include distributed applications such as with db2 pureScale™, HA solutions using
clustering technology with PowerHA™ SystemMirror™ and disaster management across geographies
with PowerHA SystemMirror Enterprise Edition™.
It is beyond the scope of this paper to discuss the details of each of the IBM offerings or other clustering
software, especially considering the availability of other material.

Live Partition Mobility


However, Live Partition Mobility (LPM), available for Power Systems running PowerVM Enterprise Edition,
will be discussed here in particular with reference to its use in managing planned hardware outages.
LPM is a technique that allows a partition running on one server to be migrated dynamically to another
server.

Figure 36: LPM Minimum Configuration
[Diagram: two servers, each with computational hardware, virtualized I/O servers, LAN adapters, Fibre Channel adapters, and a service processor, connected over a private LAN and to shared storage on a SAN, with a single Hardware Management Console (HMC) managing both.]

In simplified terms, LPM typically works in an environment where all the I/O from one partition is
virtualized through PowerVM and VIOS and all partition data is stored in a Storage Area Network (SAN)
accessed by both servers.
To migrate a partition from one server to another, a partition is identified on the new server and
configured to have the same virtual resources as the primary server including access to the same logical
volumes as the primary using the SAN.
When an LPM migration is initiated on a server for a partition, PowerVM begins the process of
dynamically copying the state of the partition on the first server to the server that is the destination of the
migration.
Thinking in terms of using LPM for hardware repairs, if all the workloads on a server are migrated by LPM
to other servers, then after all have been migrated, the first server could be turned off to repair
components.
LPM can also be used for firmware upgrades, for adding hardware to a server when the hardware cannot be added concurrently, and for software maintenance within individual partitions.

When LPM is used, while there may be a short time when applications are not processing new workload, the applications do not fail or crash and do not need to be restarted. Roughly speaking then, LPM allows planned outages to occur on a server without suffering the downtime that would otherwise be required.

Minimum Configuration
For detailed information on how LPM can be configured, the following references may be useful: the IBM Redbook IBM PowerVM Virtualization Introduction and Configuration4 as well as the document Live Partition Mobility5.
In general terms, LPM requires that both the system containing the partition to be migrated and the system it is being migrated to have a local LAN connection using a virtualized LAN adapter. In addition, LPM requires
that all systems in the LPM cluster be attached to the same SAN. If a single HMC is used to manage both
systems in the cluster, connectivity to the HMC also needs to be provided by an Ethernet connection to
each service processor.
The LAN and SAN adapters used by the partition must be assigned to a Virtual I/O Server, and the partition's access to these would be by virtual LAN (vLAN) and virtual SCSI (vSCSI) connections within the partition to the VIOS.
I/O Redundancy Configurations and VIOS
LPM connectivity in the minimum configuration discussion is vulnerable to a number of different hardware
and firmware faults that would lead to the inability to migrate partitions. Multiple paths to networks and
SANs are therefore recommended. To accomplish this, Virtual I/O servers (VIOS) can be used.
VIOS as an offering for PowerVM virtualizes I/O adapters so that multiple partitions will be able to utilize
the same physical adapter. VIOS can be configured with redundant I/O adapters so that the loss of an
adapter does not result in a permanent loss of I/O to the partitions using the VIOS.
Externally to each system, redundant hardware management consoles (HMCs) can be utilized for greater
availability. There can also be options to maintain redundancy in SANs and local network hardware.

Figure 37: I/O Infrastructure Redundancy
[Diagram: the LPM configuration of Figure 36 expanded with redundant LAN adapters, redundant Fibre Channel adapters, redundant Virtual I/O Servers, redundant private LAN connections, redundant paths to the shared SAN storage, and redundant Hardware Management Consoles.]

The figure above generally illustrates multi-path considerations within an environment optimized for LPM.
Within each server, this environment can be supported with a single VIOS. However, if a single VIOS is
used and that VIOS terminates for any reason (hardware or software caused) then all the partitions using
that VIOS will terminate.

4 Mel Cordero, Lúcio Correia, Hai Lin, Vamshikrishna Thatikonda, Rodrigo Xavier, IBM PowerVM Virtualization Introduction and Configuration, Sixth Edition, June 2013
5 IBM, Live Partition Mobility, 2018, ftp://ftp.software.ibm.com/systems/power/docs/hw/p9/p9hc3.pdf

Using redundant VIOS servers would mitigate that risk. There is a caution, however, that LPM cannot migrate a partition from one system to another when the partition is defined to use a virtual adapter from a VIOS and that VIOS is not operating. Maintaining redundancy of adapters within each VIOS, in addition to having redundant VIOS, will avoid most faults that keep a VIOS from running. Where redundant VIOS are used, it should also be possible to remove the vscsi and vlan connections to a failed VIOS in a partition before migration, to allow the migration to proceed using the remaining active VIOS in a non-redundant configuration.

Figure 38: Use of Redundant VIOS
[Diagram: a logical partition with vscsi and vlan connections to two Virtual I/O Servers, each of which has its own redundant pair of LAN and Fibre Channel adapters.]
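As a rough back-of-the-envelope illustration of the value of this redundancy, the sketch below compares the expected time per year that a client partition has no working virtual I/O path with a single VIOS versus a redundant pair, assuming the two VIOS fail independently; the per-VIOS availability figure used is an arbitrary placeholder, not a measured value.

def io_path_unavailability(vios_availability, redundant=True):
    # Fraction of time a client partition has no working virtual I/O path,
    # assuming the Virtual I/O Servers fail independently of one another.
    u = 1.0 - vios_availability
    return u * u if redundant else u

MINUTES_PER_YEAR = 365 * 24 * 60
for label, redundant in (("single VIOS", False), ("redundant VIOS pair", True)):
    u = io_path_unavailability(0.9995, redundant)   # 99.95% per-VIOS availability (placeholder)
    print(f"{label}: ~{u * MINUTES_PER_YEAR:.2f} minutes per year with no I/O path")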

Since each VIOS can largely be considered an AIX-based partition, each VIOS also needs the ability to access a boot image, paging space, and so forth under a root volume group (rootvg). The rootvg can be accessed through a SAN, the same as the data that partitions use. Alternatively, a VIOS can use storage locally attached to a server, either DASD devices or SSD drives such as the internal NVMe drives provided for the Power E980 and Power E950 systems. For best availability, the rootvgs should use mirrored or other appropriate RAID drives with redundant access to the devices.
PowerVC™ and Simplified Remote Restart
PowerVC is an enterprise virtualization and cloud management offering from IBM that streamlines virtual
machine deployment and operational management across servers. The IBM Cloud PowerVC Manager
edition expands on this to provide self-service capabilities in a private cloud environment; IBM offers a
Redbook that describes these capabilities; as of the time of this writing, IBM PowerVC Version 1.3.2 Introduction and Configuration6 describes this offering in considerable detail.
Deploying virtual machines on systems with the RAS characteristics previously described will best
leverage the RAS capabilities of the hardware in a PowerVC environment. Of interest in this availability
discussion is that PowerVC provides a virtual machine remote restart capability, which provides a means
of automatically restarting a VM on another server in certain scenarios (described below).
Systems with a Hardware Management Console (HMC) may also choose to leverage a simplified remote
restart capability (SRR) using the HMC.
Error Detection in a Failover Environment
The conditions under which a failover is attempted are important when talking about any sort of failover scenario. Some remote restart capabilities, for example, operate only after an error management system, e.g., an HMC, reports that a partition is in an Error or Down state.

6 IBM PowerVC Version 1.3.2 Introduction and Configuration, International Technical Support Organization, Javier Bazan Lazcano and Martin Parrella, January 2017

This alone might miss hardware faults that just terminate a single application or impact the resources that
an application uses, without causing a partition outage. Operating system “hang” conditions would also
not be detected.
In contrast, PowerHA leverages a heartbeat within a partition to determine when a partition has become unavailable. This allows for fail-over in cases where a software or other cause prevents a partition from making forward progress, even if no error is recorded.
For the highest level of application availability, the application itself might need to leverage some sort of heartbeat mechanism to determine when the application is hung or unable to make forward progress.

Section 4: Reliability and Availability in the Data Center
The R, A and S of RAS

Introduction
All of the previous sections in this document discussed server specific RAS features and options. This
section looks at the more general concept of RAS as it applies to any system in the data center. The goal
is to briefly define what RAS is and look at how reliability and availability are measured. It will then
discuss how these measurements may be applied to different applications of scale-up and scale-out
servers.

RAS Defined
Mathematically, reliability is defined in terms of how infrequently something fails.
At a system level, availability is about how infrequently failures cause workload interruptions. The longer
the interval between interruptions, the more available a system is.
Serviceability is all about how efficiently failures are identified and dealt with, and how application outages
are minimized during repair.
Broadly speaking, systems can be categorized as "scale-up" or "scale-out" depending on the impact to applications or workload of a system being unavailable.
True scale-out environments typically spread workload among multiple systems so that the impact of a
single system failing, even for a short period of time is minimal.
In scale-up systems, the impact of a server taking a fault, or even a portion of a server (e.g., an individual
partition) is significant. Applications may be deployed in a clustered environment so that extended
outages can in a certain sense be tolerated (e.g., using some sort of fail-over to another system) but even
the amount of time it takes to detect the issue and fail-over to another device is deemed significant in a
scale-up system.
Reliability Modeling
The prediction of system level reliability starts with establishing the failure rates of the individual
components making up the system. Then using the appropriate prediction models, the component level
failure rates are combined to give us the system level reliability prediction in terms of a failure rate.
In literature, however, system level reliability is often discussed in terms of Mean Time Between Failures (MTBF) for repairable systems rather than a failure rate. For example, a 50-year MTBF may suggest that a system will run 50 years between failures, but it really means that, across a large population of identical systems, on average one system in 50 will fail in a given year.
The following illustration explains roughly how to bridge from individual component reliability to system
reliability terms with some rounding and assumptions about secondary effects:

Figure 39: Rough Reliability Cheat Sheet*
Ideally, systems are composed of multiple parts, each with a very small failure rate that may vary over time. For example, 10 parts in 100,000 may fail in the first quarter (early life failures) with the rate then tailing off, roughly 1 part in 100,000 may fail per quarter for some time (the steady state failure rate), followed by an increasing rate of failures until the system component in use is retired (the wear-out rate). This behavior is typically described as a bathtub curve.
FIT stands for Failure In Time, where 1 FIT = 1 failure per billion hours of operation. If 1 year is rounded up to 10,000 hours, then roughly 100 FITs equates to a failure rate of 1 failure per 1000 systems per year. Mean Time Between Failures is the inverse of the failure rate, so 100 FITs equates to roughly a 1000-year MTBF.
When component FITs are small, the system failure rate is approximately the sum of the FIT rates of its components. For example, components with FIT rates of 100 + 50 + 70 + 30 + 200 + 50 + 30 + 170 + 50 + 250 sum to 1,000 FITs for the system, or about 10 failures per 1000 systems per year, which equates to a 100-year MTBF.
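The arithmetic in the cheat sheet is easy to reproduce; the sketch below simply sums the figure's example component FIT rates and converts the result to failures per 1000 systems per year and to an MTBF, rounding a year to 10,000 hours as the figure does.

# Example component failure rates in FITs, taken from the figure (1 FIT = 1 failure per billion hours)
component_fits = [100, 50, 70, 30, 200, 50, 30, 170, 50, 250]

system_fits = sum(component_fits)          # valid approximation when individual FITs are small
HOURS_PER_YEAR = 10_000                    # one year rounded up, as in the figure

failures_per_system_year = system_fits * 1e-9 * HOURS_PER_YEAR
mtbf_years = 1 / failures_per_system_year  # MTBF is the inverse of the failure rate

print(f"System failure rate: {system_fits} FITs "
      f"= {failures_per_system_year * 1000:.0f} failures per 1000 systems per year")
print(f"System MTBF: {mtbf_years:.0f} years")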

Different Levels of Reliability


When a component fails, the impact of that failure can vary depending on the component.
A power supply failing in a system with a redundant power supply will need to be replaced. By itself, however, a failure of a single power supply should not cause a system outage and should lead to a concurrent repair with no downtime needed for the replacement.
There are other components in a system that might fail and cause a system-wide outage where concurrent repair is not possible.
Therefore, it is typical to talk about different MTBF numbers.
For example:
MTBF – Resulting Repair Actions
MTBF – That require concurrent repair

MTBF – That require a non-concurrent repair
MTBF – Resulting in an unplanned application outage
MTBF – Resulting in an unplanned system outage
Scale-up systems may invest in a long MTBF for unplanned outages even if it means more frequent repair actions.
Costs and Reliability

Service Costs
It is common for software costs in an enterprise to include the cost to acquire or license code and a
recurring cost for software maintenance. There is also a cost associated with acquiring the hardware and
a recurring cost associated with hardware maintenance – primarily fixing systems when components
break.
Individual component failure rates, the cost of the components, and the labor costs associated with repair
can be aggregated at the system level to estimate the direct maintenance costs expected for a large
population of such systems.
The more reliable the components the less the maintenance costs should be.
Design for reliability can help not only with maintenance cost but, in some cases, with the initial cost of a system as well. For example, designing a processor with extra cache capacity, spare data lanes on a bus, or similar resources can make it easier to yield good processors at the end of the manufacturing process, since an entire module need not be discarded due to a small flaw.
At the other extreme, designing a system with an entire spare processor socket could significantly decrease maintenance cost by not having to replace anything in a system should a single processor module fail. However, every system would incur the cost of the spare processor in return for avoiding a repair in the small proportion of systems that would ever need one. This is usually not justified from a system cost perspective; rather, it is better to invest in greater processor reliability.
On scale-out systems redundancy is generally only implemented on items where the cost is relatively low,
and the failure rates expected are relatively high – and in some cases where the redundancy is not
complete. For example, power supplies and fans may be considered redundant in some scale-out
systems because when one fails, operation will continue. However, depending on the design, when a component fails, the remaining fans may have to run faster and performance may be throttled until the repair.
On scale-up systems redundancy that might even add significantly to maintenance costs is considered
worthwhile to avoid indirect costs associated with downtime, as discussed below.

End User Costs


Generally, of greater concern are the indirect costs that end users will incur whenever a service action
needs to be taken. The costs are the least when a component fails and can be concurrently replaced, and
the highest when a component fails such that the system goes down and must stay down until repaired.
The indirect cost typically depends on the importance of the workloads running in the system and what
sort of mechanisms exist to cope with the fault. If an application is distributed across multiple systems and
there is little to no impact to the application when a single system goes down, then the cost is relatively
low and there is less incentive to invest in RAS. If there is some application downtime in such an
environment, or at least a reduction in workload throughput, then the cost associated with the downtime
can be quantified and the need for investing in greater RAS can be weighed against those costs.
The previous section discussed how the highest levels of availability are typically achieved even in scale-
up environments by having multiple systems and some means of failing over to another in the event of a
problem. Such failovers, even when relatively short duration can have costs. In such cases, investing in
RAS at the server level can pay particular dividends.

When a failover solution is not employed, the impact to workload is obvious. Even in cases where failover
is enabled, not every application may be considered critical enough to have an automated failover means.
The impact of a prolonged service outage on these applications can be significant.
It is also significant to note that even when plans are put into place for automated failover when a fault
occurs, there are times when failover does not happen as smoothly as expected. These are typically due
to multiple factors: unavailability of the backup system, unintended code level mismatches, insufficient testing, and so forth.
Perhaps the less reliable the system, the more often the failover mechanisms will be exercised, but also the more likely it is that some incident will occur during a failover, leading to an extended outage. This suggests that investing in both hardware reliability and failover testing can be beneficial.
Measuring Availability

Measuring Availability
Mathematically speaking, availability is often expressed as a percentage of the time something is
available or in use over a given period of time. An availability number for a system can be calculated from the expected reliability of the system so long as both the mean time between failures and the duration of each outage are known.
For example, consider a system that always runs exactly one week between failures and is down for 10 minutes each time it fails. For the 168 hours in a week, the system is down 10/60 of an hour and up 168 - (1/6) hours. As a percentage of the hours in the week, the system is ((168 - (1/6)) / 168) x 100%, or about 99.9%, available.
99.999% available means approximately 5.3 minutes down in a year. On average, a system that failed
once a year and was down for 5.3 minutes would be 99.999% available. This is often called 5 9’s of
availability.
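A minimal sketch of this arithmetic, treating availability as steady-state MTBF divided by MTBF plus the mean downtime per failure, and reproducing the two examples above:

MINUTES_PER_YEAR = 365.25 * 24 * 60

def availability(mtbf_years, minutes_down_per_failure):
    # Steady-state availability: uptime divided by uptime plus downtime.
    uptime_minutes = mtbf_years * MINUTES_PER_YEAR
    return uptime_minutes / (uptime_minutes + minutes_down_per_failure)

# The weekly example from the text: one 10-minute outage per week (MTBF of one week)
print(f"{availability(7 / 365.25, 10):.3%}")    # roughly 99.9%
# One failure per year with about 5.3 minutes of downtime: five 9's
print(f"{availability(1, 5.3):.5%}")            # roughly 99.999%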
When talking about modern server hardware availability, short weekly failures like those in the example above are not the norm. Rather, the failure rates are much lower and the mean time between failures (MTBF) is
often measured in terms of years – perhaps more years than a system will be kept in service.
Therefore, when a MTBF of 10 years, for example, is quoted, it is not expected that on average each
system will run 10 years between failures. Rather it is more reasonable to expect that on average, in a
given year, one server out of ten will fail. If a population of ten servers always had exactly one failure a
year, a statement of 99.999% availability across that population of servers would mean that the one
server that failed would be down about 53 minutes when it failed.
In theory 5 9’s of availability can be achieved by having a system design which fails frequently, multiple
times a year, but whose failures are limited to very small periods of time. Conversely 5 9’s of available
might mean a server design with a very large MTBF, but where a given server takes a fairly long time to
recover from the very rare outage.

Figure 40: Different MTBFs but same 5 9's of availability
[Chart: failures per 100 systems in a year plotted against the average time down per failure (in minutes), for mean times between failures ranging from 1 to 100 years, with every combination yielding 5 9's of availability.]

The figure above shows that 5 9’s of availability can be achieved with systems that fail frequently for
miniscule amounts of time, or very infrequently with much larger downtime per failure.
The figure is misleading in the sense that servers with low reliability are likely to have many components
that, when they fail, take the system down and keep the system down until repair. Conversely servers
designed for great reliability often are also designed so that the systems, or at least portions of the
system can be recovered without having to keep a system down until repaired.
Hence, in practice, systems with low MTBF also tend to have longer repair times, and a system with 5 9's of availability is therefore usually synonymous with a high level of reliability.
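The trade-off in the figure can be restated numerically: for a fixed five-9s target, the downtime that can be tolerated per failure grows in direct proportion to the MTBF. A short sketch of that relationship:

MINUTES_PER_YEAR = 365.25 * 24 * 60
TARGET = 0.99999                                              # five 9's

downtime_budget_per_year = (1 - TARGET) * MINUTES_PER_YEAR    # about 5.3 minutes per year

for mtbf_years in (1, 2, 5, 10, 20, 30, 50, 100):
    # With one failure every mtbf_years on average, the yearly downtime budget
    # scales up to this much allowable downtime per failure.
    per_failure = downtime_budget_per_year * mtbf_years
    print(f"MTBF {mtbf_years:>3} years -> up to {per_failure:6.1f} minutes down per failure")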
However, in quoting an availability number, there needs to be a good description of what is being quoted.
Is it only concerning unplanned outages that take down an entire system? Is it concerning just hardware
faults, or are firmware, OS and application faults considered?
Are applications even considered? If they are, if multiple applications are running on the server, is each
application outage counted individually? Or does one event causing multiple application outages count as
a single failure?
If there are planned outages to repair components either delayed after an unplanned outage, or
predictively, is that repair time included in the unavailability time? Or are only unplanned outages
considered?
Perhaps most importantly, when reading that a certain company achieved 5 9's of availability for an application, it matters whether that number counted application availability running in a standalone environment, or whether it was a measure of application availability in systems that had a failover capability.

Contributions of Each Element in the Application Stack


When looking at application availability it is apparent that there are multiple elements that could fail and
cause an application outage. Each element could have a different MTBF and the recovery time for
different faults can also be different.
When an application crashes, the recovery time is typically just the amount of time it takes to detect that
the application has crashed, recover any data if necessary to do so, and restart the application.

When an operating system crashes and takes an application down, the recovery time includes all the
above, plus the time it takes for the operating system to reboot and be ready to restart the application.
An OS vendor may be able to estimate a MTBF for OS panics based on previous experience. The OS
vendor, however, can’t really express how many 9s of availability will result for an application unless the
OS vendor really knows what application a customer is deploying, and how long its recovery time is.
Even more difficulty can arise with calculating application availability due to the hardware.
For example, suppose a processor has a fault. The fault might involve any of the following:
1. Recovery or recovery and repair that causes no application outage.
2. An application outage and restart but nothing else
3. A partition outage and restart.
4. A system outage where the system can reboot and recover, and the failing hardware can
subsequently be replaced without taking another outage
5. Some sort of an outage where reboot and recovery is possible, but a separate outage will
eventually be needed to repair the faulty hardware.
6. A condition that causes an outage, but recovery is not possible until the failed hardware is
replaced; meaning that the system and all applications running on it are down until the
repair is completed.
The recovery times for each of these incidents are typically progressively longer, with the final case very
dependent on how quickly replacement parts can be procured and repairs completed.
Figure 41 is an example with hypothetical failure rates and recovery times for the various situations mentioned above, looking at a large population of standalone systems, each running a single application.

Figure 41: Hypothetical Standalone System Availability Considerations

Average recovery time per element (minutes):
  Application restart and recovery: 7
  Reboot of OS: 4
  Reboot of hypervisor and re-establishment of partitions: 5
  Hardware "crash" and reboot to the point of being able to load the hypervisor, including fault diagnostic time: 10
  Time to repair hardware if the part is already identified and available and the repair is not concurrent: 30
  Time to acquire failed parts and have them at the system, ready to begin repair: 180

Outage Reason (recovery elements involved)                       MTBF (years)  Recovery (min/incident)  Minutes Down/Year  Associated Availability
Fault limited to application (application restart)                     3              7.00                   2.33            99.99956%
Fault causing OS crash (OS + application)                              10             11.00                   1.10            99.99979%
Fault causing hypervisor crash (hypervisor + OS + application)         80             16.00                   0.20            99.99996%
Fault impacting system (crash) but system recovers on reboot
  with enough resources to restart application
  (hardware reboot + hypervisor + OS + application)                    80             26.00                   0.33            99.99994%
Planned hardware repair for hw fault, where the initial fault
  impact could be any of the above
  (repair + hardware reboot + hypervisor + OS + application)           70             56.00                   0.80            99.99985%
Fault impacting system where application is down until system
  is repaired (parts + repair + hardware reboot + hypervisor
  + OS + application)                                                 500            236.00                   0.47            99.99991%
Total for application from all causes                                                                         5.23            99.99901%

This is not intended to represent any given system. Rather, it is intended to illustrate how different outages have different impacts. An application crash only requires that the crash be discovered and the application restarted, so its recovery time is just the 7 minutes of application restart and recovery time.
If an application is running under an OS and the OS crashes, then the total recovery time must include the time it takes to reboot the OS plus the time it takes to detect the fault and recover the application after the OS reboots. In the example, the total recovery time for an OS crash is therefore 11 minutes (4 minutes to recover the OS and 7 for the application).
The worst-case scenario, as described in the previous section, is a case where the fault causes a system to go down and stay down until repaired. In the example, where every recovery element applies, that would mean 236 minutes of recovery for each such incident.
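The figure's totals can be reproduced with a short script; the MTBF values and per-incident recovery times below are the hypothetical ones from Figure 41, not measurements of any particular system.

MINUTES_PER_YEAR = 525_960     # 365.25 days

# (outage reason, MTBF in years, total recovery minutes per incident) from Figure 41
scenarios = [
    ("Fault limited to application",            3,   7.0),   # application restart only
    ("Fault causing OS crash",                 10,  11.0),   # 4 + 7
    ("Fault causing hypervisor crash",         80,  16.0),   # 5 + 4 + 7
    ("System crash, recovers on reboot",       80,  26.0),   # 10 + 5 + 4 + 7
    ("Planned repair after a fault",           70,  56.0),   # 30 + 10 + 5 + 4 + 7
    ("Down until repaired (parts needed)",    500, 236.0),   # 180 + 30 + 10 + 5 + 4 + 7
]

total_down = 0.0
for reason, mtbf_years, recovery_minutes in scenarios:
    down_per_year = recovery_minutes / mtbf_years
    total_down += down_per_year
    print(f"{reason:38s} {down_per_year:5.2f} minutes down per year")

print(f"Total: {total_down:.2f} minutes/year "
      f"-> {(1 - total_down / MINUTES_PER_YEAR):.5%} available")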
In the hypothetical example, numbers were chosen to illustrate 5 9s of availability across a population of systems.
This required the worst-case outage scenarios to be extremely rare compared to the application-only outages.
In addition, the example presumed that:
1. All the software layers can recover reasonably efficiently even from entire system
crashes.
2. There were no more than a reasonable number of applications driven and operating
system driven outages.
3. A very robust hypervisor is used, expecting it to be considerably more robust than the
application hosting OS.
4. Exceptionally reliable hardware is used. (The example presumes about a 70-year MTBF for hardware faults.)

5. Hardware that can be repaired efficiently, using concurrent repair techniques for most of
the faults.
6. As previously mentioned, the system design is intended that few faults exist that could
keep a system down until repaired. In the rare case that such a fault does occur it
presumes an efficient support structure that can rapidly deliver the failed part to the
failing system and efficiently make the repair.
7. A method of ensuring quick restart of a system after hardware outages, which might impact the ability to do extensive fault analysis.
It must also be stressed that the example only looks at the impact of hardware faults that caused some
sort of application outage. It does not deal with outages for hardware or firmware upgrades, patches, or
repairs for failing hardware that have not caused outages.
The ability to recover from outages may also assume that some failed hardware may be deconfigured to
allow the application to be restarted in cases where restart is possible. The system would need to be
configured in such a way that there are sufficient resources after a deconfiguration for the application to
restart and perform useful work.
The hypothetical example also presumes no significant time to diagnose faults or plan hardware repairs. It may be more typical that some significant time is taken to collect extended error data concerning a failure. These numbers assume that the choice is made not to collect such data and that the built-in error detection and fault isolation capabilities are generally sufficient to isolate the fault to a fault domain. It is also common when talking about availability to not consider any planned outages for repair.

Critical Application Simplification


Under a single operating system instance, it is possible to run multiple applications of various kinds,
though typically only one important application is deployed. Likewise, in a system with multiple operating
system partitions, it is possible to run multiple applications using multiple partitions.
To calculate the availability of each such application, throughout an entire system, calculations would
have to change to account for the number of partitions and the outage time associated with each for each
fault that could be experienced. The aggregate availability percentage would represent an aggregation of
many different applications, not all of equal importance.
Therefore, in the examples in this section, a simplifying assumption is made that each server is running a single application of interest to availability calculations. In other words, the examples look at the availability of a critical application, presuming one such application per system.
Measuring Application Availability in a Clustered Environment
It should be evident that clustering can have a big impact on application availability since, if recovery time
after an outage is required for the application, the time for nearly every outage can be reliably predicted
and limited to just that “fail-over” time.
Figure 42 shows what might be achieved with the earlier proposed hypothetical enterprise hardware example in such a clustered environment.

Figure 42: Ideal Clustering with Enterprise-Class Hardware Example

Time to detect outage: 1 minute
Time required to restart/recover application: 5 minutes

Outage Reason                                                    MTBF (years)  Recovery (min/incident)  Minutes Down/Year  Associated Availability
Fault limited to application                                           3              6.00                   2.00            99.99962%
Fault causing OS crash                                                 10             6.00                   0.60            99.99989%
Fault causing hypervisor crash                                         80             6.00                   0.08            99.99999%
Fault impacting system (crash) but system recovers on reboot
  with enough resources to restart application                         80             6.00                   0.08            99.99999%
Planned hardware repair for hw fault (where initial fault
  impact could be any of the above)                                    70             6.00                   0.09            99.99998%
Fault impacting system where application is down until
  system is repaired                                                  500             6.00                   0.01           100.00000%
Total for all outage reasons                                                                                 2.85            99.99946%

Similar to the single-system example, it shows the unavailability associated with various failure types.
However, it presumes that application recovery occurs by failing over from one system to another. Hence
the recovery time for any of the outages is limited to the time it takes to detect the fault and fail-over and
recover on another system. This minimizes the impact of faults that in the standalone case, while rare,
would lead to extended application outages.
The example suggests that fail-over clustering can extend availability beyond what would be achieved in
the standalone example.

Recovery Time Caution


The examples given presume that it takes about six minutes to recover from any system outage. This
may be realistic for some applications, but not for others. As previously shown, doubling the recovery time
has a correspondingly large impact on the availability numbers.
Whatever the recovery time associated with such a fail-over event is for a given application needs to be
very well understood. Such a HA solution is really no HA solution at all if a service level agreement
requires that no outage exceed, for example, 10 minutes, and the HA fail-over recovery time is 15
minutes.

Clustering Infrastructure Impact on Availability


A clustering environment might be deployed where a primary server runs with everything it needs under
the covers, including maintaining all the storage, data and otherwise, within the server itself. This is
sometimes referred to as a “nothing shared” clustering model.
In such a case, clustering involves copying data from one server to the backup server, with the backup
server maintaining its own copy of the data. The two systems communicate by a LAN and redundancy is
achieved by sending data across the redundant LAN environment.

It might be expected in such a case that long outages would only happen in the relatively unlikely
scenarios where both servers are down simultaneously, both LAN adapters are down simultaneously, or
there is a bug in the failover itself.
Hardware, software and maintenance practices must work together to achieve the desired high
availability level.
The “nothing shared” scenario above does not automatically support the easy migration of data from one
server to another for planned events that don’t involve a failover.
An alternative approach makes use of shared common storage such as a storage area network (SAN), where each server has access to and only makes use of the shared storage.

Real World Fail-over Effectiveness Calculations


The previous examples also presume that such high availability solutions are simple enough to implement that they can be used for all applications, and that fail-over, when initiated, always works.
In the real world that may not always be the case. As previously mentioned, if the secondary server is
down for any reason when a failover is required, then failover is not going to happen and the application
in question is going to be down until either the primary or secondary server is restored. The frequency of
such an event happening is directly related to the underlying availability characteristics of each individual
server – the more likely the primary server is to fail, the more likely it needs to have the backup available
and the more likely the backup server is to be down, the less likely it will be available when needed.
Any necessary connections between the two must also be available, and that includes any shared
storage.
It should be clear that if the availability of an application is to meet 5 9s of availability, then shared storage
must have better than 5 9s of availability.
Different kinds of SAN solutions may have enough redundancy of hard drives for greater than 5 9s availability of the data on the drives, but may fall short in providing availability of the I/O and control paths that give access to the data.
The most highly available SAN solutions may require redundant servers under the covers, using their own form of fail-over clustering to achieve the availability of the SAN.
Advanced HA solutions may also take advantage of dual SANs to help mitigate against SAN outages,
and support solutions across multiple data centers. This mechanism effectively removes the task of
copying data from the servers and puts it on the SAN.
Such mechanisms can be very complicated, both in the demands on the hardware and in the controller software.
The role of software layers in cluster availability is crucial, and it can be quite difficult to verify that all the
software deployed is bug-free. Every time an application fails, it may fail at a different point. Ensuring
detection of every hardware fault and data-lossless recovery of an application for every possible failure is
complicated by the possibility of exercising different code paths on each failure. This difficulty increases
the possibility of latent defects in the code. There are also difficulties in making sure that the fail-over
environment is properly kept up to date and in good working order with software releases that are
consistently patched and known to be compatible.
To verify proper working order of a fail-over solution as regards to all of the above, testing may be
worthwhile, but live testing of the fail-over solution itself can add to application unavailability.
When the fail-over solution does not work as intended, and especially when something within the
clustering infrastructure goes wrong, the recovery time can be long. Likewise if a primary server fails
when the secondary is not available, or if the state of the transactions and data from the one server
cannot be properly shadowed, recovery can include a number of long-downtime events such as waiting
on a part to repair a server, or the time required to rebuild or recover data.
A good measurement of availability in a clustered environment should therefore include a factor for the
efficiency of the fail-over solution implemented; having some measure for how frequently a fail-over fails
and how long it takes to recover from that scenario.

In the figures below, the same Enterprise and Non-Enterprise clustering examples are evaluated with an
added factor that one time in twenty a fail-over event doesn’t go as planned and recovery from such
events takes a number of hours.

Figure 43: More Realistic Model of Clustering with Enterprise-Class Hardware

Time to detect outage: 1 minute
Time required to restart/recover application: 5 minutes
Ratio of problem failover events to successful events: 1:20
Application recovery time after a problem failover: 60 minutes

Outage Reason                                      MTBF      Recovery        Minutes    Mean Time to     Additional Min/Yr  Total Minutes  Availability Associated
                                                   (years)   (min/incident)  Down/Year  Fail-over Issue  for Fail-over      Down/Year      with Fault Type
                                                                                        (years)          Issues
Fault limited to application                           3         6.00           2.00         60               1.00              3.00        99.99943%
Fault causing OS crash                                10         6.00           0.60        200               0.30              0.90        99.99983%
Fault causing hypervisor crash                        80         6.00           0.08       1600               0.04              0.11        99.99998%
Fault impacting system (crash) but system
  recovers on reboot with enough resources
  to restart application                              80         6.00           0.08       1600               0.04              0.11        99.99998%
Planned hardware repair for hw fault (where
  initial fault impact could be any of the above)     70         6.00           0.09       1400               0.04              0.13        99.99998%
Fault impacting system where application is
  down until system is repaired                      500         6.00           0.01      10000               0.01              0.02       100.00000%
Total minutes of downtime per year: 4.27    Availability: 99.99919%

Figure 44: More Realistic Clustering with Non-Enterprise-Class Hardware

Time to detect outage: 1 minute
Time required to restart/recover application: 5 minutes
Ratio of problem failover events to successful events: 1:20
Application recovery time after a problem failover: 150 minutes

Outage Reason                                      MTBF      Recovery        Minutes    Mean Time to     Additional Min/Yr  Total Minutes  Availability Associated
                                                   (years)   (min/incident)  Down/Year  Fail-over Issue  for Fail-over      Down/Year      with Fault Type
                                                                                        (years)          Issues
Fault limited to application                           3         6.00           2.00         60               2.50              4.50        99.99914%
Fault causing OS crash                                10         6.00           0.60        200               0.75              1.35        99.99974%
Fault causing hypervisor crash                        30         6.00           0.20        600               0.25              0.45        99.99991%
Fault impacting system (crash) but system
  recovers on reboot with enough resources
  to restart application                              40         6.00           0.15        800               0.19              0.34        99.99994%
Planned hardware repair for hw fault (where
  initial fault impact could be any of the above)     40         6.00           0.15        800               0.19              0.34        99.99994%
Fault impacting system where application is
  down until system is repaired                       40         6.00           0.15        800               0.19              0.34        99.99994%
Total minutes of downtime per year: 7.31    Availability: 99.99861%

The example presumes somewhat longer recovery for the non-enterprise hardware due to the other kinds
of real-world conditions described in terms of parts acquisition, error detection/fault isolation (ED/FI) and
so forth.
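The "more realistic" clustering arithmetic of Figures 43 and 44 can be sketched the same way: every outage costs the fail-over time (detect plus restart), and one fail-over in twenty is assumed to go badly and cost a much longer recovery. The parameters below are the enterprise-class values from Figure 43.

MINUTES_PER_YEAR = 525_960     # 365.25 days

def clustered_minutes_down(mtbf_years, failover_minutes=6.0,
                           problem_ratio=1 / 20, problem_recovery_minutes=60.0):
    # Expected downtime per year for one outage class in a fail-over cluster
    # where a fraction of fail-overs do not go as planned.
    normal_failovers = failover_minutes / mtbf_years
    problem_failovers = problem_recovery_minutes * problem_ratio / mtbf_years
    return normal_failovers + problem_failovers

# MTBF values (years) for the outage classes in the enterprise-class example of Figure 43
mtbf_by_class = {"application": 3, "OS crash": 10, "hypervisor crash": 80,
                 "system crash, recovers on reboot": 80, "planned repair": 70,
                 "down until repaired": 500}

total = sum(clustered_minutes_down(m) for m in mtbf_by_class.values())
print(f"Total: {total:.2f} minutes/year -> {(1 - total / MINUTES_PER_YEAR):.5%} available")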

Though these examples presume too much to be specifically applicable to any given customer
environment, they are intended to illustrate two things:
The less frequently the hardware fails, the better the ideal availability, and the less perfect clustering must
be to achieve desired availability.
If the clustering and failover support elements themselves have bugs/pervasive issues, or single points of
failure besides the server hardware, less than 5 9s of availability (with reference to hardware faults) may
still occur in a clustered environment. It is possible that availability might be worse in those cases than in
a comparable stand-alone environment.

Reducing the Impact of Planned Downtime in a Clustered Environment


The previous examples did not look at the impact of planned outages except for deferred repair of a part
that caused some sort of outage.
Planned outages of systems may in many cases occur more frequently than unplanned outages. This is
especially true of less-than-enterprise class hardware that:
1. Require outages to patch code (OS, hypervisor, etc.)
2. Have no ability to repair hardware using integrated sparing and similar
techniques and instead must predictively take off-line components that may
otherwise subsequently fail and require an outage to repair.
3. Do not provide redundancy and concurrent repair of components like I/O
adapters.
In a clustered environment, when it is known in advance that a system needs to be taken down during a planned maintenance window, recovery and fail-over times can be minimized with some advance planning. Still, as long as fail-over techniques are used for the planned outages, one should expect recovery time to be in the minutes range.
However, it is also possible to take advantage of a highly virtualized environment to migrate applications
from one system to another in advance of a planned outage, without having to recover/restart the
applications.
PowerVM Enterprise Edition™ offers one such solution called Live Partition Mobility (LPM). In addition to
use in handling planned hardware outages, LPM can mitigate downtime associated with hardware and
software upgrades and system reconfiguration and other such activities which could also impact
availability and are otherwise not considered in even the “real world” 5 9s availability discussion.
HA Solutions Cost and Hardware Suitability

Clustering Resources
One of the obvious disadvantages of running in a clustered environment, as opposed to a standalone
system environment, is the need for additional hardware to accomplish the task.
An application running full-throttle on one system, prepared to failover on another, needs to have a
comparable capability (available processor cores, memory and so forth) on that other system.
There does not need to be exactly one back-up server for every server in production, however. If multiple
servers are used to run work-loads, then only a single backup system with enough capacity to handle the
workload of any one server might be deployed.
Alternatively, if multiple partitions are consolidated on multiple servers, then presuming that no server is fully utilized, fail-over might be planned so that the partitions of one failing server restart across multiple different servers.
When an enterprise has sufficient workload to justify multiple servers, either of these options reduces the
overhead for clustering.

Figure 45: Multi-system Clustering Option
[Diagram: multiple servers, each with its own computational hardware, virtualized I/O servers, LAN and SAN adapters, and hypervisor/virtualization layer, all attached to shared SAN storage so that the workload of a failing server can be restarted across the remaining servers.]

There are several additional variations that could be considered.


In practice there is typically a limit as to how many systems are clustered together. Reasons include concerns about increased risk of simultaneous server outages, relying too heavily on the availability of shared storage where shared storage is used, the overhead of keeping data in sync when shared storage is not used, and practical considerations of ensuring that, where necessary, all systems are aligned in terms of code updates, hardware configuration, and so forth.
In many cases it should also be noted that applications may have licensing terms that make clustering solutions more expensive. For example, applications, databases, and so forth may be licensed on a per-core basis. If the license is not based on how many cores are used at any given time, but on how many specific cores in specific systems the application might run on, then clustering increases licensing cost.

Using High Performance Systems


In these environments, it becomes useful to have a system that is not only reliable but also capable of being highly utilized. The more the processor and memory resources of a system are used, the fewer total system resources are required, and that impacts the cost of backup server resources and licenses.
Power Systems are designed with per-core performance in mind and are intended to run highly utilized. Deploying PowerVM allows for great depth in virtualization, allowing applications to take the most advantage of the processing, I/O, and storage capabilities. Power Systems thus have a natural affinity for clustered environments where maximizing resources is important in reducing ownership costs.
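
A minimal sketch, using assumed core counts and utilization targets rather than measured figures, of how higher sustained utilization reduces the number of servers, and therefore the backup capacity and licensed cores, needed for a given workload:

    # Hypothetical sketch: effect of target utilization on server count and
    # on backup/licensing overhead. All figures are illustrative assumptions.
    import math

    def servers_needed(workload_cores, cores_per_server, target_utilization):
        # Servers required to host the workload at a given utilization ceiling.
        usable = cores_per_server * target_utilization
        return math.ceil(workload_cores / usable)

    workload = 200           # cores of sustained demand (assumed)
    cores_per_server = 48    # cores per server (assumed)

    for utilization in (0.50, 0.85):
        n = servers_needed(workload, cores_per_server, utilization)
        total = (n + 1) * cores_per_server   # plus one equal-sized backup server
        print(f"{utilization:.0%} utilization: {n} production + 1 backup "
              f"= {total} cores to acquire and potentially license")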

Cloud Service Level Agreements and Availability
The preceding analysis establishes some means of understanding application availability regardless of
where applications are hosted – on premises, by a cloud provider, or using a hybrid approach.
It requires a thorough understanding of the underlying hardware capabilities, failover or remote restart
capabilities, and HA applications. This may be difficult to achieve in all circumstances.
For example, providers of cloud services may offer service level agreements (SLA) for the cloud services
they provide. An SLA may have availability guarantees. Such SLAs typically pertain to availability within a
given month with a failure to meet an SLA in a month providing a credit for the next month’s services.
A typical SLA provides several tiers of availability: say, a 10% credit if 99.95% availability isn't maintained, a 30% credit if 99% is not maintained, and a 100% credit if a level of 96% isn't achieved.

Figure 46: Hypothetical Service Level Agreement

*Where “down” and availability refer to the service period provided, not the customer application.
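
A minimal sketch of how such a tiered credit schedule could be expressed, using the hypothetical percentages quoted above (real SLAs differ by provider, and the tier boundaries here are assumptions):

    # Hypothetical tiered SLA credits matching the example percentages above.
    def monthly_sla_credit(measured_availability_pct):
        # Returns the service credit as a percentage of the monthly charge.
        if measured_availability_pct < 96.0:
            return 100
        if measured_availability_pct < 99.0:
            return 30
        if measured_availability_pct < 99.95:
            return 10
        return 0   # SLA met: no credit

    print(monthly_sla_credit(99.99))   # 0  - the service met its SLA
    print(monthly_sla_credit(98.50))   # 30 - below 99% for the month

Note that the credit is computed from the availability of the service as the provider defines it, which, as discussed below, may differ considerably from the availability of the hosted application.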

In evaluating a service level agreement, what the “availability of the service” means is critical.
Presuming the service is a virtual machine consisting of certain resources (processor cores and memory), these resources would typically be hosted on a server. Should a failure occur which terminates applications running on the virtual machine, then, depending on the SLA, the resources could be switched to a different server.
If switching the resources to a different server takes no more than 4.38 minutes and there is no more than
a single failure in a month, then the SLA of 99.99% would be met for the month.
However, such an SLA might take no account of how disruptive the failure might be to the application. While the service may be down for only a few minutes, it could take the better part of an hour or longer to restore the application being hosted. While the SLA may say that the service achieved 99.99% availability in such a case, application availability could be far less.
Consider the case of an application hosted on a virtual machine (VM) with a 99.99% availability SLA for the VM. To achieve the SLA, the VM would need to be restored in no more than about 4.38 minutes. This typically means being restored to a backup system.
If the application takes 100 minutes to recover after a new VM is made available (for example), the
application availability would be more like 99.76% for that month.
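
The arithmetic behind these figures, shown as a short sketch (assuming an average month of 43,800 minutes and the 100-minute application recovery time used in the example):

    # Worked example of the availability arithmetic described above.
    MINUTES_PER_MONTH = 43_800            # 365 * 24 * 60 / 12 (average month)

    def availability_pct(downtime_minutes, period_minutes=MINUTES_PER_MONTH):
        return 100.0 * (1.0 - downtime_minutes / period_minutes)

    service_outage = 4.38   # minutes to provide the VM elsewhere (0.01% of a month)
    app_recovery = 100.0    # minutes for the application itself to restart (assumed)

    print(f"{availability_pct(service_outage):.2f}%")                 # 99.99% - the service SLA
    print(f"{availability_pct(service_outage + app_recovery):.2f}%")  # 99.76% - the application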

Figure 47: Hypothetical Application Downtime meeting a 99.99% SLA

Outage reason | Downtime (minutes) | Actual availability for the month | Downtime if one outage in a year (minutes) | Actual availability for that year | Downtime if 12 outages in a year (minutes) | Actual availability for the year
Service outage | 4.38 | | 4.38 | | 52.56 |
Downtime for application to restore on new virtual machine | 100.00 | | 100.00 | | 1200 |
Total downtime (sum of service outage + application recovery) | 104.38 | 99.76% | 104.38 | 99.98% | 1252.56 | 99.76%
If that were the only outage in the year, the availability across the year would be about 99.98%.
The SLA, however, could permit a single outage every month.
In such a case, with the application downtime typically much higher than the server outage time, even an SLA of 99.99% availability, or 4.38 minutes per month, can prove disruptive to critical applications in a cloud environment.
The less available the server is, the more frequently client application restarts can occur. These more frequent restarts mean that clients do not have access to their applications, and the effect is compounded over time.
In such situations, the importance of using enterprise-class servers for application availability can't be understood just by looking at a monthly service level agreement.
To summarize what was stated previously, it is difficult to compare estimates or claims of availability
without understanding specifically:
1. What kind of failures are included (unplanned hardware only, or entire stack)?
2. What is the expected mean time between failures and how is it computed (monthly SLA, an average across multiple systems over time, etc.)?
3. What is done to restore compute facilities in the face of a failure, and how is recovery time computed?
4. What is expected of both hardware and software configuration to achieve the availability
targets?
5. And for actual application availability, what the recovery time of the application is given
each of the failure scenarios?

Section 5: Serviceability
The purpose of serviceability is to efficiently repair the system while attempting to minimize or eliminate
impact to system operation. Serviceability includes new system installation, Miscellaneous Equipment Specification (MES), which involves system upgrades/downgrades, and system maintenance/repair.
Based on the system warranty and maintenance contract, service may be performed by the client, an IBM
representative, or an authorized warranty service provider.

The serviceability features delivered in IBM Power system ensure a highly efficient service environment
by incorporating the following attributes:

• Design for SSR Setup, Install, and Service


• Error Detection and Fault Isolation (ED/FI)
• First Failure Data Capture (FFDC)
• Light path service indicators
• Service and FRU labels on the systems
• Service procedures documented in IBM Documentation and available on the HMC
• Automatic reporting of serviceable events to IBM through the Electronic Service Agent Call Home
application

Service Environment
In the PowerVM environment, the HMC is a dedicated server that provides functions for configuring and managing servers, either logically partitioned or running a full-system partition, using a GUI, a command-line interface (CLI), or a REST API. An HMC attached to the system enables support personnel (with client authorization) to log in remotely or locally to review error logs and perform remote maintenance if required.

Power systems support multiple service environment options:

• HMC Attached - one or more HMCs or vHMCs are supported by the system with PowerVM. This
is the default configuration for servers supporting logical partitions with dedicated or virtual I/O. In
this case, all servers have at least one logical partition.
• HMC-less - There are two service strategies for non-HMC-managed systems.
1. Full-system partition with PowerVM: A single partition owns all the server resources
and only one operating system may be installed. The primary service interface is through
the operating system and the service processor.
2. Partitioned system with NovaLink: In this configuration, the system can have more
than one partition and can be running more than one operating system. The primary
service interface is through the service processor.

Service Interface
Support personnel can use the service interface to communicate with the service support applications in a
server using an operator console, a graphical user interface on the management console or service
processor, or an operating system terminal. The service interface helps to deliver a clear, concise view of
available service applications, helping the support team to manage system resources and service
information in an efficient and effective way. Applications available through the service interface are
carefully configured and placed to give service providers access to important service functions.

Different service interfaces are used, depending on the state of the system, hypervisor, and operating
environment. The primary service interfaces are:

• LEDs
• Operator Panel
• BMC Service Processor menu
• Operating system service menu
• Service Focal Point on the HMC or vHMC with PowerVM

In the light path LED implementation, the system can clearly identify components for replacement by using specific component-level LEDs, and can also guide the servicer directly to the component by signaling (turning on solid) the enclosure fault LED and the component FRU fault LED. The servicer can also use the identify function to blink the FRU-level LED. When this function is activated, a roll-up to the blue enclosure identify LED occurs to identify the enclosure in the rack. These enclosure LEDs turn on solid and can be used to follow the light path from the enclosure down to the specific FRU in the PowerVM environment.

First Failure Data Capture and Error Data Analysis


First Failure Data Capture (FFDC) is a technique that helps ensure that when a fault is detected in a
system, the root cause of the fault will be captured without the need to re-create the problem or run any
sort of extended tracing or diagnostics program. For the vast majority of faults, a good FFDC design
means that the root cause can also be determined automatically without servicer or human intervention.

FFDC information, error data analysis, and fault isolation are necessary to implement the advanced
serviceability techniques that enable efficient service of the systems and to help determine the failing
items.

In the rare cases where FFDC and error data analysis cannot isolate a fault, diagnostics are required to re-create the failure and determine the failing items.

Diagnostics
General diagnostic objectives are to detect and identify problems so they can be resolved quickly.
The elements of IBM's diagnostics strategy are to:

• Provide a common error code format equivalent to a system reference code with PowerVM,
system reference number, checkpoint, or firmware error code.
• Provide fault detection and problem isolation procedures. Support remote connection capability
that can be used by the IBM Remote Support Center or IBM Designated Service.
• Provide interactive intelligence within the diagnostics, with detailed online failure information,
while connected to IBM's back-end system.

Automated Diagnostics
The processor and memory FFDC technologies are designed to perform without the need for problem re-creation or user intervention. The firmware runtime diagnostics code leverages these hardware fault isolation facilities to accurately determine system problems and to take the appropriate actions. Most solid and intermittent errors can be correctly detected and isolated at the time the failure occurs, whether during runtime or boot time. In the few situations where automated system diagnostics cannot decipher the root cause of an issue, service support intervention is required.

Stand-alone Diagnostics
As the name implies, stand-alone or user-initiated diagnostics require user intervention. The user must
perform manual steps, which may include:

• Booting from the diagnostics CD, DVD, USB, or network
• Interactively selecting steps from a list of choices

Concurrent Maintenance
The determination of whether a firmware release can be updated concurrently is identified in the readme
information file that is released with the firmware. An HMC is required for the concurrent firmware update
with PowerVM. In addition, as discussed in more detail in other sections of this document, concurrent maintenance of PCIe adapters and NVMe drives is supported with PowerVM. Power supplies, fans, and the operator panel LCD are hot-pluggable as well.

Service Labels
Service providers use these labels to assist them in performing maintenance actions. Service labels are
found in various formats and positions and are intended to transmit readily available information to the
servicer during the repair process. Following are some of these service labels and their purpose:

• Location diagrams: Location diagrams are located on the system hardware, relating information
regarding the placement of hardware components. Location diagrams may include location
codes, drawings of physical locations, concurrent maintenance status, or other data pertinent to a
repair. Location diagrams are especially useful when multiple components such as DIMMs,
processors, fans, adapter cards, and power supplies are installed.
• Remove/replace procedures: Service labels that contain remove/replace procedures are often
found on a cover of the system or in other spots accessible to the servicer. These labels provide
systematic procedures, including diagrams detailing how to remove or replace certain serviceable
hardware components.
• Arrows: Numbered arrows are used to indicate the order of operation and the serviceability
direction of components. Some serviceable parts such as latches, levers, and touch points need
to be pulled or pushed in a certain direction and in a certain order for the mechanical mechanisms
to engage or disengage. Arrows generally improve the ease of serviceability.

QR Labels
QR labels are placed on the system to provide access to key service functions through a mobile device. When the QR label is scanned, it directs the user to a landing page for Power10 processor-based systems. The landing page contains links to each machine type and model's (MTM) service functions and is useful to a servicer or operator physically located at the machine. The service functions include things such as installation and repair instructions, reference code look-up, and so on.

Packaging for Service


The following service features are included in the physical packaging of the systems to facilitate service:

• Color coding (touch points): Blue-colored touch points identify where a component can be safely handled for service actions such as removal or installation.
• Tool-less design: Selected IBM systems support tool-less or simple-tool designs. These designs require no tools, or only simple tools such as flat-head screwdrivers, to service the hardware components.

• Positive retention: Positive retention mechanisms help to assure proper connections between
hardware components such as cables to connectors, and between two cards that attach to each
other. Without positive retention, hardware components run the risk of becoming loose during
shipping or installation, preventing a good electrical connection. Positive retention mechanisms
like latches, levers, thumbscrews, pop Nylatches (U-clips), and cables are included to help
prevent loose connections and aid in installing (seating) parts correctly. These positive retention
items do not require tools.

Error Handling and Reporting


In the event of system hardware or environmentally induced failure, the system runtime error capture
capability systematically analyzes the hardware error signature to determine the cause of failure. The
analysis result will be stored in system NVRAM. When the system can be successfully restarted either
manually or automatically, or if the system continues to operate, the error will be reported to the operating
system. Hardware and software failures are recorded in the system error log filesystem.

When an HMC is attached in the PowerVM environment, an Error Log Analysis (ELA) routine analyzes the error, forwards the
event to the Service Focal Point (SFP) application running on the HMC, and notifies the system
administrator that it has isolated a likely cause of the system problem. The service processor event log
also records unrecoverable checkstop conditions, forwards them to the SFP application, and notifies the
system administrator.

The system has the ability to call home through the operating system to report platform-recoverable errors
and errors associated with PCIe adapters/devices.

In the HMC-managed environment, a call home service request will be initiated from the HMC and the
pertinent failure data with service parts information and part locations will be sent to an IBM service
organization. Customer contact information and specific system related data such as the machine type,
model, and serial number, along with error log data related to the failure, are sent to IBM Service.

Call Home
Call home refers to an automatic or manual call from a client location to the IBM support structure with
error log data, server status, or other service-related information. Call home invokes the service
organization in order for the appropriate service action to begin. Call home can be done through the
Electronic Service Agent (ESA) embedded in the HMC, through a version of ESA embedded in the operating systems for non-HMC-managed systems, or through a version of ESA that runs as a standalone call home application. While configuring call home is optional, clients are encouraged to implement this feature in
order to obtain service enhancements such as reduced problem determination and faster and potentially
more accurate transmittal of error information. In general, using the call home feature can result in
increased system availability. See the next section for specific details on this application.

IBM Electronic Services


Electronic Service Agent (ESA) and Client Support Portal (CSP) comprise the IBM Electronic Services solution, which is architected to provide fast, exceptional support to IBM clients. IBM ESA is a no-
charge tool that proactively monitors and reports hardware events such as system errors and collects
hardware and software inventory. ESA can help customers focus on their core company business
initiatives, save time, and spend less effort in managing their day-to-day IT maintenance issues. In
addition, Call Home Cloud Connect Web and Mobile capability extends the common solution and offers
IBM Systems related support information applicable to Servers and Storage.
Details are available here - IBM Client Vantage

System configuration and inventory information collected by ESA also can be used to improve problem
determination and resolution between the client and the IBM support team. As part of an increased focus
to provide even better service to IBM clients, ESA tool configuration and activation comes standard with
the system. In support of this effort, an HMC External Connectivity security whitepaper has been
published, which describes data exchanges between the HMC and the IBM Service Delivery Center
(SDC) and the methods and protocols for this exchange. To read the whitepaper and prepare for ESA
installation, see the "Security" section at IBM Electronic Service Agent

Benefits of ESA
• Increased Uptime: ESA is designed to enhance the warranty and maintenance service by
potentially providing faster hardware error reporting and uploading system information to IBM
Support. This can reduce the time spent monitoring symptoms, diagnosing the error, and manually calling IBM Support to open a problem record. And 24x7 monitoring and reporting means no
more dependency on human intervention or off-hours client personnel when errors are
encountered in the middle of the night.
• Security: The ESA tool is designed to help secure the monitoring, reporting, and storing of the data at IBM, and to transmit data securely through the internet (HTTPS), providing clients a single point of exit from their site. Initiation of communication is one way.
Activating ESA does not enable IBM to call into a client's system. For additional information, see
the IBM Electronic Service Agent website.
• More Accurate Reporting: Because system information and error logs are automatically
uploaded to the IBM Support Center in conjunction with the service request, clients are not
required to find and send system information, decreasing the risk of misreported or misdiagnosed
errors. Once inside IBM, problem error data is run through a data knowledge management
system, and knowledge articles are appended to the problem record.

Remote Code Load (RCL)


The HMC 1030 release supports remote code load for Power systems firmware. It is a feature that allows a remote support engineer to upgrade or update code. RCL is supported with the Expert Care Premium package.
For more details, please refer to the IBM Remote Code Load website.

Client Support Portal


Client Support Portal is a single internet entry point that replaces the multiple entry points traditionally
used to access IBM Internet services and support. This web portal enables you to gain easier access to
IBM resources for assistance in resolving technical problems.

This web portal provides valuable reports of installed hardware and software using information collected
from the systems by IBM Electronic Service Agent. Reports are available for any system associated with
the customer's IBM ID.

For more information on how to utilize client support portal, visit the following website: Client Support
Portal or contact an IBM Systems Services Representative (SSR).

Summary

Investing in RAS
Systems designed for RAS may be more costly at the “bill of materials” level than systems with little
investment in RAS.
Some examples as to why this could be so:
In terms of error detection and fault isolation: simplified, at the low level, having an 8-bit bus takes a certain number of circuits. Adding an extra bit to detect a single fault adds hardware to the bus. In a classic Hamming code, five bits of error-checking data might be required for 15 bits of data to allow for single-bit error correction, with an additional check bit needed for double-bit error detection. Then there is the logic involved in generating the error detection bits and checking/correcting for errors.
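
As a rough illustration of that overhead, the following sketch computes the minimum number of Hamming check bits for a given data width using the standard Hamming bound; it is generic textbook arithmetic, not a description of any specific IBM ECC implementation:

    # Sketch: minimum check bits for a Hamming code over `data_bits` data bits.
    def sec_check_bits(data_bits):
        # Single-error correction requires 2**r >= data_bits + r + 1.
        r = 1
        while 2 ** r < data_bits + r + 1:
            r += 1
        return r

    def sec_ded_check_bits(data_bits):
        # One additional overall parity bit adds double-error detection.
        return sec_check_bits(data_bits) + 1

    print(sec_check_bits(15))       # 5 check bits: single-bit correction for 15 data bits
    print(sec_ded_check_bits(15))   # 6 check bits for correction plus double-bit detection
    print(sec_ded_check_bits(64))   # 8 check bits, e.g. a 72-bit ECC word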
In some cases, better availability is achieved by having fully redundant components, which more than doubles the cost of the components, or by having some amount of n+1 redundancy or sparing, which still adds costs at a somewhat lesser rate.
In terms of reliability, highly reliable components will cost more. This may be true of the intrinsic design and of the materials used, including the design of connectors, fans, and power supplies.
Increased reliability in the way components are manufactured can also increase costs. Extensive test time in manufacturing and a process to "burn in" parts and screen out weak modules increase costs. The highest levels of part reliability may be achieved by rejecting entire lots (even the good components) when the overall failure rates for a lot are excessive. All of these increase the costs of the components.
Design for serviceability, especially for concurrent maintenance, is typically more involved than a design where serviceability is not a concern. This is especially true when designing, for example, for concurrent maintenance of components like I/O adapters.
Beyond the hardware costs, it takes development effort to code software to take advantage of the hardware RAS features, and additional time to test the many "bad-path" scenarios that can be envisioned.
On the other hand, in all systems, scale-up and scale-out, investing in system RAS has a purpose. Just as there are recurring costs for software licenses in most enterprise applications, there is a recurring cost associated with maintaining systems. These costs include the direct costs, such as the cost for replacement components and the cost associated with the labor required to diagnose and repair a system.
The somewhat more indirect costs of poor RAS are often the main reasons for investing in systems with superior RAS characteristics, and over time these have become even more important to customers. The importance is often directly related to:
• The importance associated with discovering errors before relying on faulty data or computation, including the ability to know when to switch over to redundant or alternate resources.
• The costs associated with downtime to do problem determination or error re-creation, if insufficient fault isolation is provided in the system.
• The cost of downtime when a system fails unexpectedly or needs to fail over and an application is disrupted during the failover process.
• The costs associated with planning an outage for the repair of hardware or firmware, especially when the repair is not concurrent.
• In a cloud environment, the operations cost of server evacuation.
In a well-designed system investing in RAS minimizes the need to repair components that are failing.
Systems that recover rather than crash and need repair when certain soft errors occur will minimize
indirect costs associated with such events. Use of selective self-healing so that, for example, a processor
does not have to be replaced simply because a single line of data on an I/O bus has a fault reduces
planned outage costs.
In scale-out environments the reliability of components and their serviceability can be measured and
weighed against the cost associated with maintaining the highest levels of reliability in a system.
In a scale-up environment, the indirect costs of outages and failovers usually outweigh the direct costs of
the repair. An emphasis is therefore put on designs that increase availability in the face of failures, such as having redundancy, even when the result is higher system and maintenance costs.

Final Word
The POWER9 and Power10 processor-based systems discussed leverage the long heritage of Power
Systems designed for RAS. The different servers aimed at different scale-up and scale-out environments
provide significant choice in selecting servers geared towards the application environments end-users will
deploy. The RAS features in each segment differ, but each provides substantial advantages compared to designs with less of an up-front RAS focus.

About the principal authors/editors:


Daniel Henderson is an IBM Senior Technical Staff Member. He has been involved with the development and support of POWER and predecessor RISC-based products since the earliest RISC systems. He is currently the lead system hardware availability designer for IBM Power Systems PowerVM-based platforms.
Irving Baysah is a senior hardware development engineer with over 20 years of experience working on IBM Power systems. He designed memory controller and GX I/O logic on multiple generations of IBM Power processors. As a system integration and post-silicon validation engineer, he successfully led the verification of complex RAS functions and system features. He has managed a system cluster automation team that developed tools for rapid deployment in a private cloud, the PCIe Gen3/4/5 I/O bring-up team, and the Power RAS architecture team. He is currently the lead RAS architect for IBM Power Systems.

Notices:

This information was developed for products and services offered in the U.S.A.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the product(s) and/or program(s) currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions; therefore, this statement may not apply to you.

This information is intended to give a general understanding of concepts only. This information could include technical inaccuracies or typographical errors. Changes may be made periodically to the information herein in new editions of this publication. IBM may make improvements and/or changes in the products and services described in this publication at any time without notice.

Any references in this information to non-IBM websites are provided for convenience only and do not in any manner serve as an endorsement of those websites. The materials at those websites are not part of the materials for this IBM product and use of those websites is at your own risk.

Any performance data contained herein was determined in a controlled environment. Therefore, the results obtained in other operating environments may vary significantly. Some measurements may have been made on development-level systems and there is no guarantee that these measurements will be the same on generally available systems. Furthermore, some measurements may have been estimated through extrapolation. Actual results may vary. Users of this document should verify the applicable data for their specific environment.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements, or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

© IBM Corporation 2014-2021
IBM Corporation
Systems Group
Route 100
Somers, New York 10589

Produced in the United States of America in September 2021
All Rights Reserved
