SAS Host Bus Adapters 

Problem Determination and Service Guide


Note: Before using this information and the product it supports, read the general information in Appendix B, “Notices,” on page 59.

First Edition (September 2011)


© Copyright IBM Corporation 2010.
US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract
with IBM Corp.
Contents

Safety
  Guidelines for trained service technicians
    Inspecting for unsafe conditions
    Guidelines for servicing electrical equipment
  Safety statements

Chapter 1. Introduction
  Overview
  Related documentation
  Notices and statements in this document
  Installation guidelines
    System reliability guidelines
    Working inside the server with the power on
    Handling static-sensitive devices
    Returning a device or component

Chapter 2. Start here

Chapter 3. Understanding the operating environment
  Overview
  Using the PDSG in a mixed RAID and non-RAID environment (configured with a ServeRAID H1110)
  IBM servers that have a hard disk drive backplane
  IBM servers that support simple-swap hard disk drives

Chapter 4. ServeRAID H1110 features

Chapter 5. Physical hard disk drive utilities

Chapter 6. General problem determination tips

Chapter 7. Problem determination procedures
  Hard disk drive LED-to-action
  POST messages-to-action
    Fault codes
    Boot messages
  Event messages-to-actions
  Symptoms-to-actions
    The SAS HBA is not seen during POST, or the Preboot GUI is not accessible
    One or more SAS HBAs are inaccessible when multiple storage controllers are installed
  System events-to-actions index

Chapter 8. Replaceable components

Appendix A. Getting help and technical assistance
  Before you call
  Using the documentation
  Getting help and information from the World Wide Web
  Software service and support
  Hardware service and support
  IBM Taiwan product service

Appendix B. Notices
  Trademarks
  Important notes

Index


Safety
Before installing this product, read the Safety Information.

Antes de instalar este produto, leia as Informações de Segurança.

Læs sikkerhedsforskrifterne, før du installerer dette produkt.

Lees voordat u dit product installeert eerst de veiligheidsvoorschriften.

Ennen kuin asennat tämän tuotteen, lue turvaohjeet kohdasta Safety Information.

Avant d'installer ce produit, lisez les consignes de sécurité.

Vor der Installation dieses Produkts die Sicherheitshinweise lesen.

Prima di installare questo prodotto, leggere le Informazioni sulla Sicurezza.

Les sikkerhetsinformasjonen (Safety Information) før du installerer dette produktet.

Antes de instalar este produto, leia as Informações sobre Segurança.

Antes de instalar este producto, lea la información de seguridad.

Läs säkerhetsinformationen innan du installerar den här produkten.


Guidelines for trained service technicians
This section contains information for trained service technicians.

Inspecting for unsafe conditions


Use the information in this section to help you identify potential unsafe conditions in
an IBM product that you are working on. Each IBM product, as it was designed and
manufactured, has required safety items to protect users and service technicians
from injury. The information in this section addresses only those items. Use good
judgment to identify potential unsafe conditions that might be caused by non-IBM
alterations or attachment of non-IBM features or optional devices that are not
addressed in this section. If you identify an unsafe condition, you must determine
how serious the hazard is and whether you must correct the problem before you
work on the product.

Consider the following conditions and the safety hazards that they present:
v Electrical hazards, especially primary power. Primary voltage on the frame can
cause serious or fatal electrical shock.
v Explosive hazards, such as a damaged CRT face or a bulging capacitor.
v Mechanical hazards, such as loose or missing hardware.

To inspect the product for potential unsafe conditions, complete the following steps:
1. Make sure that the power is off and the power cord is disconnected.
2. Make sure that the exterior cover is not damaged, loose, or broken, and
observe any sharp edges.
3. Check the power cord:
v Make sure that the third-wire ground connector is in good condition. Use a
meter to measure third-wire ground continuity for 0.1 ohm or less between
the external ground pin and the frame ground.
v Make sure that the power cord is the correct type.
v Make sure that the insulation is not frayed or worn.
4. Remove the cover.
5. Check for any obvious non-IBM alterations. Use good judgment as to the safety
of any non-IBM alterations.
6. Check inside the server for any obvious unsafe conditions, such as metal filings,
contamination, water or other liquid, or signs of fire or smoke damage.
7. Check for worn, frayed, or pinched cables.
8. Make sure that the power-supply cover fasteners (screws or rivets) have not
been removed or tampered with.

Guidelines for servicing electrical equipment


Observe the following guidelines when you service electrical equipment:
v Check the area for electrical hazards such as moist floors, nongrounded power
extension cords, and missing safety grounds.
v Use only approved tools and test equipment. Some hand tools have handles that
are covered with a soft material that does not provide insulation from live
electrical currents.
v Regularly inspect and maintain your electrical hand tools for safe operational
condition. Do not use worn or broken tools or testers.
v Do not touch the reflective surface of a dental mirror to a live electrical circuit.
The surface is conductive and can cause personal injury or equipment damage if
it touches a live electrical circuit.
v Some rubber floor mats contain small conductive fibers to decrease electrostatic
discharge. Do not use this type of mat to protect yourself from electrical shock.
v Do not work alone under hazardous conditions or near equipment that has
hazardous voltages.
v Locate the emergency power-off (EPO) switch, disconnecting switch, or electrical
outlet so that you can turn off the power quickly in the event of an electrical
accident.
v Disconnect all power before you perform a mechanical inspection, work near
power supplies, or remove or install main units.
v Before you work on the equipment, disconnect the power cord. If you cannot
disconnect the power cord, have the customer power-off the wall box that
supplies power to the equipment and lock the wall box in the off position.
v Never assume that power has been disconnected from a circuit. Check it to
make sure that it has been disconnected.
v If you have to work on equipment that has exposed electrical circuits, observe
the following precautions:
– Make sure that another person who is familiar with the power-off controls is
near you and is available to turn off the power if necessary.
– When you are working with powered-on electrical equipment, use only one
hand. Keep the other hand in your pocket or behind your back to avoid
creating a complete circuit that could cause an electrical shock.
– When you use a tester, set the controls correctly and use the approved probe
leads and accessories for that tester.
– Stand on a suitable rubber mat to insulate you from grounds such as metal
floor strips and equipment frames.
v Use extreme care when you measure high voltages.
v To ensure proper grounding of components such as power supplies, pumps,
blowers, fans, and motor generators, do not service these components outside of
their normal operating locations.
v If an electrical accident occurs, use caution, turn off the power, and send another
person to get medical aid.

Safety statements
Important: Each caution and danger statement in this document is labeled with a number. This
number is used to cross reference an English-language caution or danger
statement with translated versions of the caution or danger statement in the Safety
Information document.

For example, if a caution statement is labeled “Statement 1,” translations for that
caution statement are in the Safety Information document under “Statement 1.”

Be sure to read all caution and danger statements in this document before you
perform the procedures. Read any additional safety information that comes with the
server or optional device before you install the device.

Statement 1:

DANGER

Electrical current from power, telephone, and communication cables is
hazardous.

To avoid a shock hazard:
v Do not connect or disconnect any cables or perform installation,
maintenance, or reconfiguration of this product during an electrical
storm.
v Connect all power cords to a properly wired and grounded electrical
outlet.
v Connect to properly wired outlets any equipment that will be attached to
this product.
v When possible, use one hand only to connect or disconnect signal
cables.
v Never turn on any equipment when there is evidence of fire, water, or
structural damage.
v Disconnect the attached power cords, telecommunications systems,
networks, and modems before you open the device covers, unless
instructed otherwise in the installation and configuration procedures.
v Connect and disconnect cables as described in the following table when
installing, moving, or opening covers on this product or attached
devices.

To Connect:                               To Disconnect:

1. Turn everything OFF.                   1. Turn everything OFF.
2. First, attach all cables to devices.   2. First, remove power cords from outlet.
3. Attach signal cables to connectors.    3. Remove signal cables from connectors.
4. Attach power cords to outlet.          4. Remove all cables from devices.
5. Turn device ON.

Chapter 1. Introduction
This Problem Determination and Service Guide provides guidance for
troubleshooting IBM Host Bus Adapters (HBAs).

Overview
The following SAS HBAs are supported by this document:
v ServeRAID H1110 SAS/SATA Controller for IBM System x
v IBM 6 Gb Performance Optimized HBA
v IBM 6 Gb SAS HBA

The following software is supported by this document:


v MegaRAID Storage Manager (MSM) (for the ServeRAID H1110 only)
v SAS2 BIOS Configuration Utility
v SAS2 Integrated RAID Configuration Utility (sas2ircu command-line tools)

Related documentation
The following documentation comes with the SAS HBAs:
v Quick Installation Guide (product-specific document)
This printed document provides the instructions for installing the SAS HBA
hardware. You can also download the Portable Document Format (PDF) version of
this document from the IBM Storage Matrix at
http://www-947.ibm.com/systems/support/supportsite.wss/docdisplay?lndocid=SERV-RAID&brandind=5000008.
v Installation and User's Guide (product-specific document)
This document is in PDF and provides detailed information for using the SAS
HBA hardware and software. You can also download this document from the
IBM Storage Matrix at
http://www-947.ibm.com/systems/support/supportsite.wss/docdisplay?lndocid=SERV-RAID&brandind=5000008.
v SAS2 BIOS Configuration Utility User's Guide
This document is in PDF and provides detailed information about using the SAS2
BIOS Configuration Utility to configure the SAS HBA.
v SAS2 Integrated RAID Configuration Utility User's Guide
This document is in PDF and provides detailed information about using the SAS2
Integrated RAID Configuration Utility to configure the SAS HBA.
v MegaRAID Storage Manager (MSM) User's Guide (for the ServeRAID H1110
only)
This document is in PDF and provides detailed information about using the
MegaRAID Storage Manager software package for managing and configuring
installed ServeRAID controllers.

IBM publishes updates for known issues on a regular basis. For a problem that is
not covered by the documentation that comes with the SAS HBA or in this Problem
Determination and Service Guide, go to http://www.ibm.com/systems/support/.

Notices and statements in this document
The caution and danger statements in this document are also in the multilingual
Safety Information document, which is on the Documentation CD. Each statement is
numbered for reference to the corresponding statement in your language in the
Safety Information document.

The following notices and statements are used in this document:


v Note: These notices provide important tips, guidance, or advice.
v Important: These notices provide information or advice that might help you avoid
inconvenient or problem situations.
v Attention: These notices indicate potential damage to programs, devices, or
data. An attention notice is placed just before the instruction or situation in which
damage might occur.
v Caution: These statements indicate situations that can be potentially hazardous
to you. A caution statement is placed just before the description of a potentially
hazardous procedure step or situation.
v Danger: These statements indicate situations that can be potentially lethal or
extremely hazardous to you. A danger statement is placed just before the
description of a potentially lethal or extremely hazardous procedure step or
situation.

Installation guidelines
Before you remove or replace a component, read the following information:
v Read the safety information that begins on page v, the guidelines in “Working
inside the server with the power on” on page 4, and “Handling static-sensitive
devices” on page 4. This information will help you work safely.
v When you install your new server, take the opportunity to download and apply
the most recent firmware updates. This step will help to ensure that any known
issues are addressed and that your server is ready to function at maximum
levels of performance.
Important: Some cluster solutions require specific code levels or coordinated
code updates. If the device is part of a cluster solution, verify that the latest level
of code is supported for the cluster solution before you update the code.
To download firmware updates for your server, complete the following steps:
1. Go to http://www.ibm.com/systems/support/.
2. Under Product support, click System x.
3. Under Popular links, click Software and device drivers.
4. Click System x3650 M2 to display the matrix of downloadable files for the
server.
For additional information about tools for updating, managing, and deploying
firmware, see the System x and xSeries Tools Center at
http://publib.boulder.ibm.com/infocenter/toolsctr/v1r0/index.jsp.
v Before you install optional hardware, make sure that the server is working
correctly. Start the server, and make sure that the operating system starts, if an
operating system is installed, or that a 19990305 error code is displayed,
indicating that an operating system was not found but the server is otherwise
working correctly. If the server is not working correctly, see Chapter 7, “Problem
determination procedures,” on page 19 for diagnostic information.
v Observe good housekeeping in the area where you are working. Place removed
covers and other parts in a safe place.
v If you must start the server while the cover is removed, make sure that no one is
near the server and that no tools or other objects have been left inside the
server.
v Do not attempt to lift an object that you think is too heavy for you. If you have to
lift a heavy object, observe the following precautions:
– Make sure that you can stand safely without slipping.
– Distribute the weight of the object equally between your feet.
– Use a slow lifting force. Never move suddenly or twist when you lift a heavy
object.
– To avoid straining the muscles in your back, lift by standing or by pushing up
with your leg muscles.
v Make sure that you have an adequate number of properly grounded electrical
outlets for the server, monitor, and other devices.
v Back up all important data before you make changes to disk drives.
v Have a small flat-blade screwdriver available.
v To view the error LEDs on the system board and internal components, leave the
server connected to power.
v You do not have to turn off the server to install or replace hot-swap fans,
redundant hot-swap ac power supplies, or hot-plug Universal Serial Bus (USB)
devices. However, you must turn off the server before you perform any steps that
involve removing or installing adapter cables or non-hot-swap optional devices or
components.
v Blue on a component indicates touch points, where you can grip the component
to remove it from or install it in the server, open or close a latch, and so on.
v Orange on a component or an orange label on or near a component indicates
that the component can be hot-swapped, which means that if the server and
operating system support hot-swap capability, you can remove or install the
component while the server is running. (Orange can also indicate touch points on
hot-swap components.) See the instructions for removing or installing a specific
hot-swap component for any additional procedures that you might have to
perform before you remove or install the component.
v When you are finished working on the server, reinstall all safety shields, guards,
labels, and ground wires.
v For a list of supported optional devices for the server, see
http://www.ibm.com/servers/eserver/serverproven/compat/us/.

System reliability guidelines


To help ensure proper cooling and system reliability, make sure that:
v Each of the drive bays has a drive or a filler panel and electromagnetic
compatibility (EMC) shield installed in it.
v If the server has redundant power, each of the power-supply bays has a power
supply installed in it.
v There is adequate space around the server to allow the server cooling system to
work properly. Leave approximately 50 mm (2.0 in.) of open space around the
front and rear of the server. Do not place objects in front of the fans. For proper
cooling and airflow, replace the server cover before turning on the server.
Operating the server for extended periods of time (more than 30 minutes) with
the server cover removed might damage server components.
v You have followed the cabling instructions that come with optional adapters.
v You have replaced a failed fan within 48 hours.
v You have replaced a hot-swap fan within 30 seconds of removal.
v You do not operate the server without the air baffles installed. Operating the
server without the air baffles might cause the microprocessor to overheat.

Working inside the server with the power on


Attention: Static electricity that is released to internal server components when
the server is powered-on might cause the server to halt, which might result in the
loss of data. To avoid this potential problem, always use an electrostatic-discharge
wrist strap or other grounding system when you work inside the server with the
power on.

The server supports hot-plug, hot-add, and hot-swap devices and is designed to
operate safely while it is turned on and the cover is removed. Follow these
guidelines when you work inside a server that is turned on:
v Avoid wearing loose-fitting clothing on your forearms. Button long-sleeved shirts
before working inside the server; do not wear cuff links while you are working
inside the server.
v Do not allow your necktie or scarf to hang inside the server.
v Remove jewelry, such as bracelets, necklaces, rings, and loose-fitting wrist
watches.
v Remove items from your shirt pocket, such as pens and pencils, that could fall
into the server as you lean over it.
v Avoid dropping any metallic objects, such as paper clips, hairpins, and screws,
into the server.

Handling static-sensitive devices


Attention: Static electricity can damage the server and other electronic devices.
To avoid damage, keep static-sensitive devices in their static-protective packages
until you are ready to install them.

To reduce the possibility of damage from electrostatic discharge, observe the
following precautions:
v Limit your movement. Movement can cause static electricity to build up around
you.
v The use of a grounding system is recommended. For example, wear an
electrostatic-discharge wrist strap, if one is available. Always use an
electrostatic-discharge wrist strap or other grounding system when you work
inside the server with the power on.
v Handle the device carefully, holding it by its edges or its frame.
v Do not touch solder joints, pins, or exposed circuitry.
v Do not leave the device where others can handle and damage it.
v While the device is still in its static-protective package, touch it to an unpainted
metal surface on the outside of the server for at least 2 seconds. This drains
static electricity from the package and from your body.
v Remove the device from its package and install it directly into the server without
setting down the device. If it is necessary to set down the device, put it back into
its static-protective package. Do not place the device on the server cover or on a
metal surface.
v Take additional care when handling devices during cold weather. Heating reduces
indoor humidity and increases static electricity.

Returning a device or component
If you are instructed to return a device or component, follow all packaging
instructions, and use any packaging materials for shipping that are supplied to you.

Chapter 2. Start here
You can solve many problems without outside assistance by following the
troubleshooting procedures in this Problem Determination and Service Guide and
on the IBM website. This document describes the troubleshooting procedures and
explains event messages and error codes. The documentation that comes
with your operating system and software also contains troubleshooting information.

Before you contact IBM or an approved warranty service provider, follow these
procedures in the order in which they are presented to diagnose a problem with the
server or the SAS HBA:
1. Determine what has changed.
A SAS HBA issue can occur when changes are introduced into an operational
server. If there is a clear cause and effect to a change, back out the change
until a workaround or a fix is available. If the recent change status is unknown,
determine whether any of the following items were added, removed, replaced,
or updated before the problem occurred:
v System Unified Extensible Firmware Interface (UEFI) or basic input/output
system (BIOS) code
v ServeRAID controller BIOS or firmware
v System or ServeRAID device drivers
v Other hardware components
v Other software or device drivers
v Any software configuration changes

Note: IBM does not support reverting to previous versions of SAS HBA BIOS
and firmware packages.
2. Collect data.
Thorough data collection is necessary for effectively diagnosing hardware and
software problems. The following clues are used to determine the best approach
to solving specific problems:
Document event messages, error codes, and system-board LEDs.
v Check the system-events logs for hardware faults within the integrated
management module (IMM), baseboard management controller (BMC), or
Remote Supervisor Adapter (RSA) logs, as applicable to the specific server.
v Check for operating-system event messages.
v Check the MegaRAID Storage Manager (MSM) software for event messages.
v Document the light path diagnostics LEDs and the LEDs for the attached disk
drives.
v Observe the server for POST messages as the server starts.
v Observe and record any suspect controller or hard disk drive behavior.
3. Programmatically collect system data by using IBM Dynamic System
Analysis (DSA).
If a server can boot to the operating system, Dynamic System Analysis (DSA)
can programmatically collect important system and configuration information that
you can use to diagnose the problem.
Run DSA to collect information about the hardware, firmware, software, and
operating system. Have this information available when you contact IBM or an
approved warranty service provider. For more information about running DSA,
see the IBM Dynamic System Analysis Installation and User's Guide, which is
available on the DSA download web page.
If you have to download the latest version of DSA, go to
http://www.ibm.com/systems/support/supportsite.wss/docdisplay?brandind=5000008&lndocid=SERV-DSA.
4. Follow the problem-resolution procedures.
The following five problem-resolution procedures are presented in the order in
which they are most likely to solve your problem. Follow these procedures in the
order in which they are presented:
a. Apply software updates.
IBM incorporates all known fixes into the latest release of software and
firmware for the SAS HBAs. Most known issues are corrected by updates to
the software and device drivers for the hardware components. This is the
first step in eliminating known issues that might be causing problems.
Server software can also affect the behavior of the SAS HBAs. You must
update the server with the latest versions of available software to eliminate
known issues. All system and SAS HBA software updates include “change
history” documentation that describes the changes, fixes, or improvements
that are made to the software. A change history file has a .chg extension.
This file is a plain text downloadable document that is available at the same
location where the updated software is downloaded.

Important: Software and device driver updates are best applied to correct
behavioral problems within the subsystem or to improve stability. If the
server or SAS HBA subsystem is in an Offline or Failed state, it is best not
to attempt any updates to the software until the system and configuration
are stabilized. After a system experiences a failure, it is usually best to bring
the system into an operational state and then apply the software updates.
All SAS HBA software is available on the IBM ServeRAID software matrix
web page at
http://www.ibm.com/systems/support/supportsite.wss/docdisplay?lndocid=SERV-RAID&brandind=5000008.
The following components come with the software:
v Utilities (MSM and command-line tools)
v SAS HBA firmware updates
v Device drivers
v Documentation (user guide and device-driver installation guide)
The following other important software updates are in the server support
section:
v Hard disk drive firmware updates
v Enclosure unit updates
v System software updates
After you apply software updates, observe the SAS HBA for correct
operation. See the next section if the problem is not solved.

b. Controller hardware checkout procedure.
Review the SAS HBA hardware and software configuration for correct
installation.
Safety: Power off the server before you follow these checkout procedures.
v SAS HBA
– Reseat the HBA in the PCI slot.
– Align and secure the chassis brackets in the slot correctly. This is very
important if you are installing the HBA on a riser-card assembly before
you install it in the server.
– Review the server documentation to make sure that the expansion-slot
restrictions are observed. A system might limit the use of some slots
because of thermal issues, fit restrictions, or interference with other
internal components.
v SAS/SATA cables
– Reseat any SAS/SATA cable connections. Each connection must latch
and click into place from the controller to the backplane.
– In simple-swap configurations, the SAS/SATA cables might be attached
directly to the drive or to a simple-swap connector at the back of a
drive cage that is connected to the system board.
– Make sure that each cable has the correct bend radius. Exceeding the
bend radius as outlined in the server documentation can add stress to
the components.
– Make sure that the SAS/SATA cables are not overstretched, nicked, or
damaged.
v Internal power cables
– Backplane power cables are keyed to ensure that they are attached
correctly to the server and the disk drive backplane. Most power
cables latch with a plastic connector. Reseat the backplane power
cables.
– In a simple-swap configuration, the power cables are connected
directly to a drive, or the simple-swap connector at the back of the
drive cage is attached directly to the power supply.
v I2C
The I2C cable is connected from the hot-swap backplane to the system
board. This cable controls the amber LEDs for the hard disk drives and
the out-of-band alert notifications. Reseat the I2C cables.
v Backplanes
Make sure that the backplanes are seated correctly by using the
information in the server documentation. An incorrectly seated or aligned
backplane can cause hard disk drive related problems because of a bad
connection to a disk drive. Inspect the seating of the backplane and
reseat as needed.
v Hard disk drives (including solid state drives)
– Reseat the hot-swap drives against the backplane to make sure that
they are installed correctly.
– A simple-swap server might require removal of the front bezel to gain
access to the hard disk drives to reseat them.

c. Symptom-based problem determination
v Go to “Hard disk drive LED-to-action” on page 19.
v Go to “POST messages-to-action” on page 21.
v Go to “Event messages-to-actions” on page 27.
v Go to “Symptoms-to-actions” on page 50.
v Go to “System events-to-actions index” on page 53.
v Check for updated troubleshooting procedures and RETAIN tips.
d. RETAIN tips
Troubleshooting procedures and RETAIN tips document known problems
and suggested solutions. To search for troubleshooting procedures and
RETAIN tips, complete the following steps.

Note: Changes are made periodically to the IBM Web site. The actual
procedure might vary slightly from what is described in this document.
1) Go to http://www.ibm.com/systems/support/.
2) Under Product support, click System x.
3) From the Product family list, select the server.
4) Under Support & downloads, click Troubleshoot.
5) Select the troubleshooting procedure or RETAIN tip that applies to your
problem:
v Troubleshooting procedures are under Diagnostic.
v RETAIN tips are under Troubleshoot.
e. Check for and replace defective hardware.
v Replace hardware determined to be defective using the problem
determination procedures.
v See Chapter 8, “Replaceable components,” on page 55 for more details.

Chapter 3. Understanding the operating environment
This chapter contains information about the SAS HBA operating environment.

Overview
Read the following information about the SAS HBA operating environment:
v The ServeRAID H1110 supports up to two RAID volumes (RAID-0, RAID-1, and
RAID-10) and supports hard disk drives that operate as individual physical
SAS/SATA disks. This is sometimes known as operating the drives in a JBOD
(just a bunch of disks) configuration.
v The following IBM SAS HBAs support only individual physical SAS/SATA disks in
JBOD configurations:
– IBM 6 Gb Performance Optimized HBA
– IBM 6 Gb SAS HBA
v The ServeRAID H1110 supports the following RAID levels. For more information
about RAID levels, see the documentation that comes with the controller.
– RAID-0 volumes are not redundant and provide no protection against a single
disk failure.
– Integrated mirroring is a RAID-1 volume and consists of a simple mirror of two
drives providing redundancy if the RAID volume experiences a single disk
failure.
– Integrated mirroring enhanced is a RAID-1E volume that can configure an odd
number of drives into a mirrored volume.
– Integrated mirroring with striping is a RAID-10 volume and consists of two
RAID-1 spans, with the usable space striped across them into one virtual disk.
Each RAID-1 span is redundant, which means that a single disk failure can occur
in each span without losing data, but two drive failures in the same RAID-1
span result in a lost RAID volume.
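
The redundancy rules in this list can be illustrated with a short sketch. It is not the controller firmware logic; the function name `volume_survives`, the drive names, and the way drives are grouped into spans are assumptions made only for illustration. RAID-1E is left out because this section does not describe its layout in enough detail.

```python
from typing import Iterable, Sequence


def volume_survives(raid_level: str,
                    spans: Sequence[Sequence[str]],
                    failed: Iterable[str]) -> bool:
    """Return True if the volume still holds its data after the given failures.

    Hypothetical model: `spans` lists the drive names in each span.
    RAID-0 has no redundancy, RAID-1 is a single two-drive mirror, and
    RAID-10 stripes across several two-drive mirrors.
    """
    failed_set = set(failed)
    if raid_level == "RAID-0":
        # No redundancy: any failed member loses the volume.
        return not any(d in failed_set for span in spans for d in span)
    if raid_level in ("RAID-1", "RAID-10"):
        # Each mirror span tolerates one failure; a span with no
        # surviving drive means the volume is lost.
        return all(any(d not in failed_set for d in span) for span in spans)
    raise ValueError(f"unsupported RAID level: {raid_level}")


raid10 = [["disk0", "disk1"], ["disk2", "disk3"]]
print(volume_survives("RAID-10", raid10, ["disk0", "disk2"]))  # one failure in each span
print(volume_survives("RAID-10", raid10, ["disk0", "disk1"]))  # two failures in one span
```

The first call models one failed drive in each RAID-1 span, which the volume tolerates; the second models two failures in the same span, which results in a lost volume.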

Using the PDSG in a mixed RAID and non-RAID environment (configured with a ServeRAID H1110)
For the ServeRAID H1110, the operating environment might be a mix of RAID
volumes and one or more non-RAID drives on the same controller. Even though the
hard disk drives might be attached to the same controller, hard disk drive issues are
reported differently between drives configured in a RAID volume and a physical
drive.

To maintain data coherency between hard disk drives configured in a RAID-1 or
RAID-10 volume, the drives must conform to error recovery standards defined by
the RAID controller firmware. For example, if two drives are configured in a mirror,
and one drive has an unrecoverable error or a series of errors that cannot be
corrected, the RAID controller rejects the drive from the RAID volume by marking
the drive as Failed. The unrecovered errors create inconsistencies between the
data on the two drives in the mirror. If the RAID controller cannot correct the
inconsistencies within a short period of time, the problem drive must be removed
from the mirrored drive group. The RAID controller programmatically makes these
decisions to ensure data consistency within the RAID volume. The resulting failed
drive is a logical drive state used to ensure the RAID logic does not try to use a
problematic drive again. From the operating system perspective, problems with a
single drive in a RAID volume do not typically generate errors while good data is
available from the redundant disk in the mirror.

Physical hard disk drives operating without RAID do not have logical states. Drives
in this configuration are treated as standard SAS or SATA devices. Controllers and
hard disk drives have logic to try to recover from problems to continue operation,
but without redundancy, data might be lost. It is important to note that the
ServeRAID H1110 will continue to work with the device as long as the controller and
drive can communicate. The ServeRAID H1110 will not mark a physical drive as
Failed unless it is in a RAID volume.

To be effective with problem determination, these operating environments must be
approached differently. Troubleshooting a configuration with a RAID volume is best
accomplished by using the utilities and features available for the RAID controller.
When you troubleshoot physical hard disk drives, use the hard disk drive diagnostic
tools that are available for IBM servers (for example, after you start the server,
press F2 to start DSA Preboot and view the event logs) or the feature utilities
provided in the BIOS configuration utility (CTRL+C). RAID controllers are capable of
doing many of the same things that the diagnostic tools accomplish during normal
operation.

IBM servers that have a hard disk drive backplane


IBM servers that have a hard disk drive backplane have several advantages over
systems that only support simple-swap. A drive backplane adds support for
hot-swap drives, which means the ability to insert and remove the disks from the
front of the server under specific circumstances. Another important advantage is the
drive LEDs that offer visual indicators for the current status of each hard disk drive.
For more information, see Table 1 on page 19.

The following hot-swap events are supported:


v Removal of a failed or unconfigured hot-swap hard disk drive
v Installation of a replacement hot-swap hard disk drive
v Installation of a new hot-swap hard disk drive into an empty drive bay

Note: The removal of an operational hot-swap hard disk drive is not supported.

IBM servers that support simple-swap hard disk drives


IBM servers that support simple-swap hard disk drives have guide rails in the drive
bay that help you insert the drive into the cable connector at the back of the drive
cage. You must install a simple-swap hard disk drive when the server is powered
off. There is no backplane or electronics to enable simple-swap drives to be
installed when the server is turned on. Installing or removing simple-swap drives
with the server turned on is not supported.

Chapter 4. ServeRAID H1110 features
The ServeRAID H1110 SAS/SATA Controller for IBM System x uses the following
features and technologies:
v Resynchronization
The ServeRAID H1110 is designed to automatically resynchronize degraded
RAID volumes in the background when a hot spare or a replacement disk is
detected. While the resynchronization is running, the RAID volume continues to
be accessible for normal operation.
v Hot-swap disk replacement
The ServeRAID H1110 controller supports hot-swap disk replacement, and
automatically resynchronizes hot-swapped disks in the background without any
host or user intervention when the replacement drive meets the disk capacity and
drive type requirements. Drive type requirements are such that SAS and SATA
drives cannot be mixed within the same RAID volume. The controller detects
hot-swap drive removal and new drive installation events with a supported
hot-swap backplane.

Note: If a hot spare is configured, a rebuild to the spare begins automatically
and the replacement drive becomes the new hot spare.
After a hot-swap event, the firmware makes sure that the new physical disk has
enough capacity for the mirrored volume. The firmware resynchronizes all
replaced hot-swapped disks, even if the same disk is reinserted. In a mirrored
volume with an even number of disks, the firmware marks the hot-swapped disk
as a secondary disk and the other disk with data as the primary disk. The
firmware resynchronizes all data from the primary disk onto the new secondary
disk. In a mirrored volume with an odd number of disks, primary and secondary
sets include three disks instead of two.
v Simple-swap disk replacement
In simple-swap configuration, a failed disk must be replaced while the server is
powered off. When the server is powered on, the controller detects the replaced
disk during startup and automatically begins a rebuild to the new drive, if the
replacement drive meets the disk capacity and drive type requirements. Drive
type requirements are defined such that SAS and SATA drives cannot be mixed
within the same RAID volume.

Note: If a hot-spare drive is configured, a rebuild to the spare begins
automatically and the replacement drive becomes the new hot-spare drive.
v Hot-spare drives
You can configure hot-spare drives to protect data on the mirrored volumes. Up
to two global hot spares can be configured on the IBM SAS HBAs. If the
integrated RAID firmware fails one of the mirrored drives, it automatically
replaces the failed drive with a hot-spare drive and then resynchronizes the
mirrored data. The firmware automatically receives a notification when a failed
drive is replaced by a hot-spare drive, and it then designates that drive as the
new hot-spare drive. The firmware periodically checks a hot-spare rebuild
process so the rebuild can continue from where it stopped, if the server is
restarted before the rebuild is completed.

v Online capacity expansion (OCE)
The OCE feature enables you to expand the capacity of an existing two-disk
integrated mirroring (RAID-1) volume by replacing the original hard disk drives
with higher capacity drives that have the same drive type (SAS or SATA).

Note: The new drives must have at least 50 GB more capacity than the original
drives of the volume.
After you replace the hard disk drives and run the OCE command, you must use
an independent software vendor tool that is specific to the operating system to
move or increase the size of the partition on the volume.
v Disk write caching
By default, the integrated RAID firmware disables disk write caching for mirrored
volumes. It does this to make sure that data is not lost during an unexpected
power outage. Do not enable write caching because it significantly increases the
risk of data loss if an unexpected power outage occurs.
v Background initialization (BGI)
BGI is the process of copying data from primary to secondary disks in a mirrored
volume. The integrated RAID firmware starts BGI automatically as a background
task when it creates a new RAID volume. The volume remains in the Optimal
state while BGI is in progress.
v Consistency check
A consistency check is a background process that reads data from primary and
secondary disks in a mirrored volume, and compares it to make sure that the
data is identical on both disks. Any inconsistencies are corrected if they are
found.
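
The replacement and expansion rules in this chapter (matching drive type, sufficient capacity, and at least 50 GB of added capacity for OCE) can be sketched as simple checks. This is only an illustration and not the firmware behavior: the `Drive` class and both function names are hypothetical, and treating "enough capacity" as "at least as large as the existing member" is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Drive:
    drive_type: str   # "SAS" or "SATA"
    capacity_gb: int


def can_rebuild_onto(volume_member: Drive, replacement: Drive) -> bool:
    """SAS and SATA cannot be mixed in a volume, and the replacement must be
    large enough (assumed here: at least as large as the existing member)."""
    return (replacement.drive_type == volume_member.drive_type
            and replacement.capacity_gb >= volume_member.capacity_gb)


def can_expand_with(original: Drive, new: Drive, min_gain_gb: int = 50) -> bool:
    """OCE: same drive type and at least 50 GB more capacity than the original."""
    return (new.drive_type == original.drive_type
            and new.capacity_gb - original.capacity_gb >= min_gain_gb)


member = Drive("SAS", 300)
print(can_rebuild_onto(member, Drive("SAS", 300)))   # same type, same size
print(can_rebuild_onto(member, Drive("SATA", 500)))  # drive type mismatch
print(can_expand_with(member, Drive("SAS", 340)))    # only 40 GB larger
print(can_expand_with(member, Drive("SAS", 600)))    # 300 GB larger
```

In this model a mixed-type replacement is rejected regardless of its size, which mirrors the drive type requirement stated for both hot-swap and simple-swap replacement.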

Chapter 5. Physical hard disk drive utilities
The ServeRAID H1110 and the IBM SAS HBA feature several utilities that are
designed to work with physical hard disk drives. The tools are accessible through
the BIOS configuration utility for the SAS HBA.
v Press CTRL+C, select HBA, and select SAS Topology. Expand Direct Attached
Devices and select the disk device. Press ALT+D to open the Device Properties
window and select the utility.

Note: These tools are not available in the UEFI Human Interface Infrastructure.
v Format. The format tool is a very robust method to low-level format a hard disk
drive. You cannot cancel the format tool after it is started. While the format tool is
running, all errors are handled and corrected or the format process fails. When
you run the format tool, all data on the hard disk drive is permanently erased. If a
hard disk drive fails to successfully complete a low-level format, the drive is bad.
If the drive successfully completes the formatting process, the drive is usually
good.
v Verify. The verify tool performs a non-destructive read test on every sector of the
hard disk drive and all errors are handled and corrected or the verify process
fails. If the verify process is successful, there is a high degree of confidence that
the drive is good with the caveat that no writes were performed during the test.
An unsuccessful verify process indicates read problems that cannot be corrected.
The drive is going bad and data loss is likely.

Note: The verify tool is available only for SAS hard disk drives.

These utilities are accessible only while the server is offline; therefore, a
maintenance window of several hours is usually required to perform these tests.

Chapter 6. General problem determination tips
For problem determination, use the following general tips:
v Using consistency checks to diagnose RAID volumes on the ServeRAID
H1110
For RAID volumes, running a periodic consistency check is important to make
sure that drive maintenance occurs and that every sector receives attention from
the controller. Because this background operation periodically corrects
inconsistencies, it reduces the number of errors that might occur when an
application needs the data. The automated media verification is the key. If the hard disk
drives in a RAID volume successfully complete a consistency check without a
disk failure, the drives are usually good. Serious drive problems are reported by
the ServeRAID H1110 to the MegaRAID Storage Manager (MSM) software as
event messages. After the consistency check is completed, you can review the
event messages in MSM to see if any serious events were reported; however,
these events do not determine if the drive is good. The RAID subsystem,
including hard disk drives, firmware, and device drivers, is designed to handle
and correct many error conditions. Severe errors cause the hard disk drive to be
marked as Failed.
v Hard disk drive Predictive Failure Analysis (PFA)
This specification is designed into the hard disk drives to internally monitor and
diagnose a likely failure in the near future. When a hard disk drive flags itself with
a PFA, meaning that it expects to fail soon, the drive periodically sends a
message to the controller with this status. A PFA alert is displayed in several
ways:
– The system management (IMM) logs, if the server has a hot-swap backplane
– The MSM software
– SMART alert in the respective operating system event logs
The hard disk drive that issues the PFA alert might fail at any time and can be
replaced as a warranty action, as applicable.

Chapter 7. Problem determination procedures
Problem determination procedures have several starting points, depending on the
indicator that alerts you to a problem within the subsystem. The troubleshooting
paths are as follows:
v Light path diagnostics LEDs-to-actions
v POST messages-to-actions
v Event messages-to-actions
v Symptom-to-actions
v System events-to-actions

If you cannot diagnose and correct a problem by using the information in this
chapter, see Appendix A, “Getting help and technical assistance,” on page 57 for
more information.

Hard disk drive LED-to-action


Light path diagnostics LEDs on the front panel of the server indicate symptoms
within the entire system; however, this Problem Determination and Service Guide
focuses only on the LEDs that are relevant to the storage subsystem. The front
panel display and bezel LEDs are used to solve hard disk drive problems.

If the hard disk drive status LED is lit, it means that an out-of-band alert for the
RAID controller was posted to the system-event logs. These messages are helpful
for remote administration and alert automation; however, when you are
troubleshooting hard drive issues from the front of the system, use the following
table to review the LED behaviors and take the applicable actions.
Table 1. Hard disk drive LED-to-action
Symptom: A hard disk drive has failed, and the associated amber hard disk drive status LED is steady.
Action:
1. Replace the failed hard disk drive that has an amber LED that is lit.
2. Observe the drive LEDs for normal operation. The amber LED turns off, and
the green activity LED flashes while the hard disk drive is accessed by the
controller.

Symptom: An installed hard disk drive is not recognized.
Action:
1. Is the hard disk drive amber LED lit or off?
v If the LED is lit, it indicates a drive fault.
v If the LED is off, the drive is working correctly.
2. If the amber LED is lit, remove the drive from the bay, wait 45 seconds, and
reinsert the drive, making sure that the tray latches correctly to the system
chassis.
3. Observe the associated green hard disk drive activity LED and the amber
status LED:
v If the green activity LED is flashing and the amber status LED is not lit, the
drive is recognized by the controller and is working correctly. Run Dynamic
System Analysis (DSA) to determine whether the drive is detected.
v If the green activity LED is flashing and the amber status LED is flashing
slowly, the drive is recognized by the controller and is rebuilding.
v If neither LED is lit or flashing when the drive is inserted, the hard disk
drive backplane might not have the correct power (go to step 4).
v If the green activity LED is flashing and the amber status LED is lit, replace
the drive. If the activity of the LEDs remains the same, go to step 4. If the
activity of the LEDs changes, return to step 1.
4. Make sure that the hard disk drive backplane is correctly seated. When it is
correctly seated, the drive assemblies correctly connect to the backplane
without bowing or causing movement of the backplane.
5. Move the hard disk drives to different bays to determine whether the drive or
the backplane is not functioning.
6. Reseat the backplane power cables. Make sure that the cables are connected
from the backplane to the server correctly. These black cables are keyed for
correct installation and they latch when connected securely. Repeat steps 1
through 3.
7. Reseat the SAS/SATA cable connections. SAS/SATA cables latch and click
when securely connected to the controller, backplane, or hard disk drive.
Repeat steps 1 through 3.
8. Suspect the backplane signal cable or the backplane. If the server has eight
hot-swap bays:
a. Replace the affected SAS/SATA cable.
b. Replace the affected SAS backplane.
9. Run the DSA tests for the SAS controller and hard disk drives:
v If the controller passes the test but the drives are not recognized, replace
the backplane signal cable and run the tests again.
v Replace the backplane.
v If the controller fails the test, disconnect the backplane signal cable from
the controller and run the tests again.
v If the controller fails the test, replace the controller.
10. If the problems cannot be corrected with these steps, contact IBM support.
Symptom: Multiple hard disk drives fail.
Action: Make sure that the hard disk drive, SAS RAID controller, and server device drivers
and firmware are at the latest level.
Important: Some cluster solutions require specific code levels or coordinated
code updates. If the device is part of a cluster solution, verify that the latest level of
code is supported for the cluster solution before you update the code.

Symptom: Multiple hard disk drives are offline.
Action:
1. Review the storage subsystem logs for indications of problems within the
storage subsystem, such as backplane or cable problems.
2. To identify the cause of the problem, collect the storage subsystem logs and
contact IBM support to review the logs and determine the corrective actions.
Symptom: A replacement hard disk drive does not rebuild.
Action:
1. Make sure that the hard disk drive is recognized by the controller (the green
hard disk drive activity LED is flashing).
Compare the bezel location to the Preboot GUI utility or the MegaRAID Storage
Manager (MSM), or use the CLI command to determine the current state of the
device. You might have to configure the device before you use it.
2. If the Automatic Rebuild feature is disabled, the replacement drive will not
rebuild. You must configure the replaced drive as a spare.
3. The disk group/virtual drive (DG/VD) might have been protected by a hot spare
and rebuilt to an alternative device. Check the configuration and determine
whether the DG/VD is still degraded or whether another device is rebuilding.
4. Evaluate the hard disk drive LEDs by using the instructions in “An installed
hard disk drive is not recognized” in this table.
Symptom: An amber hard disk drive status LED does not accurately represent the actual state of the associated drive.
Action:
1. If the amber hard disk drive LED and the RAID controller software do not
indicate the same status for the drive, complete the following steps:
a. Turn off the server.
b. Reseat the SAS controller.
c. Reseat the SAS/SATA cables.
d. Reseat the power cable connections to the backplane.
e. Reseat the SAS expander cables, if any are present.
f. Reseat the I2C cable from the backplane to the server.
g. Reseat the hard disk drive.
h. Turn on the server and observe the activity of the hard disk drive LEDs for
normal operation.
i. Move the drive to another bay, if possible, to see whether the symptom
stays with the drive.
2. If the problem remains:
v Replace the I2C cables.
v Replace the backplane.
3. If the problems cannot be corrected with these steps, contact IBM support.
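
The LED combinations described in step 3 of the "not recognized" procedure in Table 1 can be written as a small lookup, which may help when scripting a checklist or training material. This is a sketch only: the function name and the string labels for LED states ("off", "on", "flashing", "flashing slowly") are hypothetical and are not produced by any IBM tool.

```python
def interpret_drive_leds(green: str, amber: str) -> str:
    """Map an observed green activity LED and amber status LED combination
    to the interpretation given in step 3 of Table 1.

    LED state labels are hypothetical: "off", "on", "flashing", "flashing slowly".
    """
    if green == "flashing" and amber == "off":
        return "recognized and working; run DSA to confirm that the drive is detected"
    if green == "flashing" and amber == "flashing slowly":
        return "recognized and rebuilding"
    if green == "off" and amber == "off":
        return "backplane might not have correct power; go to step 4"
    if green == "flashing" and amber == "on":
        return "replace the drive"
    return "combination not covered by the table; continue with the remaining steps"


print(interpret_drive_leds("flashing", "flashing slowly"))
print(interpret_drive_leds("off", "off"))
```

Combinations that the table does not describe fall through to a generic result instead of guessing at a cause.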

POST messages-to-action
SAS HBA POST messages are displayed after server power-on but before the
operating system is loaded. POST messages do not appear during runtime
operations, because they usually describe unexpected events that are detected
between the previous shutdown and the most recent power on. Note all POST
messages and follow the suggested actions.

IBM System x servers use two types of system initialization code: older servers use
standard BIOS, and newer servers use UEFI. See your server documentation to
determine which system initialization code is used. These two environments have
different SAS HBA behaviors during the POST process.

In BIOS-based IBM servers, the SAS HBA displays a POST banner. While the
POST banner is displayed, new event messages are displayed to notify you of
events or pauses for events that require user interaction.

IBM UEFI-based servers require an operating system that supports UEFI to take
full advantage of the new specification. Most IBM UEFI-based servers support a
legacy mode that emulates the standard BIOS for backward compatibility with legacy
operating systems that do not support UEFI. When UEFI detects an operating
system that does not support UEFI, the SAS HBA controllers display a POST
banner. If an operating system with native UEFI support is installed, the SAS HBA
might not display a POST banner during normal operation; however, critical POST
event messages are displayed.

Fault codes
The following fault codes might display during POST.
Table 2. Fault codes that display during POST
Fault code Description
0x01 NO_IO_PORT_ASSIGNED
0x02 MPT_FW_FAULT
0x03 NO_IMAGE_FOR_FWDLB
0x04 FWDLB_CHECKSUM_FAILED
0x05 IOC_HW_ERROR
0x06 MPT_FW_COMM_ERROR
0x07 PCI_BUS_MASTER_ERROR
0x08 STR_IMAGE_NOT_FOUND
0x09 STR_MEM_ALLOC_FAILED
0x0A STR_UPLOAD_FAILED
0x0B STR_INVALID_IMAGE
0x0C UNSUPPORTED_IOC_CONFIG
0x0D TIMEOUT_AWAITING_IOC_READY
0x0E TX_DB_HANDSHAKE_ERROR
0x0F RX_DB_HANDSHAKE_ERROR
0x10 NO_MMIO_ADDRESS_ASSIGNED
0x11 IOC_FACTS_FAILURE
0x12 IOC_INIT_FAILURE
0x13 PORT_ENABLE_FAILURE
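
For scripted note-taking, the codes in Table 2 can be kept in a lookup table. The sketch below copies the code and description pairs from Table 2; the names `POST_FAULT_CODES` and `describe_fault` are hypothetical helpers, not part of any IBM or LSI utility.

```python
# Fault codes and descriptions copied from Table 2.
POST_FAULT_CODES = {
    0x01: "NO_IO_PORT_ASSIGNED",
    0x02: "MPT_FW_FAULT",
    0x03: "NO_IMAGE_FOR_FWDLB",
    0x04: "FWDLB_CHECKSUM_FAILED",
    0x05: "IOC_HW_ERROR",
    0x06: "MPT_FW_COMM_ERROR",
    0x07: "PCI_BUS_MASTER_ERROR",
    0x08: "STR_IMAGE_NOT_FOUND",
    0x09: "STR_MEM_ALLOC_FAILED",
    0x0A: "STR_UPLOAD_FAILED",
    0x0B: "STR_INVALID_IMAGE",
    0x0C: "UNSUPPORTED_IOC_CONFIG",
    0x0D: "TIMEOUT_AWAITING_IOC_READY",
    0x0E: "TX_DB_HANDSHAKE_ERROR",
    0x0F: "RX_DB_HANDSHAKE_ERROR",
    0x10: "NO_MMIO_ADDRESS_ASSIGNED",
    0x11: "IOC_FACTS_FAILURE",
    0x12: "IOC_INIT_FAILURE",
    0x13: "PORT_ENABLE_FAILURE",
}


def describe_fault(code: int) -> str:
    """Return the Table 2 description for a POST fault code (hypothetical helper)."""
    name = POST_FAULT_CODES.get(code)
    if name is None:
        return f"0x{code:02X}: not listed in Table 2"
    return f"0x{code:02X}: {name}"


print(describe_fault(0x0D))
print(describe_fault(0x20))
```

A code that is not in the table is reported as unlisted so that it is not mistaken for a known fault.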

If one of the fault codes in Table 2 is displayed during POST, try to recover the
server by completing the following steps:
1. Use the controller hardware checkout procedure (see step 4b on page 9) and
then check for correct operation.
2. If you can access the BIOS Configuration Utility, reset the controller to the
default settings by completing the following steps:
a. Press CTRL+C.
b. Press ALT+N and select Global Properties → Restore Defaults → Save
settings and Exit.

Note: RAID volumes on the ServeRAID H1110 are not modified or removed
with the previous action.
After the controller is reset, check for correct operation.
3. If the server can boot to an operating system, flash the controller to the latest
version of software and firmware or reflash the controller to the same code
levels, then retry for normal operation.
4. Power off the server and temporarily remove all cables from the controller.
Power on the server and check for correct operation, or determine whether the
fault continues to occur or has changed.
v If the fault continues without cables or hard disk drives attached, replace the
controller.
v If the fault does not occur, power off the server and reattach one cable at a
time, restarting the server after each cable to check whether the problem
returns. Determine whether the fault is isolated to a controller channel, cable,
or hard disk drive by modifying the configuration and swapping cables,
channels, and drives. Replace the component that has failed.

Boot messages
The SAS HBA boot messages are described in the following table.
Table 3. SAS HBA boot messages
Event ID STR_SAS_ADDRESS_ZERO
Message displayed SAS Address NOT programmed on controller in slot #
Suggested actions This message is displayed if no SAS address can be obtained for the adapter.
Controllers require a SAS address to operate correctly. The address is lost. Under
normal usage, this usually causes a hardware malfunction. If this occurred during a
firmware update, retry the update, and then replace the controller if the error persists.

Event ID STR_INSTALL_FAIL
Message displayed LSI Corporation MPT boot ROM, no supported devices found!
Suggested actions This message is displayed if the BIOS does not discover any devices capable of INT13
control on any compatible adapter initialized in its adapter scan.

Event ID STR_UNSUPPORTED_DEVICE
Message displayed One or more unsupported device detected!
Suggested actions This message is displayed if discovery status for a port detects a device that firmware
has flagged as unsupported. Check for unsupported devices that are attached to the
controller.

Event ID STR_DEVICE_NOT_AVAILABLE
Message displayed Device not available at <Bus/TID/LUN>
Suggested actions This message is displayed when the core BIOS fails to get a device to spin up in enough
time to access its information, and the BIOS has been configured to flag this condition
as a hard error.

Event ID STR_DEVICES_SPINNING_UP
Message displayed Devices in the process of spinning up!
Suggested actions This message is displayed when the core BIOS fails to get a device to spin up in enough
time to access its information, and the BIOS has been configured to flag this condition
as a warning.

Event ID STR_BOOT_DEVICE_SPINNING_UP
Message displayed Please wait, spinning up the boot device!
Suggested actions This message is displayed if the first INT13 device controlled by the server BIOS
requires a command to be issued to it before the device is ready for I/O activity. Check
for compatibility of the controller in the server. Restart the server and then retry. Update
the system code and then retry.

Event ID STR_TOO_MANY_DEVICES
Message displayed Failed to add device, too many devices!
Suggested actions This message is displayed if the maximum number of devices that the BIOS can support
is reached, and there are additional devices remaining to be scanned. Usually this is the
case when a large number of INT13 devices are connected to the controller and you
attempt to boot from a CD drive (that is also connected to the controller). Remove any
unsupported devices and make sure that only the supported numbers of devices are
attached. For more information, see the controller User's Guide.

Event ID STR_BUS_MASTER_ERROR
Message displayed Bus Master ERROR!
Suggested actions This message is displayed if the PCI bus mastering bit was not enabled by the BIOS.

Event ID STR_ADAPTER_MALFUNCTION
Message displayed ERROR! Adapter Malfunctioning!
Suggested actions This message is displayed when the core BIOS fails to get the firmware into an
operational state (including performing a hardware reset). Flash or reflash the controller
to the latest code, if possible. If the problem persists, replace the controller.

Event ID STR_ADAPTER_REMOVED
Message displayed Adapter removed from boot order!
Suggested actions This message is displayed if the boot order detected on the first adapter contains invalid
or missing entries. If multiple adapters are installed, check or configure the server and
controller boot options, and retry for normal operation.

Event ID STR_BOOT_ORDER_INVALID
Message displayed Updating Adapter List!
Suggested actions This message is displayed if multiple adapters are installed and the first adapter in the
boot order sequence does not contain a valid boot device. The first adapter will perform
an update to its internal adapter list so that the next adapter in the boot order can start
the server.

If this message continues to be displayed, check and configure the server boot options.
To change the startup controller or disk, press F1 at server startup and select Boot
Manager.

Event ID STR_ADAPTER_DISABLED
Message displayed Adapter(s) disabled by user
Suggested actions This message is displayed when an adapter is detected that is intentionally configured to
be disabled from BIOS control by settings in the BIOS Configuration Utility (CU).

Event ID STR_IR_EXCEPTION
Message displayed Integrated RAID exception detected: Volume (Hdl:###) is in state N
Suggested actions This message is displayed if a volume is detected in a non-optimal state. The ### is the
internal device handle assigned to the volume. The value for “N” can be either Inactive,
followed by the specific reason for the volume being in an Inactive state, or it can be the
Non-Optimal state value that the volume is currently reporting.

Event ID STR_IR_VOL_FOREIGN_METADATA
Message displayed WARNING! Foreign Metadata detected
Suggested actions This message is displayed only if the integrated RAID firmware detects metadata on a
device that is not compatible with the current firmware implementation.
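
If boot messages are captured as text (for example, from a remote console log), the Integrated RAID exception message in Table 3 can be split into its volume handle and state. This is a minimal sketch under stated assumptions: the exact characters that replace `###` and `N` are not specified in this guide, so the pattern accepts any text, and the sample line is made up for illustration.

```python
import re
from typing import Optional, Tuple

# Pattern follows the message template shown in Table 3:
#   Integrated RAID exception detected: Volume (Hdl:###) is in state N
IR_EXCEPTION = re.compile(
    r"Integrated RAID exception detected: "
    r"Volume \(Hdl:(?P<handle>[^)]+)\) is in state (?P<state>.+)"
)


def parse_ir_exception(line: str) -> Optional[Tuple[str, str]]:
    """Return (handle, state) if the line matches the template, else None."""
    match = IR_EXCEPTION.search(line)
    if match is None:
        return None
    return match.group("handle"), match.group("state").strip()


# Hypothetical sample line; real handle and state values might look different.
sample = "Integrated RAID exception detected: Volume (Hdl:001) is in state Inactive"
print(parse_ir_exception(sample))
print(parse_ir_exception("Bus Master ERROR!"))
```

Lines that do not match the template return `None`, so other boot messages pass through untouched.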

Event messages-to-actions
Event messages are found in the MegaRAID Storage Manager (MSM) application.
Events that are logged into an operating-system event log usually have a
correlating MSM event log entry. This section lists the MSM events that might
appear in the event log.

MSM software monitors the activity and performance of all controllers in the server
and the devices that are attached to them. When an event occurs, such as the start
of an initialization, an event message is displayed in the log at the bottom of the
MSM window.

Each message in the event log has an event type that indicates the severity of the
event, as shown in the following table.

Table 4. MSM event types and descriptions

Event type    Description
Information   Informational message. No user action is necessary.
Warning       Some component might be close to a failure point.
Critical      A component has failed, but the server has not lost data.
Fatal         A component has failed, and data loss has occurred or will occur.
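
When an exported event log is triaged with a script, the event types in Table 4 can be treated as an ordered scale. The sketch below is illustrative only: the numeric ranks are an assumption based on the order of the table, and the names `EventType` and `classify` are hypothetical. Table 5 also uses the types Caution and Progress, which Table 4 does not define, so the sketch leaves them unranked instead of guessing.

```python
from enum import IntEnum


class EventType(IntEnum):
    """Event types from Table 4, ranked in table order (assumed severity)."""
    INFORMATION = 0
    WARNING = 1
    CRITICAL = 2
    FATAL = 3


def classify(type_name):
    """Map a type name from the event log to an EventType, or None if unranked."""
    return EventType.__members__.get(type_name.strip().upper())


if __name__ == "__main__":
    for name in ("Information", "Warning", "Critical", "Fatal", "Caution"):
        print(name, classify(name))
```

Returning `None` for an unranked type lets the caller decide how to treat it, for example by always showing such events to the operator.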

All of the MSM event messages are listed in the following table. Each event
description includes one or more placeholders for specific values that are
determined when the event is generated. For example, in message 0x0001 in
Table 5, the value %s is replaced by the firmware version, which is read from the
firmware when the event is generated.
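
The placeholder substitution described above can be illustrated with a short sketch. It assumes that the placeholders follow C printf conventions, which Python's `%` operator also accepts; the template strings are copied from Table 5, while the function name and the argument values are made up for illustration.

```python
# Event templates copied from Table 5, keyed by event number.
TEMPLATES = {
    0x0001: "MegaRAID firmware version %s",
    0x0006: "Virtual drive %s ownership changed from %02x to %02x",
    0x0009: "Background initialization rate changed to %d%%",
}


def render_event(number, *values):
    """Fill the printf-style placeholders of an event template with values."""
    return TEMPLATES[number] % values


if __name__ == "__main__":
    # The values below are hypothetical examples, not real log data.
    print(render_event(0x0001, "1.2.3"))
    print(render_event(0x0006, "0", 0, 1))
    print(render_event(0x0009, 30))
```

In this convention, `%s` takes a string, `%d` a decimal integer, `%02x` a two-digit hexadecimal value, and `%%` produces a literal percent sign.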
Table 5. MSM event messages-to-action
Number Type Event description Suggested actions
0x0001 Information MegaRAID firmware version %s
0x0004 Information Configuration cleared
0x0005 Warning Cluster down; communication with peer lost
    Suggested actions: Clustering is not supported by IBM.
0x0006 Information Virtual drive %s ownership changed
from %02x to %02x
0x0007 Information Alarm disabled by user
0x0008 Information Alarm enabled by user
0x0009 Information Background initialization rate
changed to %d%%
0x000a Fatal Controller cache discarded due to memory/battery problems
    Suggested actions: The message is probably the result of a bad battery.
    1. Replace the battery.
    2. Replace the controller.
0x000c Information Cache data recovered successfully
0x000d Fatal Controller cache discarded due to firmware version incompatibility
    Suggested actions: The cache write operations are firmware sensitive and might not be
    compatible with different versions of code. Return the controller to the previously
    used firmware version and retry. Update the adapter and enclosure unit firmware.

0x000e Information Consistency check rate changed to %d%%
0x000f Fatal Fatal firmware error: %s
    Suggested actions: The firmware error %s states which device had the error. This is
    expected to be a controller event.
    1. Update the adapter and enclosure unit firmware.
    2. Replace the device.
0x0010 Information Factory defaults restored
0x0011 Information Flash downloaded image corrupt
0x0012 Caution Flash erase error
    Suggested actions: Update the adapter and enclosure unit firmware.
0x0013 Caution Flash timeout during erase
    Suggested actions: Update the adapter and enclosure unit firmware.
0x0014 Caution Flash error
    Suggested actions: Update the adapter and enclosure unit firmware.
0x0015 Information Flashing image: %s
0x0016 Information Flash of new firmware images
complete
0x0017 Caution Flash programming error
    Suggested actions: Update the adapter and enclosure unit firmware.
0x0018 Caution Flash timeout during programming
    Suggested actions: Update the adapter and enclosure unit firmware.
0x0019 Caution Flash chip type unknown
    Suggested actions: Update the adapter and enclosure unit firmware.
0x001a Caution Flash command set unknown
    Suggested actions: Update the adapter and enclosure unit firmware.
0x001b Caution Flash verify failure
    Suggested actions: Update the adapter and enclosure unit firmware.
0x001c Information Flush rate changed to %d seconds
0x001d Information Hibernate command received from
host
0x001e Information Event log cleared
0x001f Information Event log wrapped
0x0020 Fatal Multi-bit ECC error: ECAR=%x, ELOG=%x, (%s)
    Suggested actions: A multi-bit ECC error refers to the memory cache on the controller.
    Replace the controller.
0x0021 Warning Single-bit ECC error: ECAR=%x, ELOG=%x, (%s)
    Suggested actions: A single-bit ECC error refers to the memory on the controller;
    however, the ECC recovered from the error. Replace the controller if the event repeats
    on a regular basis. By design, the controller memory can recover from a single-bit
    error.
0x0022 Fatal Not enough controller memory
    Suggested actions: Replace the controller.
0x0023 Information Patrol Read complete
0x0024 Information Patrol Read paused
0x0025 Information Patrol Read Rate changed to %d%%

0x0026 Information Patrol Read resumed
0x0027 Information Patrol Read started
0x0028 Information Rebuild rate changed to %d%%
0x0029 Information Reconstruction rate changed to
%d%%
0x002a Information Shutdown command received from
host
0x002b Information Test event: %s
0x002c Information Time established as %s; (%d
seconds since power on)
0x002d Information User entered firmware debugger
0x002e Warning Background Initialization aborted on %s
    Suggested actions: Investigate other events to determine the cause of this event. A
    procedural, environmental, or physical problem within the subsystem might have caused
    this event. This is usually a symptom of another problem.
0x002f Warning Background Initialization corrected medium error (%s at %lx)
    Suggested actions: By design, the controller and, usually, the hard disk drive correct
    medium errors. No data is lost with a redundant virtual disk, but there might be a
    small exposure to data loss in a RAID-0 configuration when the physical medium error
    is corrected but the data that was stored at the location was not recovered. The
    controller automatically corrects this exposure within redundant virtual disks.
0x0030 Information Background Initialization completed
on %s
0x0031 Fatal Background initialization completed with uncorrectable errors on %s
    Suggested actions: Replace hard disk drive %s.
0x0032 Fatal Background initialization detected uncorrectable double medium errors (%s at %lx on %s)
    Suggested actions: If the events are targeted to the same hard disk drive, replace the
    drive. If the events point to two or more drives, investigate other events to
    determine the cause of this event. A procedural, environmental, or physical problem
    within the subsystem might have caused this event. This might be a symptom of another
    problem.
    1. Evaluate previous events to determine trending problems with physical devices.
    2. If trending problems span multiple devices, check and reseat cable and device
       connections.
    3. If trending problems are isolated to one device, replace that device.
    4. Manually begin a consistency check and allow that process to be completed.
    5. Evaluate the actions and conditions that exhibited the problem, or observe for
       normal behavior.

0x0033 Caution Background Initialization failed on %s
    Suggested actions: If the events are targeted to the same hard disk drive, replace the
    drive. If the events point to two or more drives, investigate other events to
    determine the cause of this event. A procedural, environmental, or physical problem
    within the subsystem might have caused this event. This might be a symptom of another
    problem.
    1. Evaluate previous events to determine trending problems with physical devices.
    2. If trending problems span multiple devices, check and reseat cable and device
       connections.
    3. If trending problems are isolated to one device, replace that device.
    4. Manually begin a consistency check and allow that process to be completed.
    5. Evaluate the actions and conditions that exhibited the problem, or observe for
       normal behavior.
0x0034 Progress Background initialization progress on
%s is %s
0x0035 Information Background initialization started on
%s
0x0036 Information Policy change on %s from %s to %s
0x0038 Warning Consistency Check aborted on %s
    Suggested actions: A consistency check automatically aborts if the disk group or
    virtual disk becomes critical or offline for some other reason, or if a user modifies
    an existing virtual disk. Evaluate the server for these types of changes.
0x0039 Warning Consistency Check corrected medium error (%s at %lx)
    Suggested actions: By design, the controller and, usually, the hard disk drive correct
    medium errors. No data is lost with a redundant virtual disk, but there might be a
    small exposure to data loss in a RAID-0 configuration when the physical medium error
    is corrected but the data that was stored at the location was not recovered. The
    controller automatically corrects this exposure within redundant virtual disks.
0x003a Information Consistency Check done on %s
0x003b Information Consistency Check done with
corrections on %s

0x003c Fatal Consistency Check detected uncorrectable double medium errors (%s at %lx on %s)
    Suggested actions: If the events are targeted to the same hard disk drive, replace the
    drive. If the events point to two or more drives, investigate other events to
    determine the cause of this event. A procedural, environmental, or physical problem
    within the subsystem might cause this event. This might be a symptom of another
    problem.
    1. Evaluate previous events to determine trending problems with physical devices.
    2. If trending problems span multiple devices, check and reseat cable and device
       connections.
    3. If trending problems are isolated to one device, replace that device.
    4. Run a new consistency check and evaluate whether it is completed correctly.
0x003d Caution Consistency Check failed on %s
    Suggested actions: If the events are targeted to the same hard disk drive, replace the
    drive. If the events point to two or more drives, investigate other events to
    determine the cause of this event. A procedural, environmental, or physical problem
    within the subsystem might cause this event. This might be a symptom of another
    problem.
    1. Evaluate previous events to determine trending problems with physical devices.
    2. If trending problems span multiple devices, check and reseat cable and device
       connections.
    3. If trending problems are isolated to one device, replace that device.
    4. Run a new consistency check and evaluate whether it is completed correctly.
0x003e Fatal Consistency Check completed with uncorrectable data on %s
    Suggested actions: Investigate other events to determine the cause of this event. A
    procedural, environmental, or physical problem within the subsystem might cause this
    event. This is typically a symptom of another problem.
    Data has been lost, so the order of recovery is as follows:
    1. Determine the cause of the failure by evaluating other event entries.
    2. Take the applicable corrective action for the failure mode.
    3. Determine what data was lost.
    4. Recover the hardware (re-create logical drives if necessary).
    5. Restore data.

0x003f Warning Consistency Check found inconsistent parity on %s at strip %lx
    Suggested actions: Inconsistencies on a logical drive do not always cause data loss,
    but they can lead to data loss over an extended period of time.
    Investigate other events to determine the cause of this event. A procedural,
    environmental, or physical problem within the subsystem might cause this event. This
    is typically a symptom of another problem.
    1. This event might occur if patrol read is disabled for long periods of time. Enable
       patrol read.
    2. Evaluate previous events to determine trending problems with physical devices.
    3. If trending problems span multiple devices, check and reseat cable and device
       connections.
    4. If trending problems are isolated to one device, replace that device.
    5. Run a new consistency check and evaluate whether it is completed correctly.
0x0040 Warning Consistency Check inconsistency logging disabled on %s (too many inconsistencies)
    Suggested actions: Inconsistencies on a logical drive do not always cause data loss,
    but they can lead to data loss over an extended period of time.
    Investigate other events to determine the cause of this event. A procedural,
    environmental, or physical problem within the subsystem might cause this event. This
    is typically a symptom of another problem.
    1. This event might occur if patrol read is disabled for long periods of time. Enable
       patrol read.
    2. Evaluate previous events to determine trending problems with physical devices.
    3. If trending problems span multiple devices, check and reseat cable and device
       connections.
    4. If trending problems are isolated to one device, replace that device.
    5. Run a new consistency check and evaluate whether it is completed correctly.
0x0041 Progress Consistency Check progress on %s
is %s
0x0042 Information Consistency Check started on %s
0x0043 Warning Initialization aborted on %s
    Suggested actions: This might be the result of an action by firmware or the result of
    user actions. Evaluate the previous procedures that were used to start and stop the
    initialization of a new or existing virtual disk. An initialization action is
    automatically stopped if the array becomes critical or offline or if a user interrupts
    this process.

0x0044 Caution Initialization failed on %s
    Suggested actions: This might be the result of an action by firmware or the result of
    user actions. Evaluate previous procedures that were used to start and stop the
    initialization of a new or existing virtual disk. An initialization action is
    automatically stopped if the array becomes critical or offline, or if a user
    interrupts this process.
0x0045 Progress Initialization progress on %s is %s
0x0046 Information Fast initialization started on %s
0x0047 Information Full initialization started on %s
0x0048 Information Initialization complete on %s
0x0049 Information LD Properties updated to %s (from
%s)
0x004a Information Reconstruction complete on %s
0x004b Fatal Reconstruction of %s stopped due to unrecoverable errors
    Suggested actions: A rebuild stopped abnormally. An environmental, procedural, or
    physical device problem within the subsystem caused this event. Investigate other
    logged events to determine the cause of the problem.
    1. Evaluate previous events to determine trending problems with physical devices.
    2. If trending problems span multiple devices, check and reseat cable and device
       connections.
    3. If trending problems are isolated to one device, replace that device.
    4. Manually begin a new rebuild and allow that process to be completed.
    5. Evaluate the actions and conditions that exhibited the problem, or observe for
       normal behavior.

0x004c Fatal Reconstruct detected uncorrectable double medium errors (%s at %lx on %s at %lx)
    Suggested actions: The rebuild process detected multiple error conditions within the
    disk group, and some data might be lost or inaccessible. This error halts the rebuild
    operation and identifies the drives that are reporting the errors. Investigate other
    logged events to determine trending errors on the identified drives and replace drives
    as necessary.
    1. Evaluate previous events to determine trending problems with physical devices.
    2. If trending problems span multiple devices, check and reseat cable and device
       connections.
    3. If trending problems are isolated to one or several devices, replace the devices.
    4. Determine whether data was lost.
    5. Recover the hardware (re-create logical drives if necessary).
    6. Restore data, if necessary.
    7. Evaluate the actions and conditions that exhibited the problem, or observe for
       normal behavior.
0x004d Progress Reconstruction progress on %s is %s
0x004e Information Reconstruction resumed on %s
0x004f Fatal Reconstruction resume of %s failed due to configuration mismatch
    Suggested actions: The configuration might have changed after a rebuild operation was
    started and before it was completed. This can be the result of an action by firmware
    or the result of some user intervention. Evaluate the previous procedures that were
    used to start and stop the consistency check. Investigate previous related logged
    events to determine the cause of this event.
    A rebuild operation is automatically stopped if another drive fails or goes offline
    for some other reason, or if a user interrupts this process.
    If the user made physical changes to the subsystem while the server was powered off,
    the controller might not be able to resume a rebuild operation from a previous runtime
    session.
0x0050 Information Reconstructing started on %s
0x0051 Information State change on %s from %s to %s
0x0052 Information Drive Clear aborted on %s

0x0053 Caution Drive Clear failed
    Suggested actions: The physical drive could not initialize or clear its data
    structures. This might be a communication problem with the disk drive, or there is a
    low probability that the drive might be defective.
    1. Evaluate previous events to determine trending problems with physical devices.
    2. If trending problems span multiple devices, check and reseat cable and device
       connections.
    3. Move the drive to a different drive bay location, if another drive bay is available.
    4. Update the hard disk drive firmware.
    5. Run Dynamic System Analysis on the hard disk drive to determine its status.
    6. If the hard disk drive fails the diagnostic test or trending problems are isolated
       to one device, replace that device.
    7. Try again to clear the drive, and evaluate whether this was completed correctly.
0x0054 Progress Drive Clear progress on %s is %s
0x0055 Information Drive Clear started on %s
0x0056 Information Drive Clear completed on %s
0x0057 Warning Error on %s (Error %02x)
    Suggested actions: The indicated drive has errors. Monitor the drive and consider
    replacing it if the error count is excessive.
0x0058 Information Format complete on %s
0x0059 Information Format started on %s
0x005a Caution Hot Spare SMART polling failed
    Suggested actions: The controller polled the identified device for SMART events, and
    the polling operation failed. This might be a communication problem with the disk
    drive, or there is a low probability that the drive might be defective.
    1. Evaluate previous events to determine trending problems with physical devices.
    2. If trending problems span multiple devices, check and reseat cable and device
       connections.
    3. Move the drive to a different drive bay location, if another drive bay is available.
    4. Update the hard disk drive firmware.
    5. Run Dynamic System Analysis on the hard disk drive to determine its status.
    6. If the hard disk drive fails the diagnostic test or trending problems are isolated
       to one device, replace that device.
    7. Try again to clear the drive, and evaluate whether this was completed correctly.
0x005b Information Drive inserted: %s
0x005c Warning Drive %s is not supported

0x005d Warning Patrol Read corrected medium error
on %s at %lx
0x005e Progress Patrol Read progress on %s is %s
0x005f Fatal Patrol Read found an uncorrectable medium error on %s at %lx
    Suggested actions: The controller background patrol read operation found a media error
    on the identified drive, and the medium error cannot be corrected. This event usually
    causes the drive to be marked as failed.
    1. Update the hard disk drive firmware.
    2. Manually begin a consistency check and evaluate the drive for normal operation.
    3. If the problems remain, replace the drive at %s.
0x0060 Caution Predictive failure: CDB: %s
    Suggested actions: The drive has reached its internal error threshold and has sent a
    SMART alert (Predictive Failure Analysis alert) to the controller. The hard disk drive
    remains operational until it is marked as failed; however, the drive is predicted to
    fail soon. Replace the drive.
0x0061 Fatal Patrol Read puncturing bad block on %s at %lx
    Suggested actions: The technique of “puncturing bad block” is an earlier method of
    managing bad blocks and is used only with older firmware versions. Upgrade the
    firmware as soon as possible. Run a consistency check to convert the bad block
    management into the current method, which is managed with a “bad block table.”
    Puncturing bad blocks is not supported on storage devices that are larger than 72 GB.
0x0062 Information Rebuild aborted by user on %s
0x0063 Information Rebuild complete on %s
0x0064 Information Rebuild complete on %s
0x0065 Caution Rebuild failed on %s due to source drive error
    Suggested actions: The rebuild failed because of errors in the redundant data. In a
    critical mirror (RAID-1, 10) the source of the error is the online disk member. The
    drive at %s is the drive that is rebuilding, which is usually an operational disk.
    Make sure that there is a recent backup of the data, and try the rebuild operation
    again. If the rebuild operation continues to fail, you might have to re-create the
    disk group or virtual disks. There are usually other errors in the event logs that
    identify the source disk with an unrecoverable error.
    Note: This error might indicate some data loss because the error occurs while the disk
    group or virtual disk is critical and nonredundant.
    1. Update the hard disk drive firmware.
    2. Manually begin a new rebuild operation and evaluate the drive for normal operation.
    3. If the problem remains, replace the source drive.

0x0066 Caution Rebuild failed on %s due to target drive error
    Suggested actions: The rebuild operation failed because of errors on the rebuilding
    hard disk drive. Replace the drive at %s.
0x0067 Progress Rebuild progress on %s is %s
0x0068 Information Rebuild resumed on %s
0x0069 Information Rebuild started on %s
0x006a Information Rebuild automatically started on %s
0x006b Caution Rebuild stopped on %s due to loss of cluster ownership
    Suggested actions: IBM does not support clustering.
0x006c Fatal Reassign write operation failed on %s at %lx
    Suggested actions: Replace hard disk drive %s.
0x006d Fatal Unrecoverable medium error during rebuild on %s at %lx
    Suggested actions: Replace hard disk drive %s.
0x006e Information Corrected medium error during recovery on %s at %lx
0x006f Fatal Unrecoverable medium error during recovery on %s at %lx
    Suggested actions: Replace hard disk drive %s.
0x0070 Information Drive removed: %s
0x0071 Warning Unexpected sense: %s, CDB%s, Sense: %s
    Suggested actions: As commands are sent to the hard disk drives, the controller
    expects applicable responses. When this error occurs, the controller received an
    unexpected response to a command. In a very busy server, this might occur infrequently
    without any significant problems; however, repeated events might indicate some level
    of incompatibility.
    1. Update the hard disk drive firmware.
    2. Update the controller firmware and device drivers.
    3. Reboot the server (the reset might clear the problem).
0x0072 Information State change on %s from %s to %s
0x0073 Information State change by user on %s from %s
to %s
0x0074 Warning Redundant path to %s is broken
    Suggested actions: There is a communication problem within the SAS/SATA subsystem. The
    controller can no longer communicate with the %s device.
    1. If this is a new installation, see the documentation for correct cabling
       instructions.
    2. Check cables for damage, and reseat each cable.
    3. If the problem remains, replace the applicable cables within the identified
       connections.
    4. Replace the backplane.
    Note: Swap cables to identify the bad component through the process of elimination as
    needed.

0x0075 Information Redundant path to %s restored
0x0076 Information Dedicated Hot Spare Drive %s no
longer useful due to deleted drive
group
0x0077 Caution SAS topology error: Loop detected
    Suggested actions: There is a communication problem within the SAS/SATA subsystem. The
    controller can no longer communicate with the %s device.
    1. If this is a new installation, see the documentation for the correct cabling
       instructions.
    2. Check cables for damage, and reseat each cable.
    3. If the problem remains, replace the applicable cables within the identified
       connections.
    4. Replace the backplane.
    Note: Swap cables to identify the bad component through the process of elimination as
    needed.
0x0078 Caution SAS topology error: Unaddressable device
    Suggested actions: There is a communication problem within the SAS/SATA subsystem. The
    controller cannot address the %s device.
    1. If this is a new installation, see the documentation for the correct cabling
       instructions.
    2. Check cables for damage, and reseat each cable.
    3. If the problem remains, replace the applicable cables within the identified
       connections.
    4. Replace the backplane.
    Note: Swap cables to identify the bad component through the process of elimination as
    needed.
0x0079 Caution SAS topology error: Multiple ports to the same SAS address
    Suggested actions: There is a communication problem within the SAS/SATA subsystem. The
    controller detected multiple ports to the same SAS address.
    1. If this is a new installation, see the documentation for the correct cabling
       instructions.
    2. Check cables for damage, and reseat each cable.
    3. If the problem remains, replace the applicable cables within the identified
       connections.
    4. Replace the backplane.
    Note: Swap cables to identify the bad component through the process of elimination as
    needed.

0x007a Caution SAS topology error: Expander error
    Suggested actions: There is a communication problem within the SAS/SATA subsystem. The
    controller detected a problem with the expander on the backplane.
    1. If this is a new installation, see the documentation for the correct cabling
       instructions.
    2. Check cables for damage, and reseat each cable.
    3. Update the backplane firmware (if an update is available).
    4. Replace the backplane.
0x007b Caution SAS topology error: SMP timeout
    Suggested actions: There is a communication problem within the SAS/SATA subsystem. The
    controller detected problems with SMP commands timing out.
    1. If this is a new installation, see the documentation for the correct cabling
       instructions.
    2. Check cables for damage, and reseat each cable.
    3. If the problem remains, replace the applicable cables within the identified
       connections.
    4. Replace the backplane.
    Note: Swap cables to identify the bad component through the process of elimination as
    needed.
0x007c Caution SAS topology error: Out of route entries
    Suggested actions: There is a communication problem within the SAS/SATA subsystem. The
    controller can no longer communicate with the devices.
    1. If this is a new installation, see the documentation for the correct cabling
       instructions.
    2. Check cables for damage, and reseat each cable.
    3. If the problem remains, replace the applicable cables within the identified
       connections.
    4. Replace the backplane.
    Note: Swap cables to identify the bad component through the process of elimination as
    needed.

0x007d Caution SAS topology error: Index not found
    Suggested actions: There is a communication problem within the SAS/SATA subsystem. The
    controller cannot locate a valid index of devices.
    1. If this is a new installation, see the documentation for the correct cabling
       instructions.
    2. Check cables for damage, and reseat each cable.
    3. If the problem remains, replace the applicable cables within the identified
       connections.
    4. Replace the backplane.
    Note: Swap cables to identify the bad component through the process of elimination as
    needed.
0x007e Caution SAS topology error: SMP function failed
    Suggested actions: There is a communication problem within the SAS/SATA subsystem. The
    controller detected problems with the SMP operations.
    1. If this is a new installation, see the documentation for the correct cabling
       instructions.
    2. Check cables for damage, and reseat each cable.
    3. If the problem remains, replace the applicable cables within the identified
       connections.
    4. Replace the backplane.
    Note: Swap cables to identify the bad component through the process of elimination as
    needed.
0x007f Caution SAS topology error: SMP CRC error
    Suggested actions: There is a communication problem within the SAS/SATA subsystem. The
    controller detected CRC errors in the SMP communication.
    1. If this is a new installation, see the documentation for the correct cabling
       instructions.
    2. Check cables for damage, and reseat each cable.
    3. If the problem remains, replace the applicable cables within the identified
       connections.
    4. Replace the backplane.
    Note: Swap cables to identify the bad component through the process of elimination as
    needed.

0x0080 Caution SAS topology error: Multiple subtractive
    Suggested actions: There is a communication problem within the SAS/SATA subsystem. The
    controller detected multiple subtractive issues, which indicates that there are
    problems with the external enclosure unit cabling or the cables that chain the
    external enclosure units together.
    1. If this is a new installation, see the documentation for the correct cabling
       instructions.
    2. Check cables for damage, and reseat each cable.
    3. If the problem remains, replace the applicable cables within the identified
       connections.
    4. Update the backplane or enclosure unit firmware (if a newer version is available).
    5. Replace the backplane.
    Note: Swap cables to identify the bad component through the process of elimination as
    needed.
0x0081 Caution SAS topology error: Table to table
    Suggested actions: There is a communication problem within the SAS/SATA subsystem. The
    controller detected table-to-table issues, which usually indicates that there are
    problems with the external enclosure unit cabling or the cables that chain the
    external enclosure units together.
    1. If this is a new installation, see the documentation for the correct cabling
       instructions.
    2. Check cables for damage, and reseat each cable.
    3. If the problem remains, replace the applicable cables within the identified
       connections.
    4. Update the backplane or enclosure unit firmware (if a newer version is available).
    5. Replace the backplane.
    Note: Swap cables to identify the bad component through the process of elimination as
    needed.

0x0082 Caution SAS topology error: Multiple paths
    Suggested actions: There is a communication problem within the SAS/SATA subsystem. The
    controller detected multiple path issues, which usually indicates that there are
    problems with the external enclosure unit cabling or the cables that chain the
    external enclosure units together.
    1. If this is a new installation, see the documentation for the correct cabling
       instructions.
    2. Check cables for damage, and reseat each cable.
    3. If the problem remains, replace the applicable cables within the identified
       connections.
    4. Replace the backplane.
    Note: Swap cables to identify the bad component through the process of elimination as
    needed.
0x0083 Fatal Unable to access device %s
    Suggested actions: There is a communication problem within the SAS/SATA subsystem. The
    controller cannot access the device %s.
    1. If this is a new installation, see the documentation for the correct cabling
       instructions.
    2. Check cables for damage, and reseat each cable.
    3. If the problem remains, replace the applicable cables within the identified
       connections.
    4. Move device %s to a different slot (if available).
    5. Replace the backplane.
    Note: Swap cables to identify the bad component through the process of elimination as
    needed.
0x0084 Information Dedicated Hot Spare created on %s
(%s)
0x0085 Information Dedicated Hot Spare %s disabled
0x0086 Caution Dedicated Hot Spare %s no longer useful for all drive groups
    Suggested actions: Check the size of the spare and then review the configuration to
    determine whether any active drive members are larger than the spare. If so, this
    event is correctly notifying you that a larger dedicated spare is required to protect
    the disk groups and virtual drives.
0x0087 Information Global Hot Spare created on %s
(%s)
0x0088 Information Global Hot Spare %s disabled

0x0089 Caution Global Hot Spare does not cover all drive groups
Check the size of the spare and then review the configuration to determine
whether any active drive members are larger than the spare. If so, this event
is correctly notifying you that a larger global spare is required to protect
the disk groups and virtual drives.
0x008a Information Created %s}
0x008b Information Deleted %s}
0x008c Information Marking LD %s inconsistent due to
active writes at shutdown
0x008d Information Battery Present
0x008e Warning Battery Not Present Check the battery and the connection to the
adapter. Install the battery if it was removed.
0x008f Information New Battery Detected
0x0090 Information Battery has been replaced
0x0091 Caution Battery temperature is high Check the ambient temperature of the server.
Make sure that the environmental configuration,
temperatures, and airflow are correct for the
server and rack.
0x0092 Warning Battery voltage low 1. Begin a battery calibration and allow it to be
completed.
2. Observe for battery events and normal
operation.
3. If the problem remains, replace the battery.
0x0093 Information Battery started charging
0x0094 Information Battery is discharging
0x0095 Information Battery temperature is normal
0x0096 Fatal Battery needs to be replaced, SOH Bad
Replace the battery.
0x0097 Information Battery relearn started
0x0098 Information Battery relearn in progress
0x0099 Information Battery relearn completed
0x009a Caution Battery relearn timed out 1. Reseat the battery connections.
2. Begin a battery calibration and allow it to be
completed.
3. Observe for battery events and normal
operation.
4. If the problem remains, update the
controller firmware (if a new version is
available) and retry.
5. If the problem remains, replace the battery
and try again.
6. If the problem remains, replace the
controller.
0x009b Information Battery relearn pending: Battery is
under charge

0x009c Information Battery relearn postponed
0x009d Information Battery relearn will start in 4 days
0x009e Information Battery relearn will start in 2 day
0x009f Information Battery relearn will start in 1 day
0x00a0 Information Battery relearn will start in 5 hours
0x00a1 Information Battery removed
0x00a2 Information Current capacity of the battery is
below threshold
0x00a3 Information Current capacity of the battery is
above threshold
0x00a4 Information Enclosure (SES) discovered on %s
0x00a5 Information Enclosure (SAFTE) discovered on
%s
0x00a6 Caution Enclosure %s communication lost There is a communication problem within the
SAS/SATA subsystem. The controller lost
communication with the backplane or external
storage enclosure.
1. If this is a new installation, see the
documentation for the correct cabling
instructions.
2. Check cables for damage, and reseat each
cable.
3. If the problem remains, replace the
applicable cables within the identified
connections.
4. For an internal enclosure unit, replace the
backplane.
5. For an external enclosure unit, see the
enclosure unit documentation for more
information.
Note: Swap cables to identify the bad
component through the process of elimination
as needed.
0x00a7 Information Enclosure %s communication
restored
0x00a8 Caution Enclosure %s fan %d failed 1. Check the enclosure unit fan for correct
operation.
2. See the enclosure unit documentation for
more information.
0x00a9 Information Enclosure %s fan %d inserted 1. Check the enclosure unit fan for correct
operation.
2. See the enclosure unit documentation for
more information.
0x00aa Caution Enclosure %s fan %d removed 1. Check the enclosure unit fan for correct
operation.
2. See the enclosure unit documentation for
more information.

0x00ab Caution Enclosure %s power supply %d failed
1. Check the enclosure unit power supply for correct operation.
2. See the enclosure unit documentation for more information.
0x00ac Information Enclosure %s power supply %d inserted
0x00ad Caution Enclosure %s power supply %d removed
1. Check the enclosure unit power supply for correct operation.
2. See the enclosure unit documentation for more information.
0x00ae Caution Enclosure %s SIM %d failed See the enclosure unit documentation for more
information.
0x00af Information Enclosure %s SIM %d inserted
0x00b0 Caution Enclosure %s SIM %d removed See the enclosure unit documentation for more
information.
0x00b1 Warning Enclosure %s temperature sensor %d below Warning threshold
Check the ambient temperature of the server. Make sure that the environmental
configuration, temperatures, and airflow are correct for the server and rack.
0x00b2 Caution Enclosure %s temperature sensor %d below error threshold
Check the ambient temperature of the server. Make sure that the environmental
configuration, temperatures, and airflow are correct for the server and rack.
0x00b3 Warning Enclosure %s temperature sensor %d above Warning threshold
Check the ambient temperature of the server. Make sure that the environmental
configuration, temperatures, and airflow are correct for the server and rack.
0x00b4 Caution Enclosure %s temperature sensor %d above error threshold
Check the ambient temperature of the server. Make sure that the environmental
configuration, temperatures, and airflow are correct for the server and rack.
0x00b5 Caution Enclosure %s shutdown This might be an expected event if the
enclosure unit was intentionally shut down.
Otherwise, check the enclosure unit power and
cables. See the enclosure unit documentation
for more information.
0x00b6 Warning Enclosure unit %s is not supported; too many enclosure units are
connected to a port
There is a communication problem within the enclosure unit subsystem. The
controller detected too many enclosure units connected to a port. This usually
indicates problems with external enclosure unit cabling or the cables that
chain the external enclosure units together. See the enclosure unit
documentation for more information.
0x00b7 Caution Enclosure unit %s firmware mismatch
Update the controller and enclosure unit firmware until they operate together
correctly.
0x00b8 Warning Enclosure %s sensor %d bad A sensor cannot report its status or has failed.
If it has failed, some information might be
missing from the management of the unit. See
the enclosure unit documentation to determine
which part should be serviced.

0x00b9 Caution Enclosure %s phy %d bad PHY is the physical port connector. Check the
cable from the adapter to the enclosure unit.
Check the enclosure unit power. Examine the
enclosure unit LEDs.
0x00ba Caution Enclosure %s is unstable 1. Check the enclosure unit LEDs for correct
operation.
2. See the enclosure unit documentation for
more information.
0x00bb Caution Enclosure %s hardware error 1. Check the enclosure unit LEDs for correct
operation.
2. See the enclosure unit documentation for
more information.
0x00bc Caution Enclosure %s not responding 1. Check the enclosure unit LEDs for correct
operation.
2. See the enclosure unit documentation for
more information.
0x00bd Information SAS/SATA mixing not supported in
enclosure; Drive %s disabled
0x00be Information Enclosure (SES) hotplug on %s was
detected, but is not supported
0x00bf Information Clustering enabled IBM does not support clustering.
0x00c0 Information Clustering disabled IBM does not support clustering.
0x00c1 Information Drive too small to be used for
auto-rebuild on %s
0x00c4 Warning Bad block table on drive %s is 80% full
This event causes the %s hard disk drive to be marked as failed. Replace the
drive at %s.
0x00c5 Fatal Bad block table on drive %s is full; unable to log block %lx
This event causes the %s hard disk drive to be marked as failed. Replace the
drive at %s.
0x00c6 Information Consistency Check Aborted due to
ownership loss on %s
0x00c7 Information Background Initialization (BGI)
Aborted Due to Ownership Loss on
%s
0x00c8 Caution Battery/charger problems detected; SOH Bad
1. Reseat the battery connections.
2. Begin a battery calibration and allow it to be completed.
3. Observe for battery events and normal operation.
4. If the problem remains, update the controller firmware (if a new version is
available) and try again.
5. If the problem remains, replace the battery and try again.
6. If the problem remains, replace the controller.

0x00c9 Warning Single-bit ECC error: ECAR=%x, ELOG=%x, (%s); Warning threshold
exceeded
A single-bit ECC memory error threshold was exceeded. The controller is
alerting you that the memory on the controller will probably fail soon.
Replace the controller.
Note: At this time there is no data loss.
0x00ca Caution Single-bit ECC error: ECAR=%x, ELOG=%x, (%s); critical threshold
exceeded
A single-bit ECC memory error threshold was exceeded, and the number of errors
is excessive. The controller is alerting you that the memory on the controller
is failing rapidly. Shut down the server and replace the controller as soon as
possible.
Note: At this time there is no data loss.
0x00cb Caution Single-bit ECC error: ECAR=%x, ELOG=%x, (%s); further reporting
disabled
A single-bit ECC memory error threshold was exceeded, and any new events are
not logged. The controller is alerting you that the memory on the controller is
not trustworthy. Shut down the server and replace the controller as soon as
possible.
Note: Additional single-bit ECC errors might cause bad data, and these errors
are not reported.
0x00cc Caution Enclosure %s Power supply %d switched off
1. Check the enclosure unit power supply for correct operation.
2. See the enclosure unit documentation for more information.
0x00cd Information Enclosure %s Power supply %d switched on
0x00ce Caution Enclosure %s Power supply %d cable removed
1. Check the enclosure unit power supply cabling.
2. See the enclosure unit documentation for more information.
0x00cf Information Enclosure %s Power supply %d
cable removed
0x00d0 Information Enclosure %s Power supply %d
cable inserted
0x00d4 Information NVRAM Retention test was initiated
on previous boot
0x00d5 Information NVRAM Retention test passed
0x00d6 Caution NVRAM Retention test failed! Replace the controller.
0x00d7 Information %s test completed %d passes
successfully
0x00d8 Caution %s test FAILED on %d pass. 1. Check the cables and connections.
2. Reseat the %s device.
3. If this is a new configuration, check the
device by swapping it to a new slot position
and determine whether the problem is the
device, backplane, or cable.
4. Observe the server for normal operation.
5. If the problem remains, replace the %s
device.

0x00d9 Information Self check diagnostics completed
0x00da Information Foreign Configuration Detected
0x00db Information Foreign Configuration Imported
0x00dc Information Foreign Configuration Cleared
0x00dd Warning NVRAM is corrupt; reinitializing Allow the controller enough time to try to
correct the problem programmatically. If the
controller is stopped or is unable to recover,
power off and disconnect the hard disk drives
from the controller. Power on the server and
start the WebBIOS. Clear the controller
configuration (reset to defaults).
If the controller stabilizes, reattach the drives
and allow the RAID configuration to import the
existing configuration. If the controller does not
stabilize, update the firmware on the controller.
0x00de Warning NVRAM mismatch occurred Allow the controller enough time to try to
correct the problem programmatically. If the
controller is stopped or is unable to recover,
power off and disconnect the hard disk drives
from the controller. Power on the server and
start the WebBIOS. Clear the controller
configuration (reset to defaults).
If the controller stabilizes, reattach the drives
and allow the RAID configuration to import the
existing configuration. If the controller does not
stabilize, update the firmware on the controller.
0x00df Warning SAS wide port %d lost link on PHY %d
The controller lost a wide port link. A wide port link in this configuration
usually means that a 4-channel connection to an expander-based backplane or
enclosure unit was downgraded from 4 active channels to 3 active channels. The
controller will take a channel down when too many errors occur on the link and
will often reset the channel and restore the link programmatically. This is
normal operation; however, repeating events might indicate a systemic problem
within that connection.
0x00e0 Information SAS wide port %d restored link on
PHY %d
0x00e1 Warning SAS port %d, PHY %d has exceeded
the allowed error rate
0x00e2 Warning Bad block reassigned on %s at %lx
to %lx
0x00e3 Information Controller Hot Plug detected
0x00e4 Warning Enclosure %s temperature sensor %d differential detected
An enclosure unit is reporting a temperature threshold event.
0x00e5 Information Drive test cannot start. No qualifying
drives found
0x00e6 Information Time duration provided by host is not
sufficient for self check

0x00e7 Information Marked Missing for %s on drive
group %d row %d
0x00e8 Information Replaced Missing as %s on drive
group %d row %d
0x00e9 Information Enclosure %s Temperature %d
returned to normal
0x00ea Information Enclosure %s Firmware download in
progress
0x00eb Warning Enclosure %s Firmware download failed
The controller is reporting from the enclosure unit that the recent firmware
update failed. Review the firmware update procedure for the enclosure unit and
try the firmware update again.
0x00ec Warning %s is not a certified drive The controller is reporting that an uncertified
drive is detected at a specific location. If the
drive is newly inserted in the controller, remove
the drive and check for any specific drive
requirements or labeling that is defined for this
solution.
0x00ed Information Dirty cache data discarded by user
0x00ee Information Drives missing from configuration at
boot
0x00ef Information Virtual drives (VDs) missing drives
and will go offline at boot: %s
0x00f0 Information VDs missing at boot: %s
0x00f1 Information Previous configuration completely
missing at boot
0x00f2 Information Battery charge complete
0x00f3 Information Enclosure %s fan %d speed
changed
0x0128 Information Cache discarded on offline virtual
drive. When a VD with cached data
goes offline or missing during
runtime, the cache for the VD is
discarded. Because the VD is offline,
the cache cannot be saved.
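
The event descriptions in Table 5 use C-style placeholders (%s, %d, %lx) that are filled in when an event is reported. As an illustration only, the following Python sketch shows how a few rows of the table might be held in a lookup structure and rendered into a complete message. The MsmEvent type, the EVENTS dictionary, the render_event function, and the sample argument values are hypothetical and are not part of MegaRAID Storage Manager.

```python
from typing import Dict, NamedTuple, Tuple


class MsmEvent(NamedTuple):
    """One row of Table 5: the event type and its description template."""
    severity: str
    template: str


# A few rows copied from Table 5; the dictionary layout is an assumption.
EVENTS: Dict[int, MsmEvent] = {
    0x0083: MsmEvent("Fatal", "Unable to access device %s"),
    0x00A8: MsmEvent("Caution", "Enclosure %s fan %d failed"),
    0x00E2: MsmEvent("Warning", "Bad block reassigned on %s at %lx to %lx"),
    0x00F2: MsmEvent("Information", "Battery charge complete"),
}


def render_event(number: int, args: Tuple = ()) -> str:
    """Fill in the placeholders of an event description."""
    event = EVENTS[number]
    # Python's % operator accepts %s, %d, and %lx (the "l" modifier is ignored).
    text = event.template % args
    return "0x%04x %s: %s" % (number, event.severity, text)


# Hypothetical values: enclosure "2", fan 1.
print(render_event(0x00A8, ("2", 1)))
```

A lookup of this kind makes it straightforward to filter a log by event type, for example to list only Fatal and Caution events before working through the suggested actions.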

Symptoms-to-actions
This section lists common symptoms and suggested actions to take.

The SAS HBA is not seen during POST, or the Preboot GUI is not
accessible
Applicability:
• The server is BIOS-based, and the SAS HBA does not display a POST banner.
• The Preboot GUI configurator (Ctrl+C) is inaccessible or does not start.
• The server is UEFI-based, and the Preboot GUI configurator in “Applications and
  Settings” is inaccessible or does not start.

Possible causes:
• The keyboard or mouse is faulty.
• There is a problem with UEFI, BIOS, or a device driver.
• PCI ROM execution is disabled in the setup utility.
• There is no power to the PCI slot. Check the light path diagnostics panel
  (NMI/SMI).
• The SAS HBA is malfunctioning.
• There are bad storage disk drives.
• The controller or riser card is not correctly seated in the PCI slot.
• There is a problem with the system board.

Problem determination procedure:


1. Reseat the keyboard and mouse connections. If a KVM is used, test a local
keyboard and mouse to make sure that you can interact with the server.
2. Check the server light path diagnostics panel for possible issues.
3. Restart the server, and start the BIOS or UEFI setup utility by pressing F1.
4. From the setup utility, check the following settings to make sure that the PCI
slot is configured correctly:
a. Check the advanced PCI settings to enable PCI ROM execution for the
slot.
b. Check and enable Rehook Int 19h in the Start Options.
5. If the controller is not detected correctly, power off the server, remove all disk
drives from the backplane, and power on the server. If the controller is still not
detected correctly, go to the next step.
6. Power off the server and open the server cover.
7. Disconnect all SAS/SATA cables (including external SAS/SATA cables).
8. Disconnect the remote battery.
9. Remove the SAS HBA.
10. Inspect the PCI slot and controller for damage.
11. Make sure that the controller is in a supported PCI slot.
12. Inspect the SAS/SATA cables for damage, overstretching, nicks, or an
excessive bend radius.
13. Reseat the SAS HBA (do not attach SAS/SATA cables or battery).
14. Make sure that any PCI hot-plug latches are correctly seated.

15. Power on the server.
If these setting changes do not correct the problem, load the default settings to
try to correct the problem.
16. Check the system power-good indicators. (For more information, see the
Problem Determination and Service Guide that comes with the server.)
17. Check for correct LED activity on the SAS HBA. Not all controllers have
LEDs. For more information, see the Quick Installation Guide for each
controller.
18. If the server is now able to detect the SAS HBA, power off the server and
reattach the SAS/SATA cables and remote battery one at a time, repeating
the tests after each one.
19. Reseat the battery (if applicable) and power on the server.

Note: If you get a battery error message, you might have a faulty battery.

Other considerations:

With only one controller and one PCI slot, it is difficult to determine which
component is at fault. Additional hardware, another available PCI slot, or another
test server greatly increases your ability to isolate the fault. For example, you can
test whether other adapters work in the suspect PCI slot and whether the controller
works correctly in another PCI slot or server.
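
The one-at-a-time reattachment in the procedure above is a process of elimination. The following Python sketch is only a way to picture that logic; it does not talk to any hardware. The function name find_faulty_component, the component labels, and the simulated test are all hypothetical.

```python
from typing import Callable, Iterable, List, Optional


def find_faulty_component(
    components: Iterable[str],
    hba_detected: Callable[[List[str]], bool],
) -> Optional[str]:
    """Reattach parts one at a time; return the first part that breaks detection."""
    attached: List[str] = []
    for component in components:
        attached.append(component)
        # In practice: power off, attach the part, power on, and repeat the test.
        if not hba_detected(list(attached)):
            return component
    return None  # Every part was reattached without losing the controller.


# Simulated test for this example: the second cable is the bad part.
def simulated_test(attached: List[str]) -> bool:
    return "SAS cable 2" not in attached


parts = ["SAS cable 1", "SAS cable 2", "remote battery"]
print(find_faulty_component(parts, simulated_test))
```

The sketch stops at the first part that causes a failure, which matches testing after each reattachment rather than reconnecting everything at once.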

One or more SAS HBAs are inaccessible when multiple storage
controllers are installed
Applicability:
• Two or more SAS HBAs are installed.
• Additional non-ServeRAID adapters are installed in the server.
• The server does not boot correctly after another controller is installed.
• The server generates an 1801 POST error after a new controller is installed.

Possible causes:
• The new controller takes a higher position in the PCI scan order and becomes
  the primary boot controller.
• No option ROM memory space is available for additional adapters.

Problem determination procedure:


1. If the controller was recently added and caused the boot failure, remove the
new controller to confirm that the original configuration continues to work. If it
does, the cause of the problem is probably the scan order of the PCI buses in
the server. In some servers, you can change the scan order by using the BIOS
or UEFI setup utility, and in other servers, the scan order is fixed. In a server
with a fixed PCI scan order, a good work-around is to swap the slot locations of
the controllers in the server to make sure that the boot controller takes the
highest priority slot location.
2. Use the advanced PCI settings to enable or disable PCI ROM execution for the
slot. Only the boot SAS HBA requires the BIOS to be enabled. Usually you can
disable other SAS storage controllers without affecting their operational status.
This also frees up option ROM space for other adapters.
3. Make sure that all controllers are operating with the same software. For
multiple SAS HBAs to operate correctly within the same server, all controllers
must have the same BIOS and firmware to reduce the probability of
incompatibilities.
4. For the secondary controller, make sure that the attached drives do not contain
a bootable virtual disk from another configuration. If the data is not needed,
clear the configuration from the added drives by using the Preboot GUI.
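
The requirement that the controllers run the same BIOS and firmware can be expressed as a simple comparison. The following Python sketch assumes that the versions have already been read from each controller by some other means; the HbaInfo structure, the version_mismatches function, and the version strings are hypothetical placeholders, not output from any IBM tool.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class HbaInfo:
    """Inventory record for one controller (hypothetical structure)."""
    slot: int
    bios: str
    firmware: str


def version_mismatches(hbas: List[HbaInfo]) -> Dict[str, Set[str]]:
    """Return each field that has more than one distinct version installed."""
    found = {
        "bios": {hba.bios for hba in hbas},
        "firmware": {hba.firmware for hba in hbas},
    }
    return {field: versions for field, versions in found.items() if len(versions) > 1}


# Placeholder version strings, invented for this example.
installed = [
    HbaInfo(slot=1, bios="bios-A", firmware="fw-1"),
    HbaInfo(slot=3, bios="bios-A", firmware="fw-2"),
]
print(version_mismatches(installed))
```

An empty result means that every controller reports the same levels; any entry in the result names the item to update first.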

System events-to-actions index
System events originate from the server and are logged in the service processor in
the server. Service processors vary from server to server, depending on the current
technology and generation of the hardware. System events can significantly impact
the operational status of the SAS HBA or provide clues to resolving complex issues.
Generally, the following event types can indicate a wide impact on the server
and in some cases can help to isolate a hardware problem from a software or
firmware problem:
• PCI events
• Memory events
• Processor events

For details about troubleshooting, see the Problem Determination and Service
Guide for the server and review the information about system-event logs and
service processors, such as the integrated management module (IMM), baseboard
management controller (BMC), and Remote Supervisor Adapter.
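
When a system-event log has been exported as text, the three event types listed above can be picked out with a keyword filter. The following Python sketch is a rough illustration under that assumption; the function name, the keyword matching, and the sample log lines are invented, and real service processor logs use their own formats.

```python
from typing import Iterable, List

# Event types named in this section; matching on keywords is an assumption.
WIDE_IMPACT_KEYWORDS = ("pci", "memory", "processor")


def wide_impact_events(log_lines: Iterable[str]) -> List[str]:
    """Return the log lines that mention a PCI, memory, or processor event."""
    return [
        line for line in log_lines
        if any(keyword in line.lower() for keyword in WIDE_IMPACT_KEYWORDS)
    ]


# Invented sample lines for this example.
sample_log = [
    "Fan 2 speed changed",
    "PCI parity error detected in slot 3",
    "Correctable memory error on DIMM 4",
]
for line in wide_impact_events(sample_log):
    print(line)
```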

Chapter 8. Replaceable components
Field replaceable units (FRUs) must be replaced only by a trained service
technician, unless they are classified as customer replaceable units (CRUs).

Tier 1 CRU: Replacement of Tier 1 CRUs is your responsibility. If IBM installs a Tier
1 CRU at your request without a service contract, you will be charged for the
installation.

Tier 2 CRU: You may install a Tier 2 CRU yourself or request IBM® to install it, at
no additional charge, under the type of warranty service that is designated for your
product.

For information about the terms of the warranty, see the Warranty Information
document that comes with the SAS HBA. For information about getting service
and assistance, see Appendix A, “Getting help and technical assistance,” on page
57.
Table 6. Field replaceable units for the SAS HBAs
Part number Description
81Y4494 ServeRAID H1110 SAS/SATA Controller for IBM System x
68Y7363 IBM 6 Gb Performance Optimized HBA and expansion-slot bracket
68Y7354 IBM 6 Gb SAS HBA and expansion-slot bracket

Appendix A. Getting help and technical assistance
If you need help, service, or technical assistance or just want more information
about IBM products, you will find a wide variety of sources available from IBM to
assist you. This section contains information about where to go for additional
information about IBM and IBM products, what to do if you experience a problem
with your system, and whom to call for service, if it is necessary.

Before you call


Before you call, make sure that you have taken these steps to try to solve the
problem yourself:
• Check all cables to make sure that they are connected.
• Check the power switches to make sure that the system and any optional
  devices are turned on.
• Use the troubleshooting information in your system documentation, and use the
  diagnostic tools that come with your system. Information about diagnostic tools is
  in the Problem Determination and Service Guide on the IBM Documentation CD
  that comes with your system.
• Go to the IBM support website at http://www.ibm.com/supportportal/ to check for
  technical information, hints, tips, and new device drivers or to submit a request
  for information.

You can solve many problems without outside assistance by following the
troubleshooting procedures that IBM provides in the online help or in the
documentation that is provided with your IBM product. The documentation that
comes with IBM systems also describes the diagnostic tests that you can perform.
Most systems, operating systems, and programs come with documentation that
contains troubleshooting procedures and explanations of error messages and error
codes. If you suspect a software problem, see the documentation for the operating
system or program.

Using the documentation


Information about your IBM system and preinstalled software, if any, or optional
device is available in the documentation that comes with the product. That
documentation can include printed documents, online documents, readme files, and
help files. See the troubleshooting information in your system documentation for
instructions for using the diagnostic programs. The troubleshooting information or
the diagnostic programs might tell you that you need additional or updated device
drivers or other software. IBM maintains pages on the World Wide Web where you
can get the latest technical information and download device drivers and updates.
To access these pages, go to http://www.ibm.com/supportportal/ and follow the
instructions. Also, some documents are available through the IBM Publications
Center at http://www.ibm.com/shop/publications/order/.

Getting help and information from the World Wide Web


On the World Wide Web, the IBM website has up-to-date information about IBM
systems, optional devices, services, and support. The address for IBM System x®
and xSeries® information is http://www.ibm.com/systems/x/. The address for IBM
BladeCenter® information is http://www.ibm.com/systems/bladecenter/. The address
for IBM IntelliStation® information is http://www.ibm.com/systems/intellistation/.

You can find service information for IBM systems and optional devices at
http://www.ibm.com/supportportal/.

Software service and support


Through IBM Support Line, you can get telephone assistance, for a fee, with usage,
configuration, and software problems with System x and xSeries servers,
BladeCenter products, IntelliStation workstations, and appliances. For information
about which products are supported by Support Line in your country or region, see
http://www.ibm.com/services/supline/products/.

For more information about Support Line and other IBM services, see
http://www.ibm.com/services/, or see http://www.ibm.com/planetwide/ for support
telephone numbers. In the U.S. and Canada, call 1-800-IBM-SERV
(1-800-426-7378).

Hardware service and support


You can receive hardware service through your IBM reseller or IBM Services. To
locate a reseller authorized by IBM to provide warranty service, go to
http://www.ibm.com/partnerworld/ and click Find Business Partners on the right
side of the page. For IBM support telephone numbers, see
http://www.ibm.com/planetwide/. In the U.S. and Canada, call 1-800-IBM-SERV
(1-800-426-7378).

In the U.S. and Canada, hardware service and support is available 24 hours a day,
7 days a week. In the U.K., these services are available Monday through Friday,
from 9 a.m. to 6 p.m.

IBM Taiwan product service

IBM Taiwan product service contact information:


IBM Taiwan Corporation
3F, No 7, Song Ren Rd.
Taipei, Taiwan
Telephone: 0800-016-888

Appendix B. Notices
This information was developed for products and services offered in the U.S.A.

IBM may not offer the products, services, or features discussed in this document in
other countries. Consult your local IBM representative for information on the
products and services currently available in your area. Any reference to an IBM
product, program, or service is not intended to state or imply that only that IBM
product, program, or service may be used. Any functionally equivalent product,
program, or service that does not infringe any IBM intellectual property right may be
used instead. However, it is the user's responsibility to evaluate and verify the
operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter
described in this document. The furnishing of this document does not give you any
license to these patents. You can send license inquiries, in writing, to:
IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785
U.S.A.

INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS
PUBLICATION “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS
OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A
PARTICULAR PURPOSE. Some states do not allow disclaimer of express or
implied warranties in certain transactions, therefore, this statement may not apply to
you.

This information could include technical inaccuracies or typographical errors.
Changes are periodically made to the information herein; these changes will be
incorporated in new editions of the publication. IBM may make improvements and/or
changes in the product(s) and/or the program(s) described in this publication at any
time without notice.

Any references in this information to non-IBM websites are provided for
convenience only and do not in any manner serve as an endorsement of those
websites. The materials at those websites are not part of the materials for this IBM
product, and use of those websites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes
appropriate without incurring any obligation to you.

Trademarks
IBM, the IBM logo, and ibm.com are trademarks of International Business Machines
Corp., registered in many jurisdictions worldwide. Other product and service names
might be trademarks of IBM or other companies. A current list of IBM trademarks is
available on the web at “Copyright and trademark information” at
http://www.ibm.com/legal/copytrade.shtml.

Adobe and PostScript are either registered trademarks or trademarks of Adobe
Systems Incorporated in the United States and/or other countries.

Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc., in the
United States, other countries, or both and is used under license therefrom.

Intel, Intel Xeon, Itanium, and Pentium are trademarks or registered trademarks of
Intel Corporation or its subsidiaries in the United States and other countries.

Java and all Java-based trademarks and logos are trademarks or registered
trademarks of Oracle and/or its affiliates.

Linux is a registered trademark of Linus Torvalds in the United States, other
countries, or both.

Microsoft, Windows, and Windows NT are trademarks of Microsoft Corporation in
the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other
countries.

Important notes
Processor speed indicates the internal clock speed of the microprocessor; other
factors also affect application performance.

CD or DVD drive speed is the variable read rate. Actual speeds vary and are often
less than the possible maximum.

When referring to processor storage, real and virtual storage, or channel volume,
KB stands for 1024 bytes, MB stands for 1,048,576 bytes, and GB stands for
1,073,741,824 bytes.

When referring to hard disk drive capacity or communications volume, MB stands
for 1,000,000 bytes, and GB stands for 1,000,000,000 bytes. Total user-accessible
capacity can vary depending on operating environments.
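
The two unit conventions in the notes above can be compared with a short
calculation. The following is a minimal Python sketch, not part of the
original guide: the function name and the sample drive sizes are hypothetical
and chosen only for illustration. It converts a drive capacity stated in
decimal GB (1,000,000,000 bytes) into binary GB (1,073,741,824 bytes), the
unit used for processor storage.

```python
# Unit sizes in bytes, taken from the definitions in the notes above.
KB_BINARY = 1024            # processor storage KB
MB_BINARY = 1024 ** 2       # 1,048,576 bytes
GB_BINARY = 1024 ** 3       # 1,073,741,824 bytes
MB_DECIMAL = 1000 ** 2      # hard disk drive MB: 1,000,000 bytes
GB_DECIMAL = 1000 ** 3      # hard disk drive GB: 1,000,000,000 bytes


def drive_gb_to_binary_gb(drive_gb: float) -> float:
    """Convert a capacity in decimal GB to binary GB (hypothetical helper)."""
    return drive_gb * GB_DECIMAL / GB_BINARY


if __name__ == "__main__":
    # Sample capacities are illustrative assumptions, not values from this guide.
    for size in (146, 300, 600):
        converted = drive_gb_to_binary_gb(size)
        print(f"{size} GB (decimal) is about {converted:.1f} GB (binary)")
```

The gap between the two conventions is one reason an operating system that
reports capacity in binary units can show a smaller number than the drive
label; formatting and operating environment overhead reduce the
user-accessible capacity further.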

Maximum internal hard disk drive capacities assume the replacement of any
standard hard disk drives and population of all hard disk drive bays with the largest
currently supported drives that are available from IBM.

Maximum memory might require replacement of the standard memory with an
optional memory module.

IBM makes no representation or warranties regarding non-IBM products and
services that are ServerProven®, including but not limited to the implied warranties
of merchantability and fitness for a particular purpose. These products are offered
and warranted solely by third parties.

IBM makes no representations or warranties with respect to non-IBM products.
Support (if any) for the non-IBM products is provided by the third party, not IBM.

Some software might differ from its retail version (if available) and might not include
user manuals or all program functionality.

Index
A
assistance, getting 57
attention notices 2

C
caution statements 2
collecting data 7
collecting data using DSA 7
CRU part numbers 55
customer replaceable units (CRUs) 55

D
danger statements 2
data collection 7
documentation, related 1
DSA, using to collect data 7

E
electrical equipment, servicing vi
event messages 27
  MegaRAID Storage Manager 27
  POST 21

F
FRU part numbers 55

G
getting help 57
guidelines
  installation 2
  servicing electrical equipment vi
  system reliability 3
  trained service technicians vi

H
handling static-sensitive devices 4
hard disk drive LEDs 19
hardware service and support 58
help, getting 57

I
IBM Support Line 58
important notices 2
inspecting for unsafe conditions vi
installation guidelines 2

L
LEDs, hard disk drive 19

M
MegaRAID Storage Manager event messages 27
messages
  event 27
  MegaRAID Storage Manager 27
  POST 21

N
notes 2
notes, important 60
notices 59
notices and statements 2

P
parts listing 55
POST messages 21
power on, working inside server 4

R
related documentation 1
replacement parts 55
RETAIN tips 10
returning components 5

S
Safety v
safety hazards, considerations vi
safety statements viii
servicing electrical equipment vi
software service and support 58
software updates 8
statements and notices 2
static-sensitive devices, handling 4
support, website 57
symptoms-to-actions 50
system events 53
system reliability guidelines 3

T
telephone numbers 58
trademarks 59
troubleshooting procedures 10

U
updates, applying for software 8
using DSA to collect data 7

W
website
  publication ordering 57
  support 57
  support line, telephone numbers 58
working inside server 4

Part Number: 81Y1002

Printed in USA

(1P) P/N: 81Y1002
