How to replace a failed Storage Controller

(XtremIO 6.x only)

Notice

In general, SC replacement must be performed using the XtremIO Technician Advisor utility.
• For details on using the XtremIO Technician Advisor Utility to replace a failed Storage
Controller, refer to the XtremIO Technician Advisor User Guide in the XtremIO SolVe
Generator tool (under the Service Scripts and Utilities > XtremIO Technician Advisor
XtremIO Generator section).
• If, according to the XtremIO Technician Advisor User Guide, the Technician Advisor utility
cannot be used to replace a failed Storage Controller, proceed with the procedure in
this InsideEMC document to manually replace the failed Storage Controller.

Introduction
There can be various hardware or software failures that necessitate replacing a Storage
Controller in the XtremIO Cluster.

Table of Contents
• Introduction
• Table of Contents
• References used in this procedure
• SC replacement with active cluster
• Part 1 - Prerequisites prior to scheduling SC FRU
• Part 2 - SC-FRU Known issues
• Part 3 - Storage Controller Re-Image
• Part 4 - SC replace preparation steps
• Part 5 - SC replacement steps
• Part 6 - Post SC replacement steps
• Appendix A - How to verify TCP ports before SC FRU

References used in this procedure


This section includes all items referenced in this procedure. Please make sure to get them
all before starting to execute this procedure.
• How to create, use, and manage the screen user - https://support.emc.com/kb/483835
• Preparing for an XtremIO Storage Controller replacement (SC FRU) - https://support.emc.com/kb/486531
• How to run XtremIO HCS - DOC-103476
• XtremIO FRU Replacement Procedures (FRU replacement guide) for the version
running on the XtremIO cluster - The FRU guide can be downloaded using the
XtremIO SolVe Generator or from the support page for XtremIO
• Storage Controller Rescue Image - XtremIO Storage Controller 6.0.0-59 Rescue
Image
• Win32DiskImager tool to burn USB drive - Win32DiskImager-0.9.5-binary.zip

SC replacement with active cluster


IMPORTANT

Please review How to create, use and manage the Screen user "scruser" if the procedure is
being performed remotely (i.e., over ESRS), to avoid process failure due to network
disconnections.
This procedure should be followed when the cluster is active.

Part 1 - Prerequisites prior to scheduling SC FRU


1. IMPORTANT: If the XtremApp version is 6.0.1-27 or below, escalate to Engineering to
resolve known issues:
Resolving WWPN mismatch and Resolving NVRAM mismatch (http://support.emc.com/kb/518700)

1. IMPORTANT: To ensure the XtremIO cluster host environment is ready for the
upcoming replacement, review the SC-FRU pre-checks with the customer ahead of time.
For details on the SC-FRU pre-checks, refer to EMC KB# 486531 - Preparing for an
XtremIO Storage Controller replacement (SC FRU).
2. IMPORTANT: If the current XMS is at 6.2.0-81, XtremIO Engineering advises upgrading
the XMS to 6.2.0-85 or above in order to avoid several known issues with SC FRU.
If the customer cannot upgrade to XMS 6.2.0-85, refer to the following link for a workaround.
3. Obtain the XtremIO FRU Replacement Procedures (FRU replacement guide) for the
version running on the XtremIO cluster
The FRU guide can be downloaded using the XtremIO SolVe Generator or from the XtremIO
support page.
4. Inform the CE to download the XtremIO SC Rescue Image matching the version
running on the XtremIO cluster.
For clusters running XtremIO version 6.0.0-55 and above, download the XtremIO
Storage Controller 6.0.0-59 Rescue Image.
5. Inform the CE to create a bootable USB drive with the XtremIO SC Rescue Image
according to the SC software re-installation procedure in the FRU replacement guide.
The free tool Win32DiskImager-0.9.5-binary.zip can be used to create the USB drive.
6. Inform the customer to get a KVM/USB keyboard ready.
7. Upload the latest version of the XtremIO Health Check Script (HCS).
Refer to https://inside.emc.com/docs/DOC-103476 for the procedure to upload the
XtremIO HCS to the XMS.
8. It is recommended to execute the Health Check Script while excluding the Storage
Controller being replaced from the check, to avoid false alerts.
Additionally, HCS must be run with the sc_fru tag (check-type) in order to run the
relevant tests.
If the XMS manages more than one cluster, also add the --cluster-id <x> property.
xmcli (tech)> run-script script="system_health-v200.3.1-s4.0.0.py"
arguments="--exclude X1-SC1 --check-type fru_sc"
9. Verify with the customer that the TCP ports required for FRU are open between the XMS
and the Storage Controllers. If ports are blocked, the SC replacement will fail.
For details, refer to Appendix A - How to verify TCP ports before SC FRU.
10. In case of a partial SC failure where the SC continues to serve IO to hosts,
confirm with the customer that the hosts will have sufficient paths to access the
XtremIO cluster while the failed SC is disconnected.
11. Record the MGMT IP configuration of the failed SC (management IP address,
netmask, and default gateway).
This step is only relevant to X1-SC1 or X1-SC2 (all other SCs do not
connect via the MGMT port).
12. Capture the failed SC's FC port 1 & 2 WWPNs and iSCSI IQNs using information in the
cluster log bundle, or using the following command:
xmcli (tech)> show-targets
13. Ensure the correct Storage Controller part number is ordered for replacement,
according to the system type.
For details, refer to the XtremIO FRU Guide for the version running on the cluster.
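
The version conditions in this part are easy to misjudge by eye, because build numbers such as 6.0.1-9 and 6.0.1-27 do not sort correctly as plain strings. The following is a minimal sketch, not part of any XtremIO tooling, of how the checks could be expressed in Python. The function names are hypothetical, and it assumes that the "XtremApp version" and the "XtremIO version running on the cluster" are the same string in the form major.minor.patch-build.

```python
import re


def parse_version(text):
    """Turn '6.2.0-81' into (6, 2, 0, 81) so versions compare numerically."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)-(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def prerequisite_notes(cluster_version, xms_version):
    """Return the version-related notes from Part 1 that apply (illustrative only)."""
    notes = []
    cluster = parse_version(cluster_version)
    xms = parse_version(xms_version)
    # XtremApp 6.0.1-27 and below: escalate for WWPN/NVRAM mismatch.
    if cluster <= parse_version("6.0.1-27"):
        notes.append("escalate to Engineering (WWPN/NVRAM mismatch, KB 518700)")
    # XMS exactly at 6.2.0-81: upgrade to 6.2.0-85 or above is advised.
    if xms == parse_version("6.2.0-81"):
        notes.append("upgrade XMS to 6.2.0-85 or above, or apply the workaround")
    # Clusters at 6.0.0-55 and above use the 6.0.0-59 rescue image.
    if cluster >= parse_version("6.0.0-55"):
        notes.append("use the XtremIO Storage Controller 6.0.0-59 Rescue Image")
    return notes


if __name__ == "__main__":
    for note in prerequisite_notes("6.0.1-27", "6.2.0-81"):
        print(note)
```

Comparing tuples of integers rather than strings is the point of the sketch; the notes themselves only restate the conditions listed above and do not replace the referenced KB articles.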

Part 2 - SC-FRU Known issues


1. If more than one SC has failed and is not responding, and the XMS is below
6.1.0-99, perform the following:
a. Physically replace both SCs
b. Re-image both SCs
c. Confirm both SCs are up and cabled properly (OS is up)
d. Run the SC-FRU procedure one SC at a time
e. If the issue persists, escalate to Engineering
2. If you encounter the following message, follow http://support.emc.com/kb/525301 to fix
the issue:
User: tech, Command: replace_storage_controller, Failed: ram_too_low
3. If the cluster is on version 6.1.0-99, follow http://support.emc.com/kb/520862
to fix an issue related to datalogs.
4. If the SC-FRU fails with the following message, and an InfiniBand port is in
system-disabled state, escalate to Engineering (for more details, refer to
http://support.emc.com/kb/503287):
tech, Command: replace_storage_controller, Failed: [Failure
instance: Traceback: <class 'xceptions.CcFault'>: <Fault 600:
'storage_controller_ib_port_not_up'
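
The two failure messages above can be told apart by the error token they contain. As an illustration only, the mapping from token to follow-up action could be kept in a small lookup table. The sketch below is an assumption about how one might organize this, not an XtremIO utility; the table and function names are hypothetical, and the actions simply restate the items above.

```python
# Error token found in the replace_storage_controller failure text -> follow-up.
# The tokens and KB numbers are taken from the known issues listed above.
KNOWN_FAILURES = {
    "ram_too_low": "follow KB 525301",
    "storage_controller_ib_port_not_up": (
        "if an InfiniBand port is system-disabled, "
        "escalate to Engineering (see KB 503287)"
    ),
}


def triage(message):
    """Return the follow-up actions whose error token appears in the message."""
    return [action for token, action in KNOWN_FAILURES.items() if token in message]


if __name__ == "__main__":
    sample = "User: tech, Command: replace_storage_controller, Failed: ram_too_low"
    print(triage(sample))
```

A message that matches no known token returns an empty list, which should be treated as "not covered by this section" rather than as "no problem".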

Part 3 - Storage Controller Re-Image

Information

If this is an SC Self FRU, skip to Part 4.

1. IMPORTANT: Always re-image the SC prior to an SC FRU or
OCE (even if it arrives from Manufacturing).
2. Ask the CE to set the new SC aside and connect it to a separate power source
and KVM/keyboard.
Important: At this point, no other cable should be connected to the new SC.
3. Ask the CE to connect to the new SC the bootable USB drive that was prepared in Part 1
of this procedure.
4. Power on the new SC and re-image it from the USB drive. For details on the steps
to re-image the SC, refer to the SC software re-installation procedure in the FRU
replacement guide.
5. Log in as xinstall, choose option [2] Display local storage controller
version, and verify that the correct image 6.0.0-59 is installed.
6. Confirm the TECH port is working:
a. Set the laptop IP to 169.254.254.2/255.255.240.0
b. Connect the service laptop to the TECH port on the new SC
c. SSH as xinstall to the default IP 169.254.254.1 on the new SC
d. Make sure you can log in to the SC
7. After the SC is re-imaged and rebooted, please ask CE to disconnect power cables
and KVM from the new SC.
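
If the SSH connection over the TECH port fails, a common cause is a laptop address or netmask that does not place the laptop on the same network as the SC default IP. The following Python sketch, using only the standard library, checks the addresses given in this part. The function names are hypothetical, and the use of TCP port 22 for SSH is an assumption (the standard SSH port), not something stated in this procedure.

```python
import ipaddress
import socket


def same_subnet(laptop_ip, netmask, sc_ip):
    """Return True if sc_ip lies in the network defined by laptop_ip/netmask."""
    network = ipaddress.ip_network(f"{laptop_ip}/{netmask}", strict=False)
    return ipaddress.ip_address(sc_ip) in network


def can_open_tcp(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds (port 22 assumed for SSH)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Addresses from the TECH port check in this part.
    print(same_subnet("169.254.254.2", "255.255.240.0", "169.254.254.1"))
    print(can_open_tcp("169.254.254.1"))
```

The subnet check only validates the laptop configuration; a successful login as xinstall remains the actual confirmation that the TECH port works.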


Part 4 - SC replace preparation steps

Warning
For a single-brick cluster: Ask the CE to confirm that the IB cable is 2 m long (any other IB cable may
cause the SC FRU procedure to fail).
1. Log in to the XMS and start an XMCLI session as tech
2. Collect a new log bundle before proceeding with the rest of this procedure
3. Ask CE to label all the cables connected to the failed SC (if that has not been done
already)
4. Deactivate the failed SC
xmcli (tech)> replace-storage-controller-prepare sc-id=<ID of the failed
SC> cluster-id=<ID of the cluster> <force>
11:20:57 - Storage-Controller-Name Index Cluster-Name X-Brick-
Index Mgr-Addr Mgr-Addr-Subnet MGMT-GW-IP
11:20:57 - X1-SC1 1 xbrick718 1 10.82.78.50 10.82.78.50/24
10.82.78.1
11:20:57 - Running validations
11:20:57 - Disabling Notifiers
11:20:57 - Deactivating Storage Controller
11:21:17 - Powering-off Storage Controller
11:21:33 - Removing old Storage Controller
11:21:33 - Please disconnect Storage Controller X1-SC1 [1]
11:21:33 - Proceed when Storage Controller is physically removed
Please Enter "Done" (to proceed with) or "Abort" (to cancel) the command. (Done/Abort):

Note: The force flag is optional and should only be used
when performing a self SC FRU.
5. Note: Because of multi-cluster management, show-syr-notifier and show-email-notifier
continue to show the notifiers as enabled even though the command disabled them
for this specific cluster. Checking this is not necessary for the procedure to
continue, but the following command shows that the notifiers are indeed disabled
for the cluster:
xmcli (tech)> query class="System" prop-list=["under_maintenance"]
Cluster-Name: xbrick41
Index: 3
Notifiers-Disabled: True
6. Physically remove the faulty SC from the rack cabinet.


Part 5 - SC replacement steps

Warning
Due to an issue discovered with VPLEX and zeroed WWNN, do not connect the FC cables at this stage.

Warning
For large-scale environments (with many volumes) with Native Replication configured, NR may be
automatically suspended and resumed during the SC replacement.
1. Ask the CE to physically insert the replacement SC.
2. Attach the following cables to the new SC:
01. IPMI
02. SAS
03. InfiniBand
04. Management (only if it is X1-SC1 or X1-SC2)
05. Tech Dongle (only if it is X1-SC1 or X1-SC2)
06. Power
3. Ask CE to press the power button to power on the new SC and wait 10 minutes to
allow the new SC to boot.
4. Run the attached script to fix the IB rules issues:
a. Upload the signed script using xmsupload to the XMS
b. Run the script with the following options:
--tech-password <password> - optional
--cluster-id <cluster-id>
--sc-id <id> - id between 1-8

Example:
xmcli (tech)> run-script script="sc_ib_rules_fix-v1.0-
s4.0.0.py" arguments="--cluster-id=1 --sc-id=1"
11:58:05 - 2019-01-01 11:58:02,650 INFO Starting
execution of sc_ib_rules_fix version 1.0
11:58:10 - 2019-01-01 11:58:03,845 INFO Cluster name:
xbrick742-744 Cluster PSNT: XIO00182211064
11:58:10 - 2019-01-01 11:58:04,872 INFO Storage
controller: X1-SC1

11:58:10 - 2019-01-01 11:58:04,872 INFO Checking IB
port status pre-fix...
11:58:10 - 2019-01-01 11:58:08,794 INFO Both IB
interfaces are available on the storage controller. Exiting
without changes
Script exited with status: 0
5. Run the replace command on xmcli:
xmcli (tech)> replace-storage-controller sc-id=<ID of the failed SC> cluster-id=<ID of the
cluster>

Note: The CE must remain next to the cluster in case cabling should be fixed while the
command is running
Note: The process should take approximately 30 minutes.
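
When the IB rules fix has to be run for several Storage Controllers, typing the run-script line by hand invites mistakes in the quoting or in the sc-id. The sketch below shows one way to compose the command text in Python so that it can be pasted into the XMCLI session. It is an illustration under assumptions: the function names are hypothetical, the script file name is the one from the example above with its wrapped line joined, and the optional --tech-password argument is deliberately left out.

```python
def ib_rules_fix_arguments(cluster_id, sc_id):
    """Build the arguments string; sc-id must be between 1 and 8 per the options above."""
    if not 1 <= sc_id <= 8:
        raise ValueError("sc-id must be between 1 and 8")
    return f"--cluster-id={cluster_id} --sc-id={sc_id}"


def run_script_command(script, arguments):
    """Compose the run-script line in the same form as the example in this part."""
    return f'run-script script="{script}" arguments="{arguments}"'


if __name__ == "__main__":
    # Script name as shown in the example; adjust to the file actually uploaded.
    script_name = "sc_ib_rules_fix-v1.0-s4.0.0.py"
    print(run_script_command(script_name, ib_rules_fix_arguments(1, 1)))
```

The sketch only produces text; it does not connect to the XMS or execute anything.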

Part 6 - Post SC replacement steps


1. Ask the CE to connect the target ports (FC or iSCSI).
2. Confirm the target ports are synchronized at the proper speed (10Gb
for iSCSI or 8/16Gb for FC).
If a port is not synchronizing properly, escalate
to Engineering.
3. Confirm that multipathing is restored to normal on all hosts.
4. Collect a fresh log bundle.
5. Run HCS.

Appendix A - How to verify TCP ports before SC FRU


1. The following ports should be open between the XMS and Storage controllers:
TCP 11000 through 11015
TCP 22000 through 22015
TCP 23000 through 23015
2. Example port usage:
11000-11001 - X1-SC1
11002-11003 - X1-SC2

11004-11005 - X2-SC1
3. Check TCP port connectivity with the peer node prior to running the replacement command. For example, to
test connectivity with X4-SC1:

xmcli (tech)> test-xms-tcp-connectivity port=11012 server="<X1-SC1 mgmt
IP>"
Done!
Connectivity checked successfully
xmcli (tech)> test-xms-tcp-connectivity port=11013 server="<X1-SC1 mgmt
IP>"
Done!
Connectivity checked successfully
xmcli (tech)> test-xms-tcp-connectivity port=22012 server="<X1-SC1 mgmt
IP>"
Done!
Connectivity checked successfully
xmcli (tech)> test-xms-tcp-connectivity port=22013 server="<X1-SC1 mgmt
IP>"
Done!
Connectivity checked successfully
xmcli (tech)> test-xms-tcp-connectivity port=23012 server="<X1-SC1 mgmt
IP>"
Done!
Connectivity checked successfully
xmcli (tech)> test-xms-tcp-connectivity port=23013 server="<X1-SC1 mgmt
IP>"
Done!
Connectivity checked successfully
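
The port usage example suggests that each Storage Controller is assigned two consecutive ports in each of the three ranges, in the order X1-SC1, X1-SC2, X2-SC1, and so on. Assuming that pattern continues linearly across the full ranges (an inference from the examples above, not a documented rule), the ports for any Storage Controller and the matching test commands can be generated as in the Python sketch below. The function names are hypothetical.

```python
PORT_BASES = (11000, 22000, 23000)  # start of each TCP range listed above
PORTS_PER_SC = 2                    # two ports per SC, per the usage example
MAX_SC = 8                          # each range holds 16 ports, i.e. 8 SCs


def sc_ports(brick, sc):
    """Return the six ports assumed to belong to X<brick>-SC<sc>."""
    if sc not in (1, 2):
        raise ValueError("sc must be 1 or 2")
    position = (brick - 1) * 2 + (sc - 1)  # X1-SC1 -> 0, X1-SC2 -> 1, ...
    if not 0 <= position < MAX_SC:
        raise ValueError("Storage Controller is outside the listed port ranges")
    offset = position * PORTS_PER_SC
    return [base + offset + i for base in PORT_BASES for i in range(PORTS_PER_SC)]


def connectivity_commands(brick, sc, server):
    """Compose one test-xms-tcp-connectivity line per port for pasting into XMCLI."""
    return [
        f'test-xms-tcp-connectivity port={port} server="{server}"'
        for port in sc_ports(brick, sc)
    ]


if __name__ == "__main__":
    for line in connectivity_commands(4, 1, "<X1-SC1 mgmt IP>"):
        print(line)
```

For X4-SC1 this yields ports 11012, 11013, 22012, 22013, 23012, and 23013, the same six ports tested in the example above.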
