Health Check Review Troubleshooting

Uploaded by

mingli.bi

XtremIO Health Check requests

• Run the Health Check Script

AND

• Review the Dossier file for any abnormalities within the logs.

Why is an XtremIO Health Check needed?


The XtremIO Health Check Script is designed to find known issues on the storage array. To find
new issues on the storage array, we need to review the Dossier log.
Therefore, when we perform a health check review, we need to establish two things:
1) That the array is currently healthy.
2) Which events occurred on the array in the past (when the array was not at its healthiest).

• Run the Health Check Script.
• Review the Dossier file for any abnormalities within the logs.

How to obtain the HCS and Dossier


Download the latest Health Check Script (HCS) version for the appropriate array code version from SolVe
Procedure Generator (or the internal L3L2 collaboration page).
Run the HCS on the array: see KB 464336, XtremIO Health Check Script - Master article.
Read its output carefully, and resolve ALL items as per the instructions in their associated KB articles.

AND

Download the Dossier log bundle from the array: see KB 334928, How to gather an XtremIO Dossier log bundle.

When is a health check required?


A health check is required for dial-home events, on customer or local team request, after a Support
activity (e.g. FRU replacement), following a customer-impacting event, before a planned activity, for
collaborations, etc.
We need to run the HCS to check whether the array is currently healthy, AND check the Dossier
events for the cause of past events. Also check the SR history in Service Cloud for clues, such as a
previous dial-home, a customer-opened SR, or a planned activity at the site.
If a collaboration/swarm is requested from your team, treat the requestor as if they were the
customer: pose the same questions to determine what is needed.

The problem description

• In addition to what the customer has written in the Service Request, be sure to call them and
clearly define the actual problem description.
Example:
Initial problem description: Array performance is slow.
After speaking to the customer: There was a power outage on-site, and their network connectivity
and host services are still being brought back up. Performance will therefore be slow while the
I/O load rebalances. Once their connectivity normalizes, we can health check the array for
stability.

• Ensure you have a timestamp to focus in on the issue.
Note that the timestamps on the XMS and the Storage Controllers (SCs) may differ, so be sure to correlate them.
FAT automatically aligns the timestamps for you.
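
When correlating timestamps by hand, the idea can be sketched in Python as follows. This is a minimal illustration and not part of any XtremIO tooling: the function names, the sample times, and the approach of measuring the offset by reading both clocks at the same moment are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def clock_offset(xms_now: datetime, sc_now: datetime) -> timedelta:
    """Offset to add to an SC timestamp to express it in XMS time.

    Both readings are assumed to be taken at (nearly) the same moment.
    """
    return xms_now - sc_now


def sc_to_xms_time(sc_timestamp: datetime, offset: timedelta) -> datetime:
    """Translate a timestamp read from an SC log into XMS clock time."""
    return sc_timestamp + offset


# Hypothetical readings: the SC clock runs 2 minutes 30 seconds behind the XMS.
offset = clock_offset(
    xms_now=datetime(2020, 1, 15, 10, 5, 0),
    sc_now=datetime(2020, 1, 15, 10, 2, 30),
)

# An event logged on the SC at 09:40:00 corresponds to 09:42:30 in XMS time.
event_xms = sc_to_xms_time(datetime(2020, 1, 15, 9, 40, 0), offset)
print(event_xms)  # 2020-01-15 09:42:30
```

With the offset known, events from the XMS and SC logs can be placed on a single timeline before looking for cause and effect.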

• Questions to ask the customer: When exactly did the issue occur? Is it still ongoing? Did
anything in the environment change recently?

• ALWAYS check Knowledgebase articles for related issues.

For collaboration/swarm requests to your team, treat the requestor as if they were the customer:
pose the same questions to determine what is needed for the request.

When you have gathered this basic information…

What to look for within the Dossier


Depending on the specific issue at hand, these are the first files to check, and to be familiar with, in
order to determine what is or was occurring on the array:

alert.log – shows the XtremApp alerts, i.e. XMCLI show-alerts
audit.log – history of commands parsed by the XMS and who initiated them
xms.log – main XMS management level logging file
messages.log – main Storage Controller level logging file
sym_events.log – system manager (SYM) events
sel_parsed.log – shows some hardware errors, such as DIMM, CATERR, PCIe

Also utilize the FAT tool to analyze the Dossier log.

Concentrate around the time of the issue and the events leading up to that time.
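
Focusing on the time of the issue can be sketched as a small Python filter. This is only an illustration: the timestamp layout (`YYYY-MM-DD HH:MM:SS` at the start of each line), the sample lines, and the function name `lines_in_window` are assumptions, not the actual Dossier log format, so the parsing would need to be adjusted to the file being reviewed.

```python
from datetime import datetime, timedelta

# Assumed layout: each line starts with "YYYY-MM-DD HH:MM:SS" (19 characters).
TS_FORMAT = "%Y-%m-%d %H:%M:%S"
TS_LENGTH = 19


def lines_in_window(lines, issue_time, before, after=timedelta(0)):
    """Return lines stamped within [issue_time - before, issue_time + after].

    Lines without a parsable leading timestamp are skipped.
    """
    start, end = issue_time - before, issue_time + after
    selected = []
    for line in lines:
        try:
            stamp = datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
        except ValueError:
            continue  # no timestamp in the assumed format
        if start <= stamp <= end:
            selected.append(line)
    return selected


# Made-up sample lines, for illustration only.
sample = [
    "2020-01-15 08:00:00 routine entry",
    "2020-01-15 09:35:10 entry shortly before the issue",
    "2020-01-15 09:42:30 entry at the time of the issue",
    "continuation line without a timestamp",
    "2020-01-15 11:00:00 entry well after the issue",
]

issue = datetime(2020, 1, 15, 9, 42, 30)
for line in lines_in_window(sample, issue, before=timedelta(minutes=30),
                            after=timedelta(minutes=5)):
    print(line)
```

The window is deliberately wider before the issue than after it, since the events leading up to the issue are usually the ones of interest.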

Specific issues

• Dial home errors – refer to DH troubleshooting document
    • Check the initial time of the dial home alert
    • Check SR history for any on-going activities
    • Follow the linked KB article on the dial-home alert
    • Run the HCS – fix all fail errors
    • Check ‘What to look for within the Dossier’ section above
    • Search existing Knowledge Base articles
• FRU items – SC, BBU, PSU, SSD – XtremIO FRU portal
    • SC – refer to XtremIO FRU portal and Master KB 482606
    • BBU – refer to XtremIO FRU portal and BBU troubleshooting document
• NDU issues – refer to NDU troubleshooting document
• Connectivity – refer to Connectivity troubleshooting document (SCSI reservations, host collaborations)
• Switch – refer to Connectivity troubleshooting document
• Host – RP + VPLEX document
• Performance – refer to Performance troubleshooting document
• XMS/GUI – search KB articles for various issues
