ECS - ECS Miscellaneous How To Service Procedures-ECS Troubleshooting Procedures
Topic
ECS Miscellaneous 'How To' Service Procedures
Selections
Choose Activity: ECS Troubleshooting Procedures
REPORT PROBLEMS
If you find any errors in this procedure or have comments regarding this application, send email to
[email protected]
Copyright © 2022 Dell Inc. or its subsidiaries. All Rights Reserved. Dell Technologies, Dell, EMC, Dell
EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries. Other trademarks may be
trademarks of their respective owners.
The information in this publication is provided “as is.” Dell Inc. makes no representations or warranties of
any kind with respect to the information in this publication, and specifically disclaims implied warranties of
merchantability or fitness for a particular purpose.
Use, copying, and distribution of any software described in this publication requires an applicable
software license.
This document may contain certain words that are not consistent with Dell's current language guidelines.
Dell plans to update the document over subsequent future releases to revise these words accordingly.
This document may contain language from third party content that is not under Dell's control and is not
consistent with Dell's current guidelines for Dell's own content. When such third party content is updated
by the relevant third parties, this document will be revised accordingly.
Contents
Preliminary Activity Tasks
Read, understand, and perform these tasks
Preliminary Activity Tasks
This section may contain tasks that you must complete before performing this procedure.
Table 1 List of cautions, warnings, notes, and/or KB solutions related to this activity
2. This is a link to the top trending service topics. These topics may or may not be related to this activity.
This is merely a proactive attempt to make you aware of any KB articles that may be associated with
this product.
Note: There may not be any top trending service topics for this product at any given time.
Dell Technologies Confidential Information version: 2.3.6.91
ECS Troubleshooting Guide v1.12
Note: The next section is an existing PDF document that is inserted into this procedure. You may see
two sets of page numbers because the existing PDF has its own page numbering. Page x of y on the
bottom will be the page number of the entire procedure.
Troubleshooting guide
Abstract
This document will assist in basic troubleshooting steps for ECS. It will walk
through what to look for in the UI initially (not all inclusive), as well as some basic
CLI read-only commands. It also covers the Advanced Monitoring (Grafana) UI
that was introduced in ECS 3.4.0.0.
June 2021
Revisions
Date Description
July 2021 Updated document with more troubleshooting steps.
Acknowledgments
Author: Dell Technologies
Copyright © 2021 Dell Inc. or its subsidiaries. All Rights Reserved. Dell Technologies, Dell, EMC, Dell EMC and other trademarks are trademarks of Dell
Inc. or its subsidiaries. Other trademarks may be trademarks of their respective owners. [7/9/2021] [Troubleshooting guide]
Table of contents
Revisions
Acknowledgments
Table of contents
Disclaimer
Summary
Pre-Requisites
Using the UI
1.1 View Current Alerts
1.2 Node and Disk Health
1.3 Capacity Utilization
1.4 Garbage Collection
1.5 Check Requests
1.6 Performance
1.7 Process Health
1.8 Recovery Status
1.9 RPO Status
1.10 Disk as a Customer Replaceable Unit
1.11 Advanced Monitoring (Grafana)
2 Using the CLI - Leveraging Service Tools
2.2 Check Directory Tables (DTs)
2.3 Service Restarts
2.4 Replication Status and TSO
2.5 Capacity
2.6 Space Reclamation/Garbage Collection
2.7 Networking
2.8 Alerts
2.9 Log collection
3 xDoctor
3.1 sudo xdoctor -h
3.2 Search for xDoctor rpm on Dell support site
3.3 Download latest version (direct link) (v68 as of Nov 2020):
3.4 Upgrade to the version in question via:
3.5 How do I configure xDoctor to send xDoctor Reports to Customers via Email?
4 How to configure ECS to send required information to syslog
Disclaimer
This is not a replacement for Dell EMC Customer Service and/or Engineering. Please open a support case
when experiencing any issues with ECS. Troubleshoot at your own risk.
Summary
This document will assist in basic troubleshooting steps for ECS. It will walk through what to look for in the UI
initially (not all inclusive), as well as some basic CLI read-only commands.
It also covers the Advanced Monitoring (Grafana) UI that was introduced in ECS 3.4.0.0.
Pre-Requisites
Login credentials for the ECS UI and SSH access to the ECS nodes (for CLI commands).
Using the UI
Below is a list of what to look for when users report that there may be issues with ECS.
Note: Depending on the version of ECS, you can also launch an Advanced Monitoring (Grafana) UI (ECS
3.4.0.0 and above).
In order to understand what each alert means, reference the latest Monitoring Guide.
Items to look for include node failure, disk failure, RPO lag time, and failover events.
Events
Make sure that the health of all nodes and disks is “Good”. Look for keywords such as “Bad”, “Missing”,
“Removed”, or “Suspect”.
System Health
Check to make sure that ECS is not pushing capacity thresholds (read-only at 90%).
Drill down on the VDC to start investigating capacity on nodes/disk, keeping in mind the load should be
distributed. Look at trends and forecasting as well.
Capacity Utilization
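The capacity check described above (read-only at 90%, plus trends and forecasting) can be illustrated with a simple linear projection. This is only a sketch: the function name, the sample utilization, and the daily growth figure are hypothetical, and the ECS UI may forecast capacity differently.

```python
READ_ONLY_THRESHOLD_PCT = 90.0  # from the text: ECS becomes read-only at 90%

def days_until_threshold(used_pct, growth_pct_per_day,
                         threshold=READ_ONLY_THRESHOLD_PCT):
    """Linear estimate of the days left before utilization reaches the threshold."""
    if used_pct >= threshold:
        return 0.0   # already at or above the threshold
    if growth_pct_per_day <= 0:
        return None  # utilization is flat or shrinking; no projection
    return (threshold - used_pct) / growth_pct_per_day

# Hypothetical values read from the capacity utilization view.
print(days_until_threshold(75.0, 0.5))
```

A short projection like this is a prompt to investigate capacity or reclamation early, not a substitute for the trend view in the UI.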
The ECS UI has a section that reports useful details about garbage collection (GC) and space reclamation (SR); see below.
Capacity Utilization
User data GC is called Repo SR, and system metadata SR is a combination of Btree SR and Journal SR.
If the capacity pending reclamation is high and unreclaimable garbage is comparatively low, the first step is
to run svc_gc, a CLI tool (details in the CLI section below), for basic troubleshooting.
Unreclaimable garbage is garbage detected in the system that is not eligible for reclamation.
ECS has a concept of partial SR. If at least 2/3 of a chunk (see the ECS Technical FAQ for information about
chunks) is garbage data, the chunk is eligible for reclamation. This threshold is the default and is
configurable. ECS internally moves the remaining 1/3 of valid data to a new chunk and reclaims the eligible
garbage chunk.
Garbage in chunks that do not meet this criterion is reported as unreclaimable garbage here.
This will show code numbers for various head services (S3, CAS, etc.).
Look for a high number of error codes. If 500 errors are consistently high, look at the Directory Table (DT)
status (discussed later in this document). If 400 errors are consistently high, work with the application teams
to check things such as permissions, certificates, and networking.
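The triage rule above can be written down as a small helper. This is an illustrative sketch only: the function name and the sample counts are hypothetical and are not part of any ECS tool.

```python
def triage_status(code):
    """Map an HTTP status code to the next troubleshooting step described above."""
    if 500 <= code <= 599:
        return "check Directory Table (DT) status"
    if 400 <= code <= 499:
        return "work with application team (permissions, certificates, networking)"
    return "no action"

# Hypothetical per-code request counts read from the UI.
counts = {200: 98000, 403: 1500, 500: 20}
for code, count in sorted(counts.items()):
    print(code, count, triage_status(code))
```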
Transactions
1.6 Performance
Click the Monitor | Transactions | Performance tab.
Review the latency, bandwidth, and TPS metrics provided. Look for prolonged spikes as well as any
sustained increases. If there are sustained increases, check DT status.
Also, drill down on the VDC to make sure the nodes are being utilized evenly and there isn’t a potential issue
with load balancing.
Transactions
This allows you to check the health and status of CPU, memory, and NIC performance.
System Health
Keep in mind that memory utilization typically runs relatively high. A sustained high level may therefore not
be a cause for concern.
Drill down into each node to view the various services and look for any recurring spikes or high utilization
percentages. Some of the more critical services are blobsvc (data operations header), cm (chunk manager),
sr (space reclamation), and objcontrolsvc.
Also, when looking at each node, review the restarts to make sure that no services are continuously
restarting.
Recovery is the process of rebuilding data after any local condition that results in bad data (i.e. bad chunks). It
is good to ensure that there is not a significant backlog here.
Recovery Status
Review pending replication and make sure RPO (Recovery Point Objective) is up-to-date or close to it. If
there is significant lag it could be indicative of an ECS and/or network issue that needs to be investigated.
Geo Replication
ECS 3.5 introduced the Customer Replaceable Unit (CRU) feature for disks. Until 3.5, disks were field
replaceable units (FRU), replaceable only by Dell EMC. This feature makes it simple for end users to
replace HDD and SSDr (read cache) disks themselves through the UI with one click of a button.
Replacement drives are ordered automatically and shipped to the customer if the customer site has Remote
Services (SRS) configured. Supported hardware configurations: all Gen3 (EX300, EX500, EX3000) and
Gen2 U-Series.
Under Manage, a new Maintenance page is available for managing the CRU feature. See the screenshot
below.
Maintenance
Maintenance
System Health
As shown below, two disks are reported as yellow. Click the yellow disk count icon to see the status of
disk recovery.
Maintenance
Maintenance
After disk recovery completes automatically, the disks are ready to be replaced and alerts like the ones
below appear.
Events
The disks are now ready to be replaced, as shown in the Maintenance tab below.
Click the Replace button and follow the on-screen instructions to complete the process. The LED of the disk
that needs to be replaced is lit for easy identification.
Maintenance
Dashboards
The list of available dashboards can be viewed by clicking the dashboard name at the top. Recently accessed
dashboards appear in the “Recent” folder. If the ECS version is earlier than 3.5, the OE dashboards are not
visible by default. They must be enabled using the SC command below:
(where 169.X.X.X is the private.4 address of the ECS node). They can be disabled by setting the value to
“false” in the above command.
Dashboards
Note: To view “OE Dashboards”, you need to login using the “emcservice” account.
Note: GC/SR related dashboards are available from ECS 3.6 version onwards.
The dashboards provide an overview of the status of system in various fields. By default, they show data for
the last 24 hours for most of the reports. It can be modified by clicking on selected time range (“Last 24
Hours”) here:
Quick Ranges
This shows a summary of user requests in selected time range (on top right). For example, above shows that
system has:
It also provides a summary of latency as well. For example, the above shows that the system has:
Note: A p50 value of 10.86 ms means that 50% of total read requests took less than 10.86 ms. A p90 value
of 2.57 seconds means that 90% of user requests took less than 2.57 seconds. It does not mean that
requests generally took 2.57 seconds; it simply means that 90% of requests took less than 2.57 seconds.
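To make the percentile note concrete, the sketch below computes p50 and p90 from a list of latencies using the nearest-rank method. The sample latencies are hypothetical, and the nearest-rank method is an assumption chosen for simplicity; the dashboard may compute its percentiles differently.

```python
def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples <= it."""
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # integer ceiling of p/100 * n, 1-based
    return ordered[max(rank, 1) - 1]

# Hypothetical request latencies in milliseconds; one slow outlier.
latencies_ms = [5, 7, 8, 9, 10, 11, 12, 15, 40, 2570]
p50 = percentile(latencies_ms, 50)
p90 = percentile(latencies_ms, 90)
print(p50, p90)
```

Here a single slow request does not move p50 or p90, which is why a high percentile value should be read as a bound on most requests rather than as a typical latency.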
The graph is plotted using 5-minute intervals, so the bandwidth graph shows the read/write request size
for every 5 minutes. The legend also provides a summary of max/min/avg values for the selected time
range.
In same dashboard, we can further drill down on the type of requests. For example, below will categorize the
successful/failed requests based on method (GET/PUT/HEAD etc.) or protocol (S3/CAS etc.) or error code
(500/404 etc.)
This “Data Access Performance” dashboard is also available with namespaces, nodes and protocols category
(separate dashboard for each).
This shows a summary of resource utilization at vdc level in selected time range (on top right). For example,
the above shows that system has:
There are graphs below which show a trend of each of the above metrics over selected time range.
The “Process Health” dashboard is also available with process and nodes category (separate dashboard for
each).
Like “Data Access Performance” and “Process Health”, there are dashboards available to monitor disks’
health as well: “Disk Bandwidth – Overview" and “Disk Bandwidth – By Nodes”.
The dashboard provides various migration details (see the screenshot below), such as the amount of data
migrated from the source (the nodes to be retired) to the new nodes, the migration speed, and the time to
completion.
This Grafana dashboard is useful for basic troubleshooting, for example to see whether the data migration
is progressing and whether it is happening for all the nodes.
E.g.
New SSD drives must be ordered to activate this feature. SSD read cache is supported only in a VDC
where all the nodes are one of the following hardware types:
• Gen3 EX-Series
• Gen2 U-Series
Please contact Professional services or your account representative if you have such a requirement.
Once the new SSD drives are inserted and the feature is enabled, critical parameters can be monitored
from the Grafana dashboard.
Latency Numbers:
Latency
Disk Usage
1.11.5 OE Dashboards
These dashboards are only available with emcservice/emcmonitor account. They provide further insights in
ECS which can help in troubleshooting:
OE Dashboards
(IOE) DT Status
The first graph shows status of all DTs over selected time range. The second graph shows the
unready/unknown (if any) DT type and count.
If a user or client reports access or latency issues, first check the DT status for that time period, which
can be found quickly using the graphs above. The output is similar to that of “svc_dt check”, discussed in the
CLI section later in the document. Unready DTs can also be caused by service restarts. Compare the
DT unready times with the service restarts to see whether the events are related.
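The comparison of DT unready times with service restarts can be sketched as a simple time-window match. The function name, the 10-minute window, and the sample timestamps below are all hypothetical; in practice the times would be read from the dashboard graphs.

```python
from datetime import datetime, timedelta

def restarts_near(unready_at, restarts, window=timedelta(minutes=10)):
    """Return the restarts that happened within `window` before an unready-DT observation."""
    return [(service, ts) for service, ts in restarts
            if timedelta(0) <= unready_at - ts <= window]

# Hypothetical observations taken from the graphs.
unready_at = datetime(2021, 6, 1, 10, 5)
restarts = [
    ("blobsvc", datetime(2021, 6, 1, 10, 0)),
    ("cm", datetime(2021, 6, 1, 7, 30)),
]
related = restarts_near(unready_at, restarts)
print(related)
```

Only the restart that falls shortly before the unready observation is returned, which suggests the two events may be related.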
DT Distribution
The graph above shows that there were a few service restarts in the selected time range. The legend below
the graph shows each service name and its total count of restarts in the time range. The hostnames on
which the service restarts happened can also be found in the legend.
When troubleshooting any performance issue, always check first for service restarts and DT status for that
time period; the graph above can help with that.
At the top left, there is also a dropdown for hostnames and service names, which can be used to monitor
restarts on a specific host or for a specific service.
The graph above shows the memory and swap utilization of each service. It also shows the number of open
file descriptors (fds) for each service.
In the same dashboard, there are a few other graphs that show the thread count, CPU utilization, and disk
I/O of each service in the selected time range.
2.1.1 svc_version
The svc_version script can be run to check the ECS version and the versions of other components, as
shown in the screenshot below.
svc_version
2.1.1.1 svc_version -h
svc_version -h
The output that you will see with this command includes the following - look for high latency or poor
performance that may be impacting:
Typically, during initial troubleshooting, the option -min is set to “x minutes ago” or n for “x hours ago”.
Another common option is -s, which gives a shortened summary output shown below.
2.1.2.1 kpi.sh -h
kpi.sh -h
Without specifying any options, the default output is based on the past 60 minutes and displays the long form
output.
kpi.sh -s -min 30
Here it is important to check the balance across nodes and whether there is a large number of 500 errors.
Typically, a DT issue will impact all nodes.
The command below combines a few options where 403 errors (for example) are gathered during a specific
five-minute time period:
You can also run this command against a particular bucket if you know what the application is using.
2.1.2.4 kpi.sh -s
kpi.sh -s
kpi.sh -s -cas
Error Code           | HTTP Status Code | Generic Error Code  | Error Description
AccessDenied         | 403              | AccessDenied        | Access Denied
BadDigest            | 400              | BadDigest           | The Content-MD5 you specified did not match that received.
BucketAlreadyExists  | 409              | BucketAlreadyExists | The requested bucket name is not available. The bucket namespace is shared by all users of the system. Please select a different name and try again.
BucketNotEmpty       | 409              | BucketNotEmpty      | The bucket you tried to delete is not empty.
ContentMD5Empty      | 400              | InvalidDigest       | The Content-MD5 you specified was invalid.
ContentMD5Missing    | 400              | InvalidRequest      | The required Content-MD5 header for this request is missing.
2.1.3 SVC_REQUEST
svc_request -h
svc_request -h
ECS stores the metadata of important artifacts such as buckets, namespaces, and objects in the form of
"Directory Tables" (DTs). DTs are comparable to the database of a traditional application. There are
several Directory Tables in ECS that store specific types of data:
2.2.1 svc_dt -h
svc_dt -h
Check whether any DTs are unknown or in an unready state. If any are down or have not been checked
recently, open a case and inform support. Occasionally a DT may be unready; if this is sustained over
multiple checks, open a case and inform support. Note that eight or more unknown or unready DTs trigger
an alert that is sent to Dell EMC.
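The decision logic above can be summarized in a short sketch. The function name and the interpretation of "sustained" (a non-zero count in every one of at least two consecutive checks) are assumptions made for illustration; only the threshold of eight comes from the text.

```python
ALERT_THRESHOLD = 8  # from the text: eight or more unknown/unready DTs trigger an alert

def assess(unready_counts):
    """unready_counts: unknown+unready DT totals from consecutive checks, oldest first."""
    latest = unready_counts[-1]
    if latest >= ALERT_THRESHOLD:
        return "alert threshold reached: open a case"
    # Assumption: "sustained" means every check in the series was non-zero.
    if len(unready_counts) >= 2 and all(count > 0 for count in unready_counts):
        return "sustained unready DTs: open a case"
    if latest > 0:
        return "transient so far: check again"
    return "healthy"

print(assess([0, 1]))     # a single unready DT seen once
print(assess([2, 1, 3]))  # unready DTs seen in every check
```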
svc_dt check
Another useful DT command is “svc_dt dist” which shows how balanced the DTs are across all the nodes in
the VDC (ECS cluster). Note that the output should be “well” balanced based on the number of nodes in your
VDC. A node with very low or no DTs assigned is an indication of a node problem.
svc_dt dist
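As an illustration of the balance check, the sketch below flags nodes whose DT count is far below the per-node average. The node names, the counts, and the 50% cut-off are hypothetical; the real counts come from the "svc_dt dist" output, whose format is not reproduced here.

```python
def underloaded_nodes(dt_counts, ratio=0.5):
    """Flag nodes whose DT count is below `ratio` times the per-node average."""
    average = sum(dt_counts.values()) / len(dt_counts)
    return sorted(node for node, count in dt_counts.items()
                  if count < ratio * average)

# Hypothetical per-node DT counts copied by hand from the distribution output.
dt_counts = {"node1": 260, "node2": 255, "node3": 258, "node4": 0}
print(underloaded_nodes(dt_counts))
```

A node that appears in the result, such as one with no DTs assigned, is a candidate for the node problem described above.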
• blobsvc – Manages the following tables: Object (OB), Listing (LS), and Repo Chunk Reference (RR)
• cm - Manages the following tables: Chunk (CT), Btree Reference (BR). Provides the logic to handle
various events based on the chunk's current state and decide which state to transition to next.
• objcontrolsvc - Provides REST APIs for configuring the ECS cluster, managing ECS resources, and
monitoring the system.
• vnest - Provides distributed synchronization and group services. A subset of data nodes will be group
members responsible for serving the key/value requests. VNest services running on other nodes will
listen for configuration updates and be ready to be added to the group.
2.3.1 svc_node -h
svc_node -h
If these services are restarted repeatedly, I/O will be impacted and a service request (SR) should be opened.
2.4.1 svc_replicate -h
svc_replicate -h
When running a summary, it is important to look at current rates (per node), TSO, and what is pending
(typically, pending chunks never reach zero since there is always something replicating).
svc_replicate summary
To check the current Temporary Site Outage (TSO) state, the command below and its options provide
insight into the TSO state along with heartbeat and task status:
2.4.3 svc_tso -h
svc_tso -h
svc_tso summary
2.5 Capacity
svc_vdc_capacity
svc_vdc_trend
2.6.1 svc_gc -h
svc_gc -h
If there is concern about the rate at which deleted data is reclaimed, the rates reclaim option below will
display the daily reclaim rate for repo and btree data. For example, if your applications are deleting 1TB per
day and the reclaim rate is only 1GB per day, open an SR to investigate further.
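The comparison in the example above can be expressed as simple arithmetic. This is a sketch under stated assumptions: the function names and the 50% ratio used to flag a problem are hypothetical, and 1 TB is taken as 1000 GB.

```python
def backlog_growth_gb(deleted_gb_per_day, reclaimed_gb_per_day):
    """Garbage that accumulates each day when reclamation lags behind deletion."""
    return max(deleted_gb_per_day - reclaimed_gb_per_day, 0)

def needs_investigation(deleted_gb_per_day, reclaimed_gb_per_day, min_ratio=0.5):
    """Flag the case where reclamation covers less than `min_ratio` of deletions."""
    if deleted_gb_per_day == 0:
        return False
    return reclaimed_gb_per_day / deleted_gb_per_day < min_ratio

# The example from the text: 1 TB deleted per day, 1 GB reclaimed per day.
print(backlog_growth_gb(1000, 1), needs_investigation(1000, 1))
```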
When looking at repo, the stats command provides two sections of output.
The first covers statistics broken down by full and partial garbage in terms of capacity. The second covers
the same breakdown in terms of chunks.
Keep in mind that full garbage is when an entire chunk (128MB in size) is marked for 100% deletion. Partial
garbage is when less than 100% of a chunk is marked for deletion. For example, a chunk can be
1/4 marked for deletion or 1/2 marked for deletion.
Furthermore, there are two types of partial garbage referred to as eligible and ineligible. Partial eligible is
when a chunk has been marked for at least 2/3 deletion. In this case, ECS will take the remaining 1/3 and
move it to another chunk which frees up 100% of the original chunk. Partial ineligible is when the chunk is
marked for less than 2/3rds deletion, in which case it will remain on the system until it meets the defined
threshold.
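The garbage categories described above can be summarized in a short classification sketch. The 128MB chunk size and the 2/3 threshold come from the text; the function name and the sample values are hypothetical, and the threshold is treated here as a fixed constant even though it may be tuned.

```python
from fractions import Fraction

CHUNK_SIZE_MB = 128                      # chunk size given in the text
PARTIAL_SR_THRESHOLD = Fraction(2, 3)    # threshold for partial reclamation

def classify_chunk(garbage_mb):
    """Classify a chunk by how much of it (in whole MB) is marked for deletion."""
    fraction = Fraction(garbage_mb, CHUNK_SIZE_MB)
    if fraction == 1:
        return "full"
    if fraction >= PARTIAL_SR_THRESHOLD:
        return "partial eligible"
    if fraction > 0:
        return "partial ineligible"
    return "no garbage"

for garbage_mb in (128, 96, 64, 32, 0):
    print(garbage_mb, classify_chunk(garbage_mb))
```

Exact fractions are used instead of floating point so that a chunk sitting exactly on the 2/3 boundary is classified consistently.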
It is important to notice whether a large amount of garbage is stuck in reclaim (especially if it continues to
increase rather than decrease). This information helps support understand whether something may be
stuck or whether various parameters should be modified.
2.7 Networking
Although there are some networking statistics in the UI, failures are not among them. However, various
statistics can be pulled using the CLI (xDoctor also has alerts).
2.7.1 svc_network -h
svc_network -h
To check if a NIC is down or unavailable, run the following command. The screen shot below is for one node,
but all nodes are displayed when running the command.
svc_network summary
2.8 Alerts
The UI has an Alerts tab, but managing alerts from the UI can be cumbersome, so a CLI tool is available to
email some of these alerts to users.
2.8.1 svc_alert
Tool to display, filter, clear, and send email notifications for system alerts.
Please update xDoctor (xDr) to version 4.8-74 or above to get the latest changes and enhancements to this
tool, which are discussed below.
When using this tool for the first time, the alerts_conf.json file must be updated. A skeleton
alert_conf.json file is created with the command below; see the screenshot below.
The alert_conf.json file needs to be updated with recipents_mailId, alert type and severity, SMTP sender
mail ID, and IP address.
E.g.
svc_alerts list
svc_alerts summary
svc_alerts mail
svc_alerts mail_kpi
Log collection in a distributed system such as ECS is a tedious task. svc_collect makes it easier to collect
the logs for a specific duration and store them in a compressed format.
If only the logs are required, and not the configs and commands, the -nocfg and -nocmd options can
be used.
This tool must be run on a node that is not a vnest member. Run svc_vnest members to identify the
non-vnest member nodes on which the tool can be run.
The logs from all the nodes that match the criteria are zipped and stored under /tmp/, for example:
/tmp/svc_collect-SystemTest-20210526_090935.ziz
3 xDoctor
xDoctor is a tool used by Dell Customer Support to monitor, report on, and troubleshoot the health of your
ECS Appliance. Keeping xDoctor updated to the most current version enables Dell EMC Customer Support
to more quickly detect and resolve issues with your ECS Appliance.
The latest version is always available using the "xdoctor --upgrade --auto --now" option if the customer's ECS
system can establish a connection to ftp.emc.com. If not, the latest version can be downloaded via
dell.com/support (ECS Appliance / Drivers & Downloads / Category=Product Tool).
sudo xdoctor -h
Contact Dell Customer Service if you cannot access the above link.
sudo xdoctor -s
3.4.2 sudo xdoctor (this runs a standard health check on the rack in question)
sudo xdoctor
3.4.3 sudo <Session Report> -CEW (this prints the Critical/Error/Warning messages
of the report in question)
3.5 How do I configure xDoctor to send xDoctor Reports to Customers via Email?
Please follow the steps below.
┌────────────────────────────┐
└───┬────────────────────────┘
┌───┼──────────┐
│ 1 │ Overview │
└───┼──────────┘
┌───┼────────────────────┐
└───┼────────────────────┘
┌───┼─────────────┐
│ 3 │ Auto Update │
└───┼─────────────┘
┌───┼────────────────┐
│ 4 │ Data Scrubbing │
└───┼────────────────┘
┌───┼─────────────────────┐
└───┼─────────────────────┘
┌───┼───────────────┐
│ 6 │ IPMI Analysis │
└───┼───────────────┘
┌───┼──────┐
│ 0 │ Exit │
└───┴──────┘
┌────────────────────────────┐
└───┬────────────────────────┘
┌───┼───────────────────────────────┐
└───┼───────────────────────────────┘
│ Status = Enabled
└┬─
│ SRS 1 ID = e7ec9fbb-d0ae-4e09-a192-06b9aa8ce2d8
┌───┬┴───────────────────┐
│ 2 │ Events to Customer │
└───┼────────────────────┘
│ Status = Disabled
└┐
┌───┬┴──────────┐
│ 0 │ Main Menu │
└───┴───────────┘
Email Recipient (single) []: [email protected] <- Enter customer's email address or mailing list here
|- Recipients = [email protected]
|- TLS = False
• Contact Dell EMC Customer Service (that is, create an SR) for any “Critical” or “Error” messages
that cannot be resolved or that require more in-depth investigation. “Warning” messages do not typically
need any attention.
• xDoctor Release Notes (version 68):
• https://dl.dell.com/content/docu97687_xDoctor_ReleaseNotes_4.8-68.pdf?language=en_US
ECS Syslog (as a fabric application container) supports forwarding of the alerts and audit messages to one or
multiple remote syslog servers.
Alerts and audit messages are sent from the system (host OS), Agent, Lifecycle, Registry, Zookeeper, and Object
services to the fabric-syslog container via a UDP socket (9154).
The rsyslog server must be configured to forward messages to the predefined localhost port (UDP 9154). No
extra configuration step is required for ECS OS (ECS appliance, ECS certified SD). For ECS custom SD
(DIY), the customer is responsible for configuring a syslog service on the node.
Customers are responsible for configuring their syslog servers to receive alerts from ECS. Please
refer to the customer-viewable KB in the Reference section below, which has a sample setup from one of the
internal Dell EMC labs (Article Number: 000012004).
This document shows how to use different configurations on ECS to send either the complete log, part
of the log file based on some condition, or a summary produced by a script:
https://object.ecstestdrive.com/ecstsguide/checkResponseCode.pl?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=132657591476211228%40ecstestdrive.emc.com%2F20210630%2FNone%2Fs3%2Faws4_request&X-Amz-Date=20210630T065024Z&X-Amz-Expires=99999&X-Amz-SignedHeaders=host&X-Amz-Signature=473865e2d853d04420e47d5a81953031d2b1609cf946571135de212996ea2bac
IMPORTANT NOTE: When a NODE REPLACEMENT is performed, review the MOTD and copy these files
back in place after the node replacement procedure.
Detailed steps:
Command# cat /opt/emc/caspian/fabric/agent/services/fabric/syslog/host/files/config-syslog.conf
Example:
admin@orem-malachite:~> cat /opt/emc/caspian/fabric/agent/services/fabric/syslog/host/files/config-syslog.conf
$ModLoad imudp
$UDPServerAddress 127.0.0.1
$UDPServerRun 9514
*.info @10.247.200.80:514
*.info @10.247.200.85:514
admin@orem-malachite:~>
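To confirm that the local listener shown in the example above accepts messages, a test message can be sent to it. The following is a minimal Python sketch and is not part of the ECS tooling. It assumes the listener address and port from the example configuration (127.0.0.1, UDP 9514); the function name send_test_message and the message text are hypothetical.

```python
import logging
import logging.handlers


def send_test_message(host="127.0.0.1", port=9514, text="ECS syslog forwarding test"):
    """Send one UDP syslog message to the given listener and return the text sent."""
    # SysLogHandler uses UDP when given a (host, port) tuple.
    handler = logging.handlers.SysLogHandler(address=(host, port))
    logger = logging.getLogger("ecs-syslog-test")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    try:
        logger.info(text)
    finally:
        # Detach and close so repeated calls do not send duplicates.
        logger.removeHandler(handler)
        handler.close()
    return text
```

Calling send_test_message() on a node and then checking the remote syslog server for the test text gives a quick end-to-end check. UDP delivery is not acknowledged, so a missing message does not by itself identify where it was dropped.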
Example:
],
"status" : "OK",
"etag" : 5410
}
admin@orem-malachite:~>
a. Send complete dataheadsvc-access.log file (i.e., all error codes including 200 OK). This can also
be achieved using svc_request (refer to Article Number: 000020726 in the References section).
ruleset(name="ecsaccesslogs500errors") {
if ( $msg contains "HTTP/1.1 404" ) then
{action(type="omfwd" Target="10.247.200.80" Port="514" Protocol="udp")
stop
}
}
input(type="imfile" ruleset="ecsaccesslogs500errors"
File="/opt/emc/caspian/fabric/agent/services/object/main/log/dataheadsvc-access.log"
Tag="ecs"
Severity="info"
Facility="local7"
StateFile="ecs500tosyslog")
admin@provo-malachite:~>
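Before deploying a rule like the one above, it can help to preview which log lines the "contains" condition would match. The following is a minimal Python sketch, assuming only that matching lines contain the literal text used in the rule. The function name lines_matching and the sample lines are hypothetical and are not real ECS log output.

```python
def lines_matching(lines, needle="HTTP/1.1 404"):
    """Return the lines containing the substring (case-sensitive), like a 'contains' test."""
    return [line for line in lines if needle in line]


# Hypothetical sample lines, used only to exercise the function.
sample = [
    "example request A HTTP/1.1 200",
    "example request B HTTP/1.1 404",
]
matches = lines_matching(sample)
print(len(matches))
```

To preview a real file, pass an open file object for a copy of dataheadsvc-access.log instead of the sample list.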
ruleset(name="ecsconnectionlimit") {
if ( $msg contains "Connection Limit(1000) reached" ) then
{action(type="omfwd" Target="10.247.200.80" Port="514" Protocol="udp")
stop
}
}
input(type="imfile" ruleset="ecsconnectionlimit"
File="/opt/emc/caspian/fabric/agent/services/object/main/log/dataheadsvc.log"
Tag="ecs"
Severity="info"
Facility="local7"
StateFile="ecsconnlimittosyslog")
admin@sandy-malachite:~>
ruleset(name="ecserrorcode_summary") {
action(type="omfwd" Target="10.247.200.80" Port="514" Protocol="udp")
stop
}
input(type="imfile" ruleset="ecserrorcode_summary"
File="/opt/emc/caspian/fabric/agent/services/object/main/log/dh_responsecode_summary.15minout"
Tag="ecs"
Severity="info"
Facility="local7"
StateFile="ecserrorcodesummary")
admin@sandy-malachite:~>
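The summary file forwarded above is produced by a separate script whose logic is not reproduced in this guide. As an illustration only, the idea of a response-code summary can be sketched in Python as follows. The sketch assumes that each access-log line contains the text "HTTP/1.1" followed by a three-digit status code (the same text the filter rules match on); the function name summarize_status_codes is hypothetical.

```python
import re
from collections import Counter

# Matches the status code that follows the protocol text, e.g. "HTTP/1.1 404".
STATUS_RE = re.compile(r"HTTP/1\.1 (\d{3})")


def summarize_status_codes(lines):
    """Count how many lines carry each HTTP status code."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

A summary of this kind is much smaller than the full access log, which is the reason for forwarding it instead of every request line.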
Note: Please ensure ~/MACHINES has all node private.4 IPs from all Racks of the VDC.
5. Update the MOTD on all nodes to list the config files and how to restore them (this is needed to restore
the config files when a node replacement is performed).
IMPORTANT NOTE: When a NODE REPLACEMENT is performed, review the MOTD and copy these files
back in place after the node replacement procedure.
6. Verify that the log files on the syslog server are receiving the logs/summary.
a. Monitor complete dataheadsvc-access.log file (i.e., all error codes including 200 OK)
b. Monitor 500 errors from dataheadsvc-access.log file (a 404 error was used for testing purposes)
The prerequisite for this is a custom Slack app. Please visit the Slack help center for more
information on how a Slack app can be created.
1. Create a new Slack app in the workspace where you want to post messages.
2. From the Features page, toggle Activate incoming webhooks on.
3. Click Add new webhook to workspace.
4. Pick a channel that the app will post to, then click Authorize.
5. Use your incoming webhook URL to post a message to Slack.
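Step 5 above can be sketched in Python as follows. This is a minimal illustration and not the sample script referenced in this guide. It relies on the Slack incoming webhook convention of a JSON body with a "text" field; the function names are hypothetical, and the webhook URL must be the one generated for your own app.

```python
import json
import urllib.request


def build_slack_request(webhook_url, text):
    """Build a POST request carrying a JSON body with a 'text' field."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_to_slack(webhook_url, text, timeout=10):
    """Send the message and return the HTTP status code of the response."""
    request = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping the request construction separate from the network call makes it possible to inspect the payload without posting anything to the channel.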
Below is the list of checks that are performed in the sample script.
1. ECS version
2. xDr version
3. VDC capacity
4. Directory Table Status
5. KPI summary
6. Any errors/warning reported by xDr.
1. SSH as the admin user to any ECS node in the VDC that needs to be monitored.
https://object.ecstestdrive.com/ecstsguide/ECSGetVDCStats.py?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=132657591476211228%40ecstestdrive.emc.com%2F20210630%2FNone%2Fs3%2Faws4_request&X-Amz-Date=20210630T064321Z&X-Amz-Expires=99999&X-Amz-SignedHeaders=host&X-Amz-Signature=9f7b8dc039086dd371659766f669229308e3cd58a7a736bad46b7ab61c46ae30
3. Set the ECSGetVDCStats.py script as a cron job. In the below screenshot, the ECSGetVDCStats.py script
is set to run every 2 hours and post the result.
# crontab -e
4. The script runs and collects the data as a cron job, based on how it is configured in the cron entry (see
previous step).
'token': Token of the Slack channel which is used to display the ECS data
All the members of the Slack workspace except guests have access to this feature.
Registration Process
1. Visit https://portal.ecstestdrive.com/
2. Click the button below to get started.
portal.ecstestdrive.com
4. After successful registration, you will receive an email containing a link; clicking it completes
the process.
5. You will be presented with the EULA. Please review the EULA, click the check box to indicate
acceptance, and then click the submit button.
6. At this point all the provisioning is done, i.e. your namespace, namespace management user, and object
users are created and credentials are generated.
7.1 ECSSync
ecs-sync is an open-source tool designed to migrate large amounts of data in parallel. This data can originate
from many different sources.
There are many reasons why you may need to migrate data. Tech refreshes, switching vendors, evacuating
EOL racks. Maybe your application team is starting to embrace the object paradigm and wants existing files to
become objects. Or perhaps you need to move sensitive data out of a public cloud. No matter the reason,
ecs-sync can probably help. It was written specifically to move large amounts of data across the network
while maintaining app association and metadata. With ecs-sync, you can copy an NFS export into an S3
bucket. You can migrate clips from Centera to ECS. You can even zip up an Atmos namespace folder into a
local archive. There are many use-cases it supports.
Using a set of plug-ins that can speak native protocols (file, S3, Atmos and CAS), ecs-sync queries the
source system for objects using CLI-, XML- or UI-configured parameters. It then streams these objects and
their metadata in parallel across the network, transforming/logging them through filters, and writes them to the
target system, updating app/DB references on success. There are many configuration parameters that affect
how it searches for objects and logs/transforms/updates references. See the Full CLI Syntax for more details
on what options are available.
A Note on Support
ecs-sync is an open-source tool. As such, there is no commercial support for its use (any support provided on
github is best-effort and community-based). If you plan on migrating your production data, you should
consider a Dell Professional Services migration package. The Dell PS team have extensive knowledge of
ecs-sync and a migration package comes with the full commercial support of Dell EMC engineering.
7.3 Mongoose
Documentation for Mongoose 3.x.x is available at https://github.com/emc-mongoose/mongoose/wiki
Mongoose is a tool that was initially intended to test ECS performance. It is designed to be used for:
• Load Testing
• Stress Testing
• Soak/Longevity/Endurance Testing
• Volume Testing
• Smoke/Sanity Testing
Mongoose can sustain millions of concurrent connections and millions of operations per second.
7.4 Tools
• smart-client-java
o https://github.com/EMCECS/smart-client-java
• python-ecsclient
o https://github.com/EMCECS/python-ecsclient
• Check for DT status using “(OE) DT Status” dashboard in “Advanced Monitoring” section. Make sure to
cover the time range mentioned by user.
• Check for any service restarts in given time range using “(OE) Service Restarts” dashboard in “Advanced
Monitoring” section.
In most cases, performance issues are caused by DT-related events or service restarts. A service
restart causes certain DTs to go down for a certain amount of time while the service comes
up. If a service restarted (mainly dataheadsvc, blobsvc, cm), that would explain the latency/timeouts
experienced by the user at that time. You can tell the user that a service restart event occurred and
caused performance issues during that time. Please contact DellEMC Support for further help.
• Using the “Data Access Performance - Overview” dashboard, verify whether there was a sudden spike in the
number of requests at that time. A sudden increase in the number of requests may cause memory pressure and
lead to latency issues. Check whether the spike is expected and verify the same from the application end.
• You can also verify if requests are balanced across nodes i.e., all nodes are getting same number of
requests.
• Using “(OE) Processes on Host” dashboard, verify if all resource usage is fine.
• Check for any service restarts.
• Open a ticket with DellEMC Support for further help.
8.1.3 Customer noticed the average write latency has gone up in the last 2 hours.
8.1.3.1 Things to check/do:
An important point to note here is that only the write latency has increased, not the read latency. If large
files are being uploaded, it is expected that the time taken to upload them increases. Check the transactions
for the last 2 hours using svc_request -start "2 hours ago" -stop "now" summary and verify whether the
objects being uploaded are unusually large. Please see the below screenshot for more details.
svc_request
8.2.1 Customer is not able to write and getting HTTP 403, Access Denied error
code.
8.2.1.1 Things to check/do:
HTTP error code 403 means “Access Denied” in most cases. A 403 error can be verified using the command
kpi.sh -s -start "X mins ago" shown in the CLI section. It could be due to multiple reasons, but the main
things to verify are:
• Check if the user has correct permissions and is using correct credentials to access. Check permissions in
UI->Manage->Buckets, select namespace/bucket, edit Bucket, edit ACL and review user ACL.
• Check whether the time on the client side is in sync with the time on the ECS nodes.
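The time check in the last bullet can be sketched in Python by comparing the client clock with the Date header of an HTTP response. This is a minimal illustration: it assumes that a response from the ECS data endpoint carries a standard HTTP Date header, and the function name clock_skew_seconds is hypothetical. The tolerance that ECS applies is not stated in this guide, so the sketch only reports the difference.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def clock_skew_seconds(server_date_header, client_now=None):
    """Return client time minus server time, in seconds.

    server_date_header is an HTTP Date header value,
    for example "Wed, 30 Jun 2021 06:50:24 GMT".
    """
    server_time = parsedate_to_datetime(server_date_header)
    if client_now is None:
        client_now = datetime.now(timezone.utc)
    return (client_now - server_time).total_seconds()
```

A result far from zero, in either direction, points to a clock problem on the client or on the nodes and is worth ruling out before reviewing ACLs.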
8.2.2 Customer is not able to write and getting HTTP 403, Method Forbidden error
code
8.2.2.1 Things to check/do:
HTTP error code 403 may indicate a “Method Forbidden” error as well. A 403 error can be verified using the
command kpi.sh -s -start "X mins ago" shown in the CLI section. It is mostly due to the quota limit being
exceeded for the bucket. Verify the below things:
• From UI, check quota limit set for the bucket (UI->Manage->Buckets)
• From UI, check quota limit set for the namespace (UI->Manage->Namespace)
• Check current capacity utilization of bucket using Metering (UI->Monitor->Metering) or using svc_bucket
info <bucket_name>
• Increase quota limit if needed or inform client of usage limit
• Open a case with Dell EMC Support if the limit is not reached but a user is still getting quota limit reached
error.
Bucket Management
8.2.3 Customer is not able to read few objects and getting HTTP 404 return code.
8.2.3.1 Things to check/do:
HTTP error code 404 means object is not found on ECS. You can verify below things:
• Run svc_request -on $OBJECTNAME summary for the object in question and confirm that 404 is returned for
the GET operation on this object.
• Check if object was ever written to ECS using application logs.
• Check if last update on object has dmarker (If dmarker is true then it’s a deleted object and 404 is
expected).
8.2.4 Customer is not able to delete an object; HTTP 409 error code was returned.
8.2.4.1 Things to check/do:
When trying to delete an object, if you are getting 409 error, this means that object is under retention period,
and cannot be deleted.
• Verify bucket retention period using ECS REST API: GET /object/bucket/{bucketName}/retention
• Check with bucket owner and modify policy if needed
Note: Retention can be set at the namespace, bucket, and object level. The maximum retention value is
enforced, so the retention setting must be checked at all three levels.
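The note above can be illustrated with a short Python sketch. It only combines values you have already collected from the three levels; it does not call any ECS API. Two assumptions are made for illustration and should be confirmed against the ECS documentation: that all three values are expressed in seconds, and that retention is measured from the object creation time. The function names are hypothetical.

```python
def effective_retention_seconds(namespace=0, bucket=0, obj=0):
    """Return the retention that applies: the largest of the three levels."""
    return max(namespace, bucket, obj)


def is_deletable(created_epoch, now_epoch, namespace=0, bucket=0, obj=0):
    """True once the effective retention period has elapsed since creation."""
    retention = effective_retention_seconds(namespace, bucket, obj)
    return now_epoch >= created_epoch + retention
```

For example, with a one-hour namespace retention and a one-day bucket retention, the one-day value governs, so a delete attempted after two hours is still rejected.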
and POST) were hitting errors but not read (GET). Also, using svc_node tool service status was also checked
and none of the services were restarting.
Since reads were fine, DTs were ready, and there were no service restarts, capacity was checked using the
svc_vdc capacity tool. It was found that there was no free space left, which is why writes were failing.
If the overall used capacity is at 90% then writes are not allowed. Please note that a minimum of 3 nodes
whose overall capacity is less than 90% is required for a successful write.
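The rule stated above can be expressed as a small Python check. This is a sketch of the rule exactly as worded in this guide (at least 3 nodes below 90% used), not the logic ECS itself runs. The function name writes_possible is hypothetical, and the per-node percentages would come from whatever capacity report you have available.

```python
def writes_possible(node_used_percent, limit=90.0, min_nodes=3):
    """Return True if at least min_nodes nodes are below the used-capacity limit."""
    nodes_below_limit = [used for used in node_used_percent if used < limit]
    return len(nodes_below_limit) >= min_nodes
```

With five nodes at 95, 92, 91, 85 and 70 percent used, only two nodes are below the limit, so the check returns False.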
In this instance, dataheadsvc.log showed that the application was requesting logging, requestPayment,
tagging, and website, which are not supported/implemented, and hence ECS returns a 501 error.
This behavior is expected when the requested functionality is not implemented, and it is documented in the
error code page. The application should be updated to stop calling those APIs or to expect a 501 error code
from ECS.
kpi.sh -s -start
The expanded node was on the same ECS software version as others and there were no service restarts or
DT unready issue.
Running svc_network check all and the latest version of xDoctor revealed a duplicate IP address in
the network that was causing issues. The customer shut down the VM that had been assigned the same IP
address, and after that 500 errors were no longer reported.
Using the svc_dt check tool, the status of the DTs was checked, and all the DTs were found to be ready.
svc_network check all and svc_tso heartbeat reported connection issues to the remote VDC. If the
connection/heartbeat between the VDCs in a federation is not working for 15 minutes (the default, which is
configurable), a Temporary Site Outage (TSO) is triggered. There was an issue with the switch on the
customer side, and the vendor was engaged to resolve the network issue between the 2 VDCs.
svc_tso summary was run to check the TSO status, and it showed a TSO condition. Once the network issue was
resolved, the system came out of TSO and the 500 errors were no longer reported.
svc_request -on $OBJECTNAME summary was run and found no requests for this object, so kpi.sh -s -bucket
$bucketname was run, which showed that there were no transactions at all for this bucket.
Further investigation on the load balancer side revealed a network issue at the load balancer,
which was causing the problem.
The issue was resolved after the network problem in the load balancer was fixed.
• Make sure bucket is empty. If it’s not, use s3 browser (for a s3 bucket), or any other tool, to delete the
bucket contents first
• Check if user has sufficient permission to delete the bucket
You can get similar info using svc_bucket info <bucket_name>, but that’s federation level data, as opposed to
vdc level data in dashboard above.
Metering Page
Verify whether the size and object count reported by the end user match what ECS is reporting. If not, there
is a metering discrepancy, which is generally due to the following reasons.
• Incomplete MPU
• High number of non-current object versions.
• Compression of the Data at chunk layer.
Please contact DELLEMC support to investigate the discrepancy and take the necessary action to correct
it.
• If RPO is a few seconds/minutes, it may be that a huge amount of data was recently ingested. Wait for
some time for the data to be copied, and check RPO again.
• The screenshot below from the ECS UI shows that RPO is NOT up to date.
Geo Monitoring
• If RPO doesn’t come down and continues to increase, verify the replication network bandwidth between VDCs.
• Using svc_replicate summary, check if tasks in geo replication queue are moving. If any node doesn’t
show any activity, it may have a problem.
• Open an SR with Dell EMC Support if RPO continues to show lag.
8.6.1 Customer logged in to UI and found a node offline. Also unable to ssh to the
node in question
8.6.1.1 Things to check/do:
• Check System Event Log (SEL) for any CATERR (catastrophic error) or Processor IERR (Internal Error).
Run this command from a good node against the BMC IP/private IPMI of the problematic node:
getrackinfo -v (run this to get the BMC/private IPMI IP of the node in question)
getrackinfo -v
sudo ipmitool -I lanplus -H <BMC IP/private IPMI> -U root -P passwd sel elist
If the node cannot be brought back online, please open a SR with Dell EMC.
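Once the SEL output has been captured to a file or a string, the search for CATERR and IERR entries can be sketched in Python as follows. This is a minimal illustration that treats the output as plain text and keeps any line mentioning either keyword; the function name find_fatal_sel_events is hypothetical, and the sample lines are invented for illustration and are not real output from an ECS node.

```python
def find_fatal_sel_events(sel_text, keywords=("CATERR", "IERR")):
    """Return the SEL lines that mention any of the given keywords."""
    return [
        line.strip()
        for line in sel_text.splitlines()
        if any(keyword in line for keyword in keywords)
    ]


# Hypothetical lines shaped like 'sel elist' output, for illustration only.
sample_sel = """\
   1 | 01/01/2021 | 10:00:00 | Power Unit | Power off/down | Asserted
   2 | 01/01/2021 | 10:05:00 | Processor | IERR | Asserted
"""
fatal_events = find_fatal_sel_events(sample_sel)
print(len(fatal_events))
```

An empty result does not prove the node is healthy; it only means neither keyword appears in the captured log.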
8.6.2 End user complaining, bucket utilization in UI is not reducing after deleting the
objects.
8.6.2.1 Things to check/do:
As a first step, we will check if the User data and system metadata GC process is enabled. We can check this
from ECS UI.
Capacity Utilization
The garbage reclaim rate can be checked, using svc_gc rates reclaim.
Using svc_gc stats repo and svc_gc stats btree, ensure the reclaimable garbage is high (in TBs).
8.6.3 Scenario 19: Customer is not able to ssh to the ECS node.
8.6.3.1 Things to check
Ping the ECS nodes from your workstation and verify that the network connectivity is fine.
If it is, log in to the ECS UI and check whether any of the nodes are reported as
offline (Monitor --> System Health --> Offline Nodes).
If no nodes are reported as offline, navigate to Settings --> Platform Locking and verify whether the nodes
are locked. If the nodes are locked from the platform, you will not be able to ssh to the ECS nodes.
Platform Locking
If all the nodes are unlocked, then verify whether you can ssh from another workstation.
8.6.4 Scenario 20: Customer reported that ECS is not dialing home.
8.6.4.1 Things to check/do:
Log in to the ECS UI and, in the Settings tab, verify that the ESRS server is reported as connected. If it is
not, verify the network connectivity between ECS and the ESRS server. If it is showing connected, send a test
dial home alert.
If dial home alert is still not received, then please contact DELLEMC support team.
8.7.1 Objects not expiring even after setting a life-cycle policy for a bucket.
8.7.1.1 Things to check/do:
Using svc_bucket info <bucket_name>, verify that the bucket policy is properly applied: check that the name
of the policy that was set up matches what is applied to the bucket in question, and check the expiry date
in the bucket policy.
If it is correct but the objects still have not expired, please contact DELLEMC support for further
investigation.
8.8.1 Not able to open ECS UI after uploading the certificate to the Mgmt. interface
8.8.1.1 Things to check/do:
If the uploaded certificate is bad due to a stray new line or space, the nginx service will fail to restart
after the upload. Upload the corrected certificate again using the procedure in the ECS admin
guide, and that should fix the problem.
The same issue can occur if the certificate chain is broken, so have the certificate validated by the
signing authority to make sure the certificate you are uploading to ECS is valid.
The available procedure for validating a certificate using OpenSSL can be used to check and troubleshoot
certificate-related problems.
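The stray new line or space described above can be spotted before uploading with a short Python check on the PEM text. This is a minimal sketch: it flags blank lines, leading or trailing whitespace, and missing or unbalanced BEGIN/END markers, and it does not validate the certificate chain or replace the OpenSSL procedure. The function name pem_format_problems is hypothetical.

```python
def pem_format_problems(pem_text):
    """Return a list of formatting problems found in PEM certificate text."""
    problems = []
    lines = pem_text.split("\n")
    # A single newline at the very end of the file is normal; ignore it.
    if lines and lines[-1] == "":
        lines = lines[:-1]
    for number, line in enumerate(lines, start=1):
        if line == "":
            problems.append(f"line {number}: blank line")
        elif line != line.strip():
            problems.append(f"line {number}: leading or trailing whitespace")
    begins = sum(1 for line in lines if line.strip() == "-----BEGIN CERTIFICATE-----")
    ends = sum(1 for line in lines if line.strip() == "-----END CERTIFICATE-----")
    if begins == 0 or begins != ends:
        problems.append("BEGIN/END CERTIFICATE markers missing or unbalanced")
    return problems
```

An empty list means none of these formatting problems were found; the certificate content itself still needs to be validated separately.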
Use the kpi.sh script (details in the CLI section) to see if there were any 500 errors at the time when ECS
got disconnected from DD. If so, these 500 errors may be the reason for the disconnect. Contact the support
team for further investigation into the RCA.
Likewise, during DD cloud tier cleanup, if for whatever reason there are 500 errors on ECS, DD will
disconnect from ECS. Please contact the support team for further investigation into the error before
retrying the DD cloud tier cleanup.
The new application was sending requests with host-style addressing, and because an appropriate BaseUrl
was not set up, ECS interpreted them as path-style, which led to the 500 error.
Please refer to the admin guide for how the BaseUrl should be pre-configured based on how the application is
going to send requests.
URL Format
BaseUrl used in a host-style URL should be pre-configured using the ECS Management API or the ECS
Portal (for example, emc.com in URL: bucketname.ns1.emc.com)
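How a configured BaseUrl lets a host-style name be split into its parts can be sketched in Python. This is an illustration of the example above only (bucketname.ns1.emc.com with BaseUrl emc.com); it is not the parsing logic used by ECS, it handles only the bucket.namespace.baseurl layout, and the function name parse_host_style is hypothetical.

```python
def parse_host_style(host, base_url):
    """Split 'bucket.namespace.<base_url>' into its parts.

    Returns None when the host does not end with the base URL or does not
    have exactly two labels in front of it.
    """
    host = host.lower().rstrip(".")
    suffix = "." + base_url.lower().strip(".")
    if not host.endswith(suffix):
        return None
    labels = host[: -len(suffix)].split(".")
    if len(labels) != 2 or not all(labels):
        return None
    return {"bucket": labels[0], "namespace": labels[1]}
```

When the host does not match any configured BaseUrl, the host name carries no bucket information, which is consistent with the request being treated as path-style as described above.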
Log in to the ECS UI; the Dashboard reports the data pending erasure coding (EC). There will always
be some data pending EC in a busy ECS system, but if a large amount of data is pending,
please contact support for further investigation.
Storage Efficiency
9 Additional Information
Dell Technologies Confidential Information version: 2.3.6.91
Page 87 of 87