
ECS Troubleshooting Guide v1.11

This document provides guidance on troubleshooting the Dell ECS object storage platform. It outlines steps for reviewing alerts, hardware health, capacity utilization, request codes, performance metrics, process health, recovery status, and replication status using the ECS user interface. It also describes commands for checking version information and metrics using the ECS command line interface. Real-world troubleshooting examples and an overview of the advanced Grafana monitoring interface are included.

Uploaded by

puneetgoyal100

Dell Customer Communication - Confidential

ECS Troubleshooting Guide v1.11


Leveraging the UI and CLI

Nov 30, 2020



Contents
Disclaimer
Summary
Pre-Requisites
Using the UI
View Current Alerts
Node and Disk Health
Capacity Utilization
Check Requests
Performance
Process Health
Recovery Status
RPO Status
Advanced Monitoring (Grafana)
Using the CLI
Leveraging Service Tools
ECS Version
KPI Script
ECS CAS error codes
ECS S3 error codes
SVC_REQUEST
Check Directory Tables (DTs)
Service Restarts
Replication Status and TSO
Capacity
Space Reclamation/Garbage Collection
Networking
xDoctor
Real World Examples:
Performance Related Scenarios:
Additional Information

Disclaimer
This is not a replacement for Dell EMC Customer Service and/or Engineering. Please open a support case when
experiencing any issues with ECS. Troubleshoot at your own risk.

Summary
This document will assist in basic troubleshooting steps for ECS. It will walk through what to look for in the UI
initially (not all inclusive), as well as some basic CLI read-only commands.

It also covers the Advanced Monitoring (Grafana) UI that was introduced in ECS 3.4.0.0.

Pre-Requisites
Login credentials and access to the ECS UI and nodes.

Using the UI
Below is a list of what to look for when users report that there may be issues with ECS. Note that depending
on the version of ECS, you can also launch an Advanced Monitoring (Grafana) UI (ECS 3.4.0.0 and above).

View Current Alerts


Click on Monitor → Events → Alerts tab

In order to understand what each alert means, reference the latest Monitoring Guide.

Items to look for include node failure, disk failure, RPO lag time, and failover events.

Node and Disk Health


Click on Monitor → System Health → Hardware Health → All Nodes and Disks (drill down)

Make sure that the health of all nodes and disks is “Good”, and look for keywords such as “Bad”,
“Missing”, “Removed”, or “Suspect”.

Capacity Utilization
Click on Monitor → Capacity Utilization

Check to make sure that ECS is not pushing capacity thresholds (read-only at 90%).

Drill down on the VDC to start investigating capacity on nodes/disk, keeping in mind the load should be
distributed. Take a look at trends and forecasting as well.
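
As a rough illustration of the trend and forecasting idea above, the following Python sketch
extrapolates daily utilization samples linearly to estimate how many days remain before the 90%
read-only threshold is reached. The function name and the sample values are hypothetical, and a
straight-line estimate is an assumption made for this example; it is not how the ECS UI computes
its forecast.

```python
def days_until_threshold(daily_used_pct, threshold_pct=90.0):
    """Estimate days until threshold_pct is reached.

    daily_used_pct: used-capacity percentages, one per day, oldest first.
    Uses a straight line from the first to the last sample (an assumption).
    Returns None when usage is flat or shrinking.
    """
    if len(daily_used_pct) < 2:
        raise ValueError("need at least two samples")
    current = daily_used_pct[-1]
    if current >= threshold_pct:
        return 0.0
    growth_per_day = (current - daily_used_pct[0]) / (len(daily_used_pct) - 1)
    if growth_per_day <= 0:
        return None
    return (threshold_pct - current) / growth_per_day


# Hypothetical samples: 0.5 percentage points of growth per day.
print(days_until_threshold([70.0, 70.5, 71.0, 71.5, 72.0]))
```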

Check Requests
Click on Monitor → Transactions → Requests tab

This will show code numbers for various head services (S3, CAS, etc.).

Understanding the HTTP code numbers for S3:


4xx (400-level) = Application Error (User Failures)
5xx (500-level) = System Error (System Failures)

Look for a high number of error codes. If 500-level errors are consistently high, take a look at
Directory Table (DT) status (discussed later in this document). If 400-level errors are consistently
high, work with application teams to check things such as permissions, certificates, and networking.
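
The triage rule above can be sketched in Python as follows. This is only an illustration: the
function names, the sample data, and the 5% threshold for "consistently high" are assumptions made
for this example, not values defined by ECS.

```python
from collections import Counter


def summarize_status_codes(codes):
    """Count responses per HTTP class, e.g. '2xx', '4xx', '5xx'."""
    return Counter(f"{code // 100}xx" for code in codes)


def next_step(summary, threshold=0.05):
    """Suggest where to look first; the 5% threshold is illustrative only."""
    total = sum(summary.values())
    if total == 0:
        return "no requests in sample"
    if summary["5xx"] / total >= threshold:
        return "check Directory Table (DT) status"
    if summary["4xx"] / total >= threshold:
        return "work with application team (permissions, certificates, networking)"
    return "error rates look normal"


# Hypothetical sample of response codes.
sample = [200] * 90 + [404] * 4 + [500] * 6
summary = summarize_status_codes(sample)
print(summary["5xx"], next_step(summary))
```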

Performance
Click on Monitor → Transactions → Performance tab

Review the latency, bandwidth, and TPS metrics provided. Look for prolonged spikes as well as any
sustained increases. If there are sustained increases, check DT status.

Also, drill down on the VDC to make sure the nodes are being utilized evenly and there isn’t a potential
issue with load balancing.

Process Health
Click on Monitor → System Health → Process Health

This allows you to check the health and status of CPU, memory, and NIC performance.

Keep in mind memory typically runs relatively high. Therefore, if it’s at a sustained higher level this
may not be a cause for concern.

Drill down into each node to view the various services, and look for any recurring spikes or high
utilization percentages. Some of the more critical ones are blobsvc (data operations header), cm
(chunk manager), sr (space reclamation), and objcontrolsvc.

Also, when looking at each node, review the restarts to make sure certain services aren’t continuously
bouncing.

Recovery Status
Click on Monitor → Recovery Status

Recovery is the process of rebuilding data after any local condition that results in bad data (i.e. bad
chunks). It is good to ensure that there is not a significant backlog here.

RPO Status
Click on Monitor → Geo Replication

Review pending replication and make sure RPO (Recovery Point Objective) is up-to-date or close to it.
If there is significant lag it could be indicative of an ECS and/or network issue that needs to be
investigated.

Advanced Monitoring (Grafana)


Click on Advanced Monitoring section in UI. It will redirect to a new page with Grafana dashboards.

The list of available dashboards can be viewed by clicking the dashboard name at the top. Recently
accessed dashboards show up in the “Recent” folder. If the ECS version is earlier than 3.5, the OE
dashboards aren’t visible by default. They need to be enabled using the SC command below:

service-console run Configure_Grafana_Dashboards --enable-oe-dashboards true --target-node 169.X.X.X


(where 169.X.X.X is private.4 for the ECS node). It can be disabled by setting the value as “false” in the
above command.

Note: To view “OE Dashboards”, you need to login using the “emcservice” account.
Note: GC/SR related dashboards are available from ECS 3.6 version onwards.

The dashboards provide an overview of the status of the system in various areas. By default, most of
the reports show data for the last 24 hours. This can be modified by clicking on the selected time
range (“Last 24 Hours”) here:

Some frequently accessed dashboards are discussed below:

➢ Data Access Performance – Overview



This shows a summary of user requests in the selected time range (on the top right). For example, the
above shows that the system has:

1. Number of Successful requests = 174,845,496
2. Failures due to server-side issue = 818
3. Failures due to client-side issue = 200
4. Failures % = 0.001
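
The failure percentage can be cross-checked from the three counters. The sketch below assumes that
the dashboard divides all failures by the total number of requests (successes plus failures); that
formula and the function name are assumptions made for this example, although the result agrees
with the figures listed above when rounded to three decimal places.

```python
def failure_percent(successful, server_failures, client_failures):
    """Failures as a percentage of all requests (assumed dashboard formula)."""
    total = successful + server_failures + client_failures
    if total == 0:
        return 0.0
    return 100.0 * (server_failures + client_failures) / total


pct = failure_percent(174_845_496, 818, 200)
print(f"{pct:.3f}")  # prints 0.001
```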

It also provides a summary of latency. For example, the above shows that the system has:

1. Read requests have p50 = 10.86 ms
2. Read requests have p90 = 2.57 s

Note: The above value of p50 means 50% of total read requests took less than 10.86 ms. A value of 2.57
seconds for p90 means 90% of read requests took less than 2.57 seconds. It doesn’t mean that those
requests really took 2.57 seconds each. It simply means 90% of requests took less than 2.57 s.
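
To make the meaning of p50 and p90 concrete, here is a minimal Python sketch of a nearest-rank
percentile over a list of latencies. The sample latencies are made up, and Grafana may use a
different percentile estimator, so treat this as an illustration of the concept rather than the
dashboard's exact calculation.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical read latencies in milliseconds: mostly fast, two slow outliers.
latencies_ms = [8, 9, 10, 11, 12, 9, 10, 2500, 2600, 11]
print(percentile(latencies_ms, 50), percentile(latencies_ms, 90))
```

In this sample most requests finish in about 10 ms, yet p90 is 2500 ms because two slow requests
sit in the top of the distribution. A high p90 next to a low p50 therefore points to a slow tail
rather than uniformly slow reads.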

The graphs are plotted using 5-minute intervals, so the bandwidth graph provides information on
read/write request sizes for every 5 minutes. The legend also provides a summary of max/min/avg
values for the selected time range.

In same dashboard, we can further drill down on the type of requests. For example, below will
categorize the successful/failed requests based on method (GET/PUT/HEAD etc.) or protocol (S3/CAS
etc.) or error code (500/404 etc.)

This “Data Access Performance” dashboard is also available with namespaces, nodes and protocols
category (separate dashboard for each).

➢ Process Health – Overview



This shows a summary of resource utilization at the VDC level in the selected time range (on the top
right). For example, the above shows that the system has:

1. Avg. CPU Utilization = 5.19%
2. Avg. Memory Usage = 43.39 GiB
3. Relative Memory = 70.66%

There are graphs below which show a trend of each of the above metrics over selected time range.

The “Process Health” dashboard is also available with process and nodes category (separate dashboard
for each).

Like “Data Access Performance” and “Process Health”, there are dashboards available to monitor
disk health as well: “Disk Bandwidth – Overview” and “Disk Bandwidth – By Nodes”.

➢ OE Dashboards

These dashboards are only available with emcservice/emcmonitor account. They provide further
insights in ECS which can help in troubleshooting:

Some OE Dashboards are discussed below:

➢ (OE) DT status

This dashboard provides the status of Directory Tables (DTs).



The first graph shows status of all DTs over selected time range. The second graph shows the
unready/unknown (if any) DT type and count.

If a user/client complains of access/latency issues, the first thing to check is the DT status during
that time period, which can be found quickly using the above graphs. The output is similar to “svc_dt
check”, discussed in the CLI section later in the document. Unready DTs can also be caused by service
restarts. You can compare DT unready times with service restarts to see if they are related events.

The dashboard also shows DT distribution at node level:

This output is similar to “svc_dt dist” discussed in the CLI section later in the document. It shows how
balanced the DTs are across all the nodes in the VDC (ECS cluster). Note that the output should be
“well” balanced based on the number of nodes in your VDC. A node with very low or no DTs assigned
is an indication of an issue with that node.

➢ (OE) Service Restarts

As the name suggests, this dashboard provides an overview of service restarts happening on cluster.

The above graph shows that there were a few service restarts in the selected time range. The legend
below the graph shows the service name and the total count of restarts in the time range. The
hostnames on which the service restarts happened can also be found in the legend.

When troubleshooting any performance issue, always first check for service restarts and DT status for
that time period. The above graph can help with that.

On the top left, there are also dropdowns for hostnames and service names, which can be used to
monitor restarts on a specific host or for a specific service.

➢ (OE) Processes on host

This dashboard shows the resource utilization by ECS services on each host:

The above graph shows the memory and swap utilization of each service. It also shows the number of
open file descriptors (fds) for each service.

In the same dashboard, there are a few other graphs which show thread count, CPU utilization, and disk
I/O by each service in the selected time range.

➢ (OE) Node system metrics

This is similar to “Processes on host” but it provides the resource utilization at node level rather than
at process level. We can monitor memory, swap, fds and disk space usage at node level:

Using the CLI

Leveraging Service Tools


The service tools are installed by default and can be run from any directory.

ECS Version
➢ svc_version

The svc_version script can be run to check the version of ECS and of other components, as you can see
in the screenshot below.

➢ svc_version -h

KPI Script
The KPI script provides various metrics around key performance indicators within ECS, such as
number of requests, latency, and MB/s, among others. The script has various options that can be set to
give different outputs. Every environment is different; therefore, it’s important to run these
commands frequently in order to baseline what normal behavior looks like.

View these by running help:

➢ kpi.sh -h

Without specifying any options, the default output is based on the past 60 minutes and displays the
long form output.

➢ kpi.sh

The output of this command includes the following sections. Look for high latency or poor performance
that may be having an impact:

• Overall Request Latency (ms)
• Request Latency Distribution (number of requests in each range)
• Request Sizes
• GET Latency Distribution (per request size)
• PUT Latency Distribution (per request size)
• Rate Statistics (per node)
• GET Extended stats (per request size)
• PUT Extended stats (per request size)
• Ingest Statistics (per node)

Typically, during initial troubleshooting, the option -min is set to “x minutes ago” or n for “x hours
ago”. Another common option is -s, which gives a shortened summary output shown below.

➢ kpi.sh -s -min 30

Here it is important to look for a balance across nodes and to check whether there is a large number
of 500 errors. Typically, a DT issue will impact all nodes.
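
The following Python sketch illustrates that reasoning: if every node reports 500 errors, a DT issue
is a plausible first suspect, whereas errors concentrated on a subset of nodes suggest looking at
those nodes first. The node names, the counts, and the helper function are hypothetical and do not
reflect the actual kpi.sh output format.

```python
def error_spread(errors_by_node, min_errors=1):
    """Return 'all', 'some', or 'none' depending on how many nodes
    report at least min_errors 500-level errors."""
    affected = [node for node, count in errors_by_node.items() if count >= min_errors]
    if not affected:
        return "none"
    return "all" if len(affected) == len(errors_by_node) else "some"


# Hypothetical per-node 500 error counts copied by hand from a summary.
cluster_wide = {"node1": 120, "node2": 98, "node3": 131, "node4": 110}
single_node = {"node1": 0, "node2": 0, "node3": 240, "node4": 0}
print(error_spread(cluster_wide), error_spread(single_node))
```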

The command below combines a few options so that errors (403 errors, for example) are gathered for a
specific time period (in this case, from two days ago until now):

➢ kpi.sh -s -start '2 days ago' -end 'now' -errs

You can also run this command against a particular bucket if you know what the application is using.

➢ kpi.sh -s

➢ kpi.sh -s -cas (applicable for customers that use CAS)

ECS CAS error codes


http://doc.isilon.com/ECS/3.5/DataAccessGuide/GUID-E6C318F6-E2FB-438E-AF96-016EC52D9048.html?hl=ecs%2Ccas%2Cerror%2Ccodes

Value | Error Name | Description
10020 | FP_NO_POOL_ERR | It was not possible to establish a connection with a cluster. The server could not be located. This means that none of the IP addresses could be used to open a connection to the server or that no cluster could be found that has the required capability. Verify your LAN connections, server settings, and try again.
10021 | FP_CLIP_NOT_FOUND_ERR | Could not find the referenced C-Clip in the cluster. Returned by FPClip_Open(), it means the CDF could not be found on the server. Verify that the original data was correctly stored and try again.
10036 | FP_BLOBIDMISMATCH_ERR | The blob is corrupt: a BlobID mismatch occurred between the client and server. The Content Address calculation on the client and the server has returned different results. The blob is corrupt. If FPClip_Open() returns this error, it means the blob data or metadata of the C-Clip is corrupt and cannot be decoded.
10101 | FP_SOCKET_ERR | An error on the network socket occurred. Verify the network.
10153 | FP_AUTHENTICATION_FAILED_ERR | Authentication to get access to the server failed. Check the profile name and secret.
10201 | FP_OPERATION_REQUIRES_MARK | The application requires marker support but the stream does not provide that.
10204 | FP_OPERATION_NOT_ALLOWED | The use of this operation is restricted or this operation is not allowed because the server capability is false.

ECS S3 error codes


http://doc.isilon.com/ECS/3.2/DataAccessGuide/ecs_r_s3_error_codes.html

Error Code | HTTP Status Code | Generic Error Code | Description
AccessDenied | 403 | AccessDenied | Access Denied
BadDigest | 400 | BadDigest | The Content-MD5 you specified did not match that received.
BucketAlreadyExists | 409 | BucketAlreadyExists | The requested bucket name is not available. The bucket namespace is shared by all users of the system. Please select a different name and try again.
BucketNotEmpty | 409 | BucketNotEmpty | The bucket you tried to delete is not empty.
ContentMD5Empty | 400 | InvalidDigest | The Content-MD5 you specified was invalid.
ContentMD5Missing | 400 | InvalidRequest | The required Content-MD5 header for this request is missing.
EntityTooSmall | 400 | EntityTooSmall | The proposed upload is smaller than the minimum allowed object size.
EntityTooLarge | 400 | EntityTooLarge | The proposed upload exceeds the maximum allowed object size.
IncompleteBody | 400 | IncompleteBody | The number of bytes specified by the Content-Length HTTP header were not provided.
InternalError | 500 | InternalError | An internal error was encountered. Please try again.
ServerTimeout | 500 | ServerTimeout | An internal timeout error was encountered. Please try again.
InvalidAccessKeyId | 403 | InvalidAccessKeyId | The Access Key Id you provided does not exist.
InvalidArgument | 400 | InvalidArgument | Invalid Argument.
NoNamespaceForAnonymousRequest | 403 | AccessDenied | ECS could not determine the namespace from the anonymous request. Please use a

SVC_REQUEST
➢ svc_request -h

➢ svc_request -s 404 errorsummary



We can filter the errors based on the type of HTTP request with the -t option in the above command.
The example below shows the errorsummary for the different types of HTTP operations that returned a
404 error.

➢ svc_request -on 465abb83-5804-4fb5u97ee-0c5f0a9b9395 summary

If we know the name of the object in question, we can search for the transactions related to that
particular object with the -on (object name) option (see below). This object was uploaded using
multipart upload, so we can see all the transaction details for this object.

Check Directory Tables (DTs)


One of the most common items to check when experiencing issues on ECS is the status of the Directory
Tables (DTs).

ECS stores the metadata of important artifacts like buckets, namespaces, and objects in the form of
"Directory Tables" (DTs). DTs are comparable to the database in a traditional application. There are
several Directory Tables in ECS that store specific types of data:

• OB - Object table. The object related information.


• LS - Listing table. The listing entry related information. For example, all keys under one bucket
have one entry in the LS table. S3 bucket listing requests will go to the LS table.
• CT - Chunk table.
• BR - Btree Reference table.
• SS - Storage Space table. Maintains the disk block usage (allocation/free) information.
• PR - Partition Record table. Stores the DT record information.
• RT - Resource table. This is a special system table for the system related information, such as
replication group, namespace, and bucket.
• ET - Event table. This table is used to store system events like AUDITs and ALERTs.
• MA - Metering Aggregate table. Saves the aggregated metering information.
• MR - Metering Raw table. Saves the raw metering information that is later aggregated in MA
table.
• RR - Repo chunk reference table. Contains Repo chunk (Object chunk) reference information.

➢ svc_dt -h

Look to see if any DTs are unknown or in an unready state. If you see any that are down or haven’t
been checked recently, open a case and inform support. Occasionally, there may be one that is
unready; however, if it remains unready over multiple checks, open a case and inform support.
Note that eight or more unknown or unready DTs trigger an Alert which is sent to Dell EMC.

➢ svc_dt check

The “svc_dt check -f” option can be used to query the DT status manually but, as you can see below,
without the -f option the Auto DT check status is reported. Pay attention to the timestamp on the
left: the latest DT status is reported at the top.
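
The guidance on unready DTs can be summarized as a small decision rule. The Python sketch below
takes the combined unknown and unready counts from consecutive checks (entered by hand, oldest
first) and suggests an action. The threshold of eight comes from this guide; treating three
consecutive non-zero checks as "sustained" is an assumption made for this example, and the
function does not parse real svc_dt output.

```python
ALERT_THRESHOLD = 8  # from the guide: eight or more unknown/unready DTs trigger an alert


def assess_dt_checks(unready_counts, sustained_checks=3):
    """unready_counts: unknown + unready DT totals from consecutive
    checks, oldest first. sustained_checks is an illustrative choice."""
    latest = unready_counts[-1] if unready_counts else 0
    recent = unready_counts[-sustained_checks:]
    sustained = len(recent) == sustained_checks and all(c > 0 for c in recent)
    if latest >= ALERT_THRESHOLD or sustained:
        return "open a case"
    if latest > 0:
        return "recheck"
    return "ok"


print(assess_dt_checks([0, 0, 2]))   # a single unready reading: check again
print(assess_dt_checks([1, 2, 1]))   # unready across three checks: sustained
```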

Another useful DT command is “svc_dt dist” which shows how balanced the DTs are across all the
nodes in the VDC (ECS cluster). Note that the output should be “well” balanced based on the number
of nodes in your VDC. A node with very low or no DTs assigned is an indication of a node problem.

➢ svc_dt dist (use -f for a real-time check)
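
As a sketch of what "well balanced" could mean in practice, the Python function below flags nodes
whose DT count is far below an even per-node share. The node names, the counts, and the 50% cutoff
are assumptions made for this example; svc_dt dist does not define such a cutoff in this guide.

```python
def underloaded_nodes(dt_counts, min_share=0.5):
    """Return nodes holding fewer DTs than min_share of the even
    per-node share. min_share=0.5 is an illustrative cutoff."""
    if not dt_counts:
        return []
    even_share = sum(dt_counts.values()) / len(dt_counts)
    return sorted(node for node, count in dt_counts.items()
                  if count < min_share * even_share)


# Hypothetical DT counts per node, entered by hand.
dist = {"node1": 400, "node2": 410, "node3": 390, "node4": 0}
print(underloaded_nodes(dist))  # prints ['node4']
```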

Service Restarts
Continuous service restarts can also have an impact on ECS health, so it is important to see if any
services are getting restarted. Some of the ones to focus on include the following:

• blobsvc – Manages the following tables: Object (OB), Listing (LS), and Repo Chunk Reference
(RR)
• cm - Manages the following tables: Chunk (CT), Btree Reference (BR). Provides the logic to
handle various events based on the chunk's current state and decide which state to transition
to next.
• objcontrolsvc - Provides REST APIs for configuring the ECS cluster, managing ECS resources,
and monitoring the system.
• vnest - Provides distributed synchronization and group services. A subset of data nodes will be
group members responsible for serving the key/value requests. VNest services running on
other nodes will listen for configuration updates and be ready to be added to the group.

➢ svc_node -h

If these are getting restarted repeatedly, then there will be an impact to I/O and an SR should be
opened.

➢ svc_node services -showrestarts
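
To illustrate what "restarted repeatedly" might look like once restart events have been collected,
the following Python sketch counts restarts per node and service for the four services listed
above. The event list is hypothetical and entered by hand, and the limit of three restarts is an
assumption for this example, not an ECS-defined threshold.

```python
from collections import Counter

# Services this guide suggests focusing on.
WATCHED = {"blobsvc", "cm", "objcontrolsvc", "vnest"}


def repeated_restarts(events, min_restarts=3):
    """events: iterable of (node, service) pairs, one per observed restart.
    Returns {(node, service): count} for watched services that restarted
    at least min_restarts times (an illustrative limit)."""
    counts = Counter((node, svc) for node, svc in events if svc in WATCHED)
    return {key: n for key, n in counts.items() if n >= min_restarts}


# Hypothetical restart events.
events = [("node1", "blobsvc"), ("node1", "blobsvc"), ("node1", "blobsvc"),
          ("node2", "cm"), ("node3", "otherservice")]
print(repeated_restarts(events))
```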

Replication Status and TSO


For statistics (outside the UI) that help determine potential issues with replication, there are a
couple of commands that provide additional detail. Typically, these commands are used after an RPO
alert has been triggered and the UI shows that something may be stuck. Common issues that affect
replication are WAN outages or WAN saturation.

➢ svc_replicate -h

When running a summary, it is important to look at current rates (per node), TSO, and what is pending
(typically, pending chunks never reach zero since there is always something replicating).

➢ svc_replicate summary
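
Because pending chunks rarely reach zero, the direction of the backlog over several summaries is
more informative than any single value. The Python sketch below checks whether a series of pending
counts, noted by hand from consecutive runs, is strictly growing. The function name and the sample
values are hypothetical.

```python
def backlog_growing(pending_samples):
    """True if each successive pending-chunk sample is higher than
    the one before it (samples ordered oldest first)."""
    if len(pending_samples) < 2:
        return False
    return all(later > earlier
               for earlier, later in zip(pending_samples, pending_samples[1:]))


print(backlog_growing([1200, 1500, 2100, 3400]))  # steadily rising backlog
print(backlog_growing([1200, 900, 1100, 950]))    # fluctuating backlog
```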

To check the current Temporary Site Outage (TSO) state, the command below and its options
provide insight into the TSO state along with heartbeat and task status:

➢ svc_tso -h

➢ svc_tso summary

Capacity

➢ svc_vdc capacity

➢ svc_vdc trend

Space Reclamation/Garbage Collection


To check further detail around garbage collection statistics (outside the UI), there is a command that
can provide a breakdown. This includes numbers for the two different types of garbage collection
that run on ECS: repo (user data) and btree (metadata/index).

➢ svc_gc -h

If there is concern about the rate at which deleted data is reclaimed, the rates reclaim option below
will display the daily reclaim rate for repo and btree data. For example, if your applications are
deleting 1TB per day and the reclaim rate is only 1GB per day, open an SR to investigate further.

➢ svc_gc rates reclaim
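
The comparison described above (deletion rate versus reclaim rate) can be sketched as a simple
ratio check in Python. The function name and the 50% minimum ratio are assumptions made for this
example; the guide itself only gives the 1TB versus 1GB case as a clear reason to open an SR.

```python
def reclaim_keeps_up(deleted_gb_per_day, reclaimed_gb_per_day, min_ratio=0.5):
    """True if the daily reclaim rate is at least min_ratio of the daily
    deletion rate. min_ratio=0.5 is an illustrative value."""
    if deleted_gb_per_day <= 0:
        return True
    return reclaimed_gb_per_day / deleted_gb_per_day >= min_ratio


# The guide's example: 1 TB (1024 GB) deleted per day, only 1 GB reclaimed.
print(reclaim_keeps_up(1024, 1))  # prints False
```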



When looking at repo, the stats command provides two sections of output.

The first covers statistics broken down by full and partial garbage in terms of capacity. The other
does the same but in terms of chunks.

Keep in mind full garbage is when an entire chunk (128MB in size) is marked for 100% deletion. Partial
garbage is when less than 100% of a chunk is marked for deletion. For example, you can have a
chunk that is 1/4 marked for deletion or 1/2 marked for deletion.

Furthermore, there are two types of partial garbage referred to as eligible and ineligible. Partial
eligible is when a chunk has been marked for at least 2/3 deletion. In this case, ECS will take the
remaining 1/3 and move it to another chunk which frees up 100% of the original chunk. Partial
ineligible is when the chunk is marked for less than 2/3rds deletion, in which case it will remain on the
system until it meets the defined threshold.
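
The full, partial eligible, and partial ineligible categories described above can be expressed as
a small classification function. The Python sketch below uses exact fractions so that the 2/3
threshold is compared without floating-point rounding. The function name and the "no garbage"
label are assumptions made for this example.

```python
from fractions import Fraction

ELIGIBLE_THRESHOLD = Fraction(2, 3)  # from the guide: at least 2/3 marked for deletion


def classify_chunk(garbage_fraction):
    """Classify a chunk by the fraction of it that is marked for deletion."""
    f = Fraction(garbage_fraction)
    if not 0 <= f <= 1:
        raise ValueError("garbage_fraction must be between 0 and 1")
    if f == 1:
        return "full"
    if f >= ELIGIBLE_THRESHOLD:
        return "partial eligible"
    if f > 0:
        return "partial ineligible"
    return "no garbage"


for frac in (Fraction(1, 4), Fraction(1, 2), Fraction(2, 3), Fraction(3, 4), Fraction(1)):
    print(frac, classify_chunk(frac))
```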

It is important to notice if you have a large amount of garbage stuck in reclaim (especially if it
continues to increase rather than decrease). This information will help support understand if
something may be stuck or if various parameters should be changed/modified.

➢ svc_gc stats repo

Stats can also be run for btree:

➢ svc_gc stats btree



Networking
Although there are some networking statistics in the UI, failures are not one of them. However, there
are various statistics that can be pulled using the CLI (xDoctor also has alerts).

➢ svc_network -h

To check if a NIC is down or unavailable, run the following command. The screen shot below is for one
node, but all nodes are displayed when running the command.

➢ svc_network show int



➢ svc_network check all

This runs a network check within the local VDC (LOCAL), from where svc_network was triggered. In the
subsequent screenshot below, you can see the network connectivity status between the VDCs.

➢ svc_network summary

xDoctor
xDoctor is a tool used by Dell Customer Support to monitor, report on, and troubleshoot the health of
your ECS Appliance. Keeping xDoctor updated to the most current version enables Dell EMC Customer
Support to more quickly detect and resolve issues with your ECS Appliance.

The latest version is always available using the "xdoctor --upgrade --auto --now" option if the
customer's ECS system can establish a connection to ftp.emc.com. If not, the latest version can be
downloaded via dell.com/support (ECS Appliance / Drivers & Downloads / Category=Product Tool).

➢ sudo xdoctor -h

➢ Search for xDoctor rpm on Dell support site

https://www.dell.com/support/home/en-us/product-support/product/ecs-appliance-software-with-encryption/drivers

➢ Download latest version (direct link) (v68 as of Nov 2020):

https://dl.dell.com/downloads/DL97688_xDoctor4ECS-4.8-68.rpm
Contact Dell Customer Service if you cannot access the above link.

➢ Upgrade to the version in question via:

sudo xdoctor --upgrade --local=/home/admin/xDoctor4ECS-4.8-68.noarch.rpm

➢ sudo xdoctor -s (this checks the version)

➢ sudo xdoctor (this runs a standard health check on the rack in question)

➢ sudo <Session Report> -CEW (this prints the Critical/Error/Warning messages of the report in
question)

➢ How do I configure xDoctor to send xDoctor Reports to Customers via Email?

Please follow the steps below.

admin@provo-yellow:~> sudo xdoctor --config
┌────────────────────────────┐
│ xDoctor Configuration Menu │
└───┬────────────────────────┘
┌───┼──────────┐
│ 1 │ Overview │
└───┼──────────┘
┌───┼────────────────────┐
│ 2 │ Reports and Events │
└───┼────────────────────┘
┌───┼─────────────┐
│ 3 │ Auto Update │
└───┼─────────────┘
┌───┼────────────────┐
│ 4 │ Data Scrubbing │
└───┼────────────────┘
┌───┼─────────────────────┐
│ 5 │ ECS API Credentials │
└───┼─────────────────────┘
┌───┼───────────────┐
│ 6 │ IPMI Analysis │
└───┼───────────────┘
┌───┼──────┐
│ 0 │ Exit │
└───┴──────┘

Please make a choice: 2

┌────────────────────────────┐
│ xDoctor Reports and Events │
└───┬────────────────────────┘
┌───┼───────────────────────────────┐
│ 1 │ Reports and Events to DellEMC │
└───┼───────────────────────────────┘
│ Status = Enabled
│ Channel = SMTP via SRS
└┬─
│ SRS 1 ID = e7ec9fbb-d0ae-4e09-a192-06b9aa8ce2d8
│ SRS 1 Host = IP_ADDRESS
│ SRS 1 Port = 9443
│ SRS 1 State = CONNECTED
│ SRS 1 Msg = Communication with srs succeeds
│ SRS 1 S/N = SERIAL_NUMBER
┌───┬┴───────────────────┐
│ 2 │ Events to Customer │
└───┼────────────────────┘
│ Status = Disabled
└┐
┌───┬┴──────────┐
│ 0 │ Main Menu │
└───┴───────────┘


Please make a choice: 2

Send xDoctor Events to Customer? [No]: Yes
Email Recipient (single) []: [email protected] <- Enter customer's email address or mailing list here
Add another Recipient? [No]:
Recipient (1): [email protected]
Dedicated SMTP Server [Server_name or IP_address:port] []: earth.sol.galaxy:25 <- Enter customer's SMTP server here
Email From [[email protected]]:
Enable TLS? [No]:
Do you want to use a fixed subject? [No]:
Do you want to use a subject prefix? [No]:
Do you want to use a subject suffix? [No]:

Send xDoctor Events to Customer = Yes
|- Recipients = [email protected]
|- SMTP Server = earth.sol.galaxy:25
|- TLS = False
|- Email From = [email protected]

> Issue new Settings? [No]: Yes

➢ Contact Dell EMC Customer Service (i.e. create a SR) for any “Critical” or “Error” messages that
cannot be resolved/require more in-depth investigation. “Warning” messages do not typically
need any attention.

➢ xDoctor Release Notes (version 68):


➢ https://dl.dell.com/content/docu97687_xDoctor_ReleaseNotes_4.8-68.pdf?language=en_US

Real World Examples:

Performance Related Scenarios:


❖ Scenario 1: Customer complained about timeouts when reading/writing during a given time interval

Things to check/do:
First, check for any DT down events or service restarts in the reported time frame.
• Check DT status using the “(OE) DT Status” dashboard in the “Advanced Monitoring” section.
Make sure to cover the time range reported by the user.
• Check for any service restarts in the given time range using the “(OE) Service Restarts” dashboard in
the “Advanced Monitoring” section.

In most cases, performance issues are caused by DT-related events or service restarts. A service
restart also causes certain DTs to go down for some time while the service comes back up. If a
service restarted (mainly dataheadsvc, blobsvc, or cm), that would explain the latency/timeouts
experienced by the user at that time. You can tell the user that a service restart event occurred
and caused performance issues during that time. Please contact DellEMC Support for further help.

❖ Scenario 2: Customer complaining of latency issue

Things to check/do:
Latency issues are mostly due to memory pressure on ECS object services. In addition to the checks
in the first scenario, verify the following:
• Using the “Data Access Performance - Overview” dashboard, verify whether there was a sudden spike
in the number of requests at that time. A sudden increase in the number of requests may cause memory
pressure and lead to latency issues. Confirm with the application owner whether the spike was
expected.
• Verify that requests are balanced across nodes, i.e., all nodes are receiving the same
number of requests.
• Using the “(OE) Processes on Host” dashboard, verify that resource usage is normal.
• Check for any service restarts.
• Open a ticket with DellEMC Support for further help.
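
The request-balance check can be sketched in a few lines of Python. This is only an illustration: the per-node counts would have to be read manually from the dashboard, and the function name, the node names, and the 25% tolerance are assumptions, not ECS values. The median is used as the baseline so that one overloaded node does not distort the comparison.

```python
from statistics import median


def unbalanced_nodes(requests_per_node, tolerance=0.25):
    """Return nodes whose request count deviates from the median by more
    than `tolerance` (a fraction). The tolerance is an assumed value."""
    baseline = median(requests_per_node.values())
    if baseline == 0:
        return []
    return sorted(
        node
        for node, count in requests_per_node.items()
        if abs(count - baseline) / baseline > tolerance
    )


# Hypothetical request counts copied from the dashboard for one interval.
counts = {"node1": 1000, "node2": 1050, "node3": 980, "node4": 2970}
print(unbalanced_nodes(counts))
```

A node that is flagged here is receiving a noticeably different share of traffic, which usually points to the load balancer configuration rather than to ECS itself.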

❖ Scenario 3: Customer noticed the average write latency has gone up in the last 2 hours.
Things to check/do:

Note that only the write latency has increased, not the read latency. If large files are being
uploaded, it is expected that uploads take longer. Check the transactions for the last 2 hours
using svc_request -start "2 hours ago" -stop "now" summary and check whether the objects being
uploaded are unusually large. Please see the screenshot below for more details.

Object Read/write Related Scenarios:


❖ Scenario 4: Customer is not able to write and getting HTTP 403, Access Denied error code.

Things to check/do:
HTTP error code 403 means “Access Denied” in most cases. The 403 errors can be verified using the
command kpi.sh -s -start "X mins ago" shown in the CLI section. There can be multiple causes, but
the main things to verify are:
• Check that the user has the correct permissions and is using the correct credentials. Check
permissions in UI->Manage->Buckets: select the namespace/bucket, edit the bucket, edit the ACL, and
review the user ACL.
• Check that the time on the client side is in sync with the time on the ECS nodes.
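
One way to check the client clock is to compare it with the Date header of any HTTP response returned by ECS. The sketch below assumes you have already captured that header value. The function names and the 300-second tolerance are illustrative assumptions, so check the allowed skew for your ECS version.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def clock_skew_seconds(server_date_header, client_now):
    """Seconds by which the client clock is ahead of the server clock."""
    server_time = parsedate_to_datetime(server_date_header)
    return (client_now - server_time).total_seconds()


def is_in_sync(server_date_header, client_now, max_skew_seconds=300.0):
    """True if the skew is within the tolerance (assumed, not an ECS value)."""
    skew = clock_skew_seconds(server_date_header, client_now)
    return abs(skew) <= max_skew_seconds


# Example with a hand-written header value and a fixed client time.
header = "Mon, 30 Nov 2020 10:00:00 GMT"
client = datetime(2020, 11, 30, 10, 2, 0, tzinfo=timezone.utc)
print(clock_skew_seconds(header, client), is_in_sync(header, client))
```

In practice `client_now` would be `datetime.now(timezone.utc)` taken at the moment the response is received.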

❖ Scenario 5: Customer is not able to write and getting HTTP 403, Method Forbidden error code

Things to check/do:
HTTP error code 403 may also indicate a “Method Forbidden” error. The 403 errors can be verified using
the command kpi.sh -s -start "X mins ago" shown in the CLI section. This is mostly caused by the quota
limit being exceeded for the bucket. Verify the following:
• From UI, check quota limit set for the bucket (UI->Manage->Buckets)
• From UI, check quota limit set for the namespace (UI->Manage->Namespace)
• Check current capacity utilization of bucket using Metering (UI->Monitor->Metering) or using
svc_bucket info <bucket_name>
• Increase quota limit if needed or inform client of usage limit
• Open a case with Dell EMC Support if the limit has not been reached but a user is still getting a
quota limit reached error.

❖ Scenario 6: Customer is not able to read few objects and getting HTTP 404 return code.

Things to check/do:
HTTP error code 404 means the object was not found on ECS. Verify the following:
• Run svc_request -on $OBJECTNAME summary for the object in question and confirm that 404 is
returned for the GET operation on this object.
• Check whether the object was ever written to ECS using application logs.
• Check whether the last update on the object has a dmarker (if dmarker is true, it is a deleted
object and 404 is expected).

❖ Scenario 7: Customer is not able to delete an object; HTTP 409 error code was returned.

Things to check/do:
If you get a 409 error when trying to delete an object, the object is under a retention
period and cannot be deleted.
• Verify the bucket retention period using the ECS REST API: GET
/object/bucket/{bucketName}/retention
• Check with the bucket owner and modify the policy if needed

Please note that retention can be set at the namespace, bucket, and object level, and the maximum
retention value is enforced. So the retention setting must be checked at all three levels.
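
The "maximum value wins" rule can be illustrated with a small Python sketch. The function names and the use of seconds as the unit are assumptions made for the example. The values themselves would come from the namespace, bucket, and object settings you looked up.

```python
def effective_retention_seconds(namespace=0, bucket=0, obj=0):
    """Effective retention is the largest of the three configured values."""
    return max(namespace, bucket, obj)


def delete_allowed(object_age_seconds, namespace=0, bucket=0, obj=0):
    """Assumed check: a delete succeeds only once the object is older than
    the effective retention period."""
    return object_age_seconds >= effective_retention_seconds(namespace, bucket, obj)


# Bucket retention is one day, but the namespace enforces seven days.
one_day = 24 * 60 * 60
print(effective_retention_seconds(namespace=7 * one_day, bucket=one_day))
print(delete_allowed(2 * one_day, namespace=7 * one_day, bucket=one_day))
```

In this example a two-day-old object still returns 409 on delete even though the bucket retention has expired, because the namespace-level value is larger.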

❖ Scenario 8: Customer was unable to write to ECS.

Things to check/do:

The status of the DTs was checked using the svc_dt check tool, and all the DTs were found to be ready.
The error report was then checked using kpi.sh -s -start "5 mins ago", which showed that
only writes (PUT and POST) were hitting errors, not reads (GET). Service status was also checked
using the svc_node tool, and none of the services were restarting.

Since reads were fine, the DTs were ready, and there were no service restarts, capacity was checked
using the svc_vdc capacity tool. It showed that there was no free space left, which is why writes
were failing.

If the overall used capacity is at 90%, writes are not allowed. Please note that a minimum of 3
nodes whose overall capacity is less than 90% is required for a successful write.
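
The capacity rule above can be expressed as a short Python sketch. The 90% limit and the minimum of 3 nodes come from this guide. The function names and the per-node percentages are hypothetical and would be filled in from the capacity output.

```python
WRITE_LIMIT_PERCENT = 90.0  # from this guide: writes stop at 90% used
MIN_WRITABLE_NODES = 3      # from this guide: at least 3 nodes below the limit


def writable_nodes(used_percent_by_node):
    """Nodes whose used capacity is still below the write limit."""
    return sorted(
        node
        for node, used in used_percent_by_node.items()
        if used < WRITE_LIMIT_PERCENT
    )


def writes_expected_to_succeed(used_percent_by_node):
    """True if enough nodes remain below the write limit."""
    return len(writable_nodes(used_percent_by_node)) >= MIN_WRITABLE_NODES


# Hypothetical used-capacity percentages for a 5-node VDC.
usage = {"node1": 93.1, "node2": 91.4, "node3": 88.0, "node4": 92.7, "node5": 89.5}
print(writable_nodes(usage), writes_expected_to_succeed(usage))
```

Here only two nodes are below 90%, so writes would be expected to fail even though some free space is still reported.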

❖ Scenario 9: Application reports 501 errors.

Things to check/do:

Running the command kpi.sh -s -start '6 hours ago' reports 501 errors in the summary
report.

In this instance, dataheadsvc.log showed that the application was requesting logging,
requestPayment, tagging, and website, which are not supported/implemented, and hence ECS returns a
501 error.

This behavior is expected when the requested functionality is not implemented, and it is documented
in the error code page. The application should be updated to stop calling those APIs or to expect a
501 error code from ECS.

Refer to the “Unsupported S3 API” section in the data access guide:

http://doc.isilon.com/ECS/3.5/DataAccessGuide/GUID-CA0B1CAA-35BA-433D-8EB3-304DB47BE3CC.html
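
On the application side, a quick way to find the offending calls is to scan request URLs for the subresources named in this scenario. The Python sketch below does only that. The set contains just the four subresources mentioned above and is not a complete or version-accurate list, and the host name and function name are placeholders.

```python
from urllib.parse import parse_qs, urlsplit

# Only the subresources named in this scenario; consult the data access
# guide for the authoritative list for your ECS version.
UNSUPPORTED_SUBRESOURCES = {"logging", "requestPayment", "tagging", "website"}


def unsupported_subresources(url):
    """Return the unsupported subresources present in the URL query string."""
    params = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return sorted(UNSUPPORTED_SUBRESOURCES & params.keys())


# Placeholder endpoint and bucket names.
print(unsupported_subresources("https://ecs.example.com/bucket1?website"))
print(unsupported_subresources("https://ecs.example.com/bucket1/key1?tagging&versionId=1"))
print(unsupported_subresources("https://ecs.example.com/bucket1?acl"))
```

Any request for which this returns a non-empty list is a candidate for the 501 responses seen in the summary report.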

❖ Scenario 10: Customer reported 500 errors.

Things to check/do:

The customer had 5 nodes, and a capacity expansion was done because of a capacity issue. Soon after
the node expansion was complete, the customer started seeing 500 errors. kpi.sh -s -start "5 mins ago"
was run to confirm that 500 errors were actively being logged.

The expansion node was on the same ECS software version as the others, and there were no service
restarts or DT unready issues.

svc_network check all and the latest version of xdoctor were run and detected a duplicate IP
address in the network that was causing the issue. The customer shut down the VM that had been
assigned the same IP address, and after that 500 errors were no longer reported.

❖ Scenario 11: Customer reported 500 errors.

Things to check/do:

kpi.sh -s -start "5 mins ago" was executed to check the error status and showed an error rate of 17%
(the kpi.sh tool shows the error rate as well).
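
For reference, an error rate like the one above is simply the share of requests that returned a server error. The Python sketch below shows the arithmetic on hypothetical status-code counts. It is not how kpi.sh is implemented, and treating every code of 500 or above as an error is an assumption.

```python
def error_rate_percent(status_counts, error_floor=500):
    """Percentage of requests whose status code is at or above `error_floor`."""
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    errors = sum(
        count for code, count in status_counts.items() if code >= error_floor
    )
    return 100.0 * errors / total


# Hypothetical counts per HTTP status code for a 5-minute window.
print(error_rate_percent({200: 83, 500: 17}))
```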

Using svc_dt check tool, status of the DT was checked and it was found that all the DTs were ready.

svc_network check all and svc_tso heartbeat reported connection issues to the remote VDC. If the
connection/heartbeat between the VDCs in a federation is not working for 15 minutes (the default,
which is configurable), a Temporary Site Outage (TSO) is triggered. There was an issue with the
switch on the customer side, and the vendor was engaged to resolve the network issue between the 2
VDCs.

svc_tso summary was run to check the TSO status and showed a TSO condition. Once the network issue
was resolved, the system came out of TSO and the 500 errors were no longer reported.
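
The timing rule for TSO can be sketched as follows. The 15-minute default comes from this guide. The function name and the timestamps are illustrative, and the real decision is made inside ECS rather than by a script like this.

```python
from datetime import datetime, timedelta

TSO_THRESHOLD = timedelta(minutes=15)  # default per this guide; configurable


def tso_expected(last_heartbeat, now, threshold=TSO_THRESHOLD):
    """True if the heartbeat has been missing for at least `threshold`."""
    return now - last_heartbeat >= threshold


# Illustrative timestamps: last good heartbeat at 10:00.
last_ok = datetime(2020, 11, 30, 10, 0, 0)
print(tso_expected(last_ok, datetime(2020, 11, 30, 10, 10, 0)))
print(tso_expected(last_ok, datetime(2020, 11, 30, 10, 16, 0)))
```

This helps when correlating the start of the 500 errors with the time the inter-VDC link went down.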

❖ Scenario 12: One of the customer applications is not able to write to ECS.

Things to check/do:

kpi.sh -s -start "5 mins ago" was executed and showed that all requests were successful. The end
user was asked to provide the name of one object that they were not able to write and the bucket it
belongs to.

svc_request -on $OBJECTNAME summary was run and found no request for this object, so
kpi.sh -s -bucket $bucketname was run and showed that there were no transactions at all for this
bucket.

Further investigation on the load balancer side revealed a network issue at the load
balancer, which was causing the issue.

The issue was resolved once the network problem at the load balancer was fixed.

Bucket Related Scenarios:


❖ Scenario 13: Customer is not able to delete bucket from ECS UI

Things to check/do:
A few important things to verify before deleting a bucket:
• Make sure the bucket is empty. If it is not, use S3 Browser (for an S3 bucket), or any other tool,
to delete the bucket contents first
• Check that the user has sufficient permission to delete the bucket

❖ Scenario 14: Customer wants to know which bucket is highest on capacity/objects

Things to check/do:
Check the “Top Buckets” dashboard in Advanced Monitoring. It shows a list of buckets (sorted by
capacity). The capacity shown for each bucket is at the VDC level, i.e., how much data was written to
this bucket on this VDC.

You can also view the count of objects in each bucket:

You can get similar information using svc_bucket info <bucket_name>, but that is federation-level
data, as opposed to the VDC-level data in the dashboard above.

Metering Related Scenarios:

❖ Scenario 15: End user complaining about a discrepancy in bucket utilization



Things to check/do:

Use svc_bucket info <bucket_name> to get the current object size and object count. Alternatively,
the same information is available from the UI:
• Log in to ECS UI-->Monitor-->Metering page.

Verify whether the size and object count reported by the end user match what ECS is reporting. If
not, there is a metering discrepancy, which is generally due to one of the following reasons:

• Incomplete MPU (multipart upload)
• High number of non-current object versions.
• Compression of the data at the chunk layer.

Please contact DELLEMC support to investigate the discrepancy and take the necessary action to
correct it.

RPO/Replication Related Scenarios:


❖ Scenario 16: ECS UI shows RPO not up to date

Things to check/do:
A few important things to verify when RPO is not up to date:
• If RPO is a few seconds/minutes, it may be that a large amount of data was recently ingested.
Wait for some time for the data to be copied, and check RPO again
• Below screenshot from ECS UI shows that RPO is NOT Up-to-date.

• If RPO doesn’t come down and continues to increase, verify the replication network bandwidth
between VDCs.
• Using svc_replicate summary, check whether tasks in the geo replication queue are moving. If any
node doesn’t show any activity, it may have a problem.
• Open an SR with Dell EMC Support if RPO continues to show lag.

UI Related Scenarios:
❖ Scenario 17: Customer logged in to UI and found a node offline. Also unable to ssh to the node in
question
Things to check/do:

• Check the System Event Log (SEL) for any CATERR (catastrophic error) or Processor IERR (Internal
Error). Run this command from a good node against the BMC IP/private IPMI of the
problematic node:

getrackinfo -v (run this to get the BMC/private IPMI IP of the node in question)

sudo ipmitool -I lanplus -H <BMC IP/private IPMI> -U root -P passwd sel elist
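
When the SEL is long, it can help to save the command output to a file and filter it. The Python sketch below keeps only the lines that mention CATERR or IERR. The function name is an assumption, and the sample lines are hand-written to illustrate the idea, not captured from a real node, so the exact field layout on your system may differ.

```python
SEL_KEYWORDS = ("CATERR", "IERR")


def critical_sel_entries(sel_output):
    """Return SEL lines that mention a catastrophic or internal CPU error."""
    return [
        line.strip()
        for line in sel_output.splitlines()
        if any(keyword in line.upper() for keyword in SEL_KEYWORDS)
    ]


# Hand-written sample in a pipe-separated layout (illustrative only).
sample = """\
 1 | 11/30/2020 | 10:00:01 | Processor #0x01 | IERR | Asserted
 2 | 11/30/2020 | 10:05:12 | Power Supply #0x02 | Presence detected | Asserted
 3 | 11/30/2020 | 10:06:40 | Processor CATERR | State Asserted | Asserted
"""
for entry in critical_sel_entries(sample):
    print(entry)
```

Any matching entry, together with its timestamp, is useful information to include when opening the SR.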

If the node cannot be brought back online, please open an SR with Dell EMC.

❖ Scenario 18: End user complaining, bucket utilization in UI is not reducing after deleting the objects.
Things to check/do:

As a first step, check whether the user data and system metadata GC processes are enabled. This can
be checked from the ECS UI.
• Log in to ECS UI--> Monitor--> Capacity Utilization--> Garbage Collection (tab)

The high-level delete workflow in ECS is described below.

The garbage reclaim rate can be checked using svc_gc rates reclaim.
Using svc_gc stats repo and svc_gc stats btree, check whether the reclaimable garbage is high (in TBs).

❖ Scenario 19: Customer is not able to ssh to the ECS node.


Things to check/do:

Ping the ECS nodes from your workstation and verify that network connectivity is fine.

If it is, verify that you can log in to the ECS UI and check whether any of the nodes are
reported as offline in the UI (Monitor--> System Health--> Offline Nodes).

If no nodes are reported as offline, navigate to Settings--> Platform Locking and verify whether
the nodes are locked. If the nodes are locked from the platform, you will not be able to ssh to the
ECS nodes.

If all the nodes are unlocked, verify whether you can ssh from another workstation.

❖ Scenario 20: Customer reported that ECS is not dialing home.


Things to check/do:

Log in to the ECS UI and, in the Settings tab, verify that the ESRS server is reported as connected.
If it is not, verify the network connectivity between ECS and the ESRS server. If it is showing as
connected, send a test dial home alert.

If the dial home alert is still not received, please contact the DELLEMC support team.

Object Lifecycle Related Scenarios:

❖ Scenario 21: Objects not expiring even after setting a life-cycle policy for a bucket.
Things to check/do:

Using svc_bucket info <bucket_name>, verify that the bucket policy is properly applied: that the
name of the policy that was set up matches what is applied to the bucket in question, and that the
expiry date in the bucket policy is correct.

If it is correct but the objects still have not expired, please contact DELLEMC support for further
investigation.

Additional Information

ECS Product Support: https://www.dell.com/support/home/en-us/product-support/product/ecs-appliance-


/docs

It includes Knowledge Base articles, manuals and documents.
