ECS - ECS Miscellaneous How To Service Procedures-ECS Troubleshooting Procedures


ECS ™ Procedure Generator

Solution for Validating your engagement

ECS Troubleshooting procedures

Topic
ECS Miscellaneous 'How To' Service Procedures
Selections
Choose Activity: ECS Troubleshooting Procedures

Generated: July 7, 2022 6:25 PM GMT

REPORT PROBLEMS

If you find any errors in this procedure or have comments regarding this application, send email to
[email protected]

Copyright © 2022 Dell Inc. or its subsidiaries. All Rights Reserved. Dell Technologies, Dell, EMC, Dell
EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries. Other trademarks may be
trademarks of their respective owners.

The information in this publication is provided “as is.” Dell Inc. makes no representations or warranties of
any kind with respect to the information in this publication, and specifically disclaims implied warranties of
merchantability or fitness for a particular purpose.

Use, copying, and distribution of any software described in this publication requires an applicable
software license.

This document may contain certain words that are not consistent with Dell's current language guidelines.
Dell plans to update the document over subsequent future releases to revise these words accordingly.

This document may contain language from third party content that is not under Dell's control and is not
consistent with Dell's current guidelines for Dell's own content. When such third party content is updated
by the relevant third parties, this document will be revised accordingly.

Publication Date: July, 2022

Dell Technologies Confidential Information version: 2.3.6.91

Contents
Preliminary Activity Tasks
Read, understand, and perform these tasks

ECS Troubleshooting Guide v1.12
Preliminary Activity Tasks
This section may contain tasks that you must complete before performing this procedure.

Read, understand, and perform these tasks


1. Table 1 lists tasks, cautions, warnings, notes, and/or knowledgebase (KB) solutions that you need to
be aware of before performing this activity. Read, understand, and when necessary perform any
tasks contained in this table and any tasks contained in any associated knowledgebase solution.

Table 1 List of cautions, warnings, notes, and/or KB solutions related to this activity

2. This is a link to the top trending service topics. These topics may or may not be related to this
activity; the link is a proactive attempt to make you aware of any KB articles that may be associated
with this product.

Note: There may not be any top trending service topics for this product at any given time.

ECS Top Service Topics

ECS Troubleshooting Guide v1.12

Note: The next section is an existing PDF document that is inserted into this procedure. You may see
two sets of page numbers because the existing PDF has its own page numbering. Page x of y on the
bottom will be the page number of the entire procedure.

Troubleshooting guide

ECS Troubleshooting Guide v1.12


Leveraging the UI and CLI

Abstract
This document will assist in basic troubleshooting steps for ECS. It will walk
through what to look for in the UI initially (not all inclusive), as well as some basic
CLI read-only commands. It also covers the Advanced Monitoring (Grafana) UI
that was introduced in ECS 3.4.0.0.

June 2021


Revisions
Date Description
July 2021 Updated document with more troubleshooting steps.

January 2021 Initial release.

Acknowledgments
Author: Dell Technologies


Copyright © 2021 Dell Inc. or its subsidiaries. All Rights Reserved. Dell Technologies, Dell, EMC, Dell EMC and other trademarks are trademarks of Dell
Inc. or its subsidiaries. Other trademarks may be trademarks of their respective owners. [7/9/2021] [Troubleshooting guide]


Table of contents
Revisions
Acknowledgments
Table of contents
Disclaimer
Summary
Pre-Requisites
Using the UI
1.1 View Current Alerts
1.2 Node and Disk Health
1.3 Capacity Utilization
1.4 Garbage Collection
1.5 Check Requests
1.6 Performance
1.7 Process Health
1.8 Recovery Status
1.9 RPO Status
1.10 Disk as a Customer Replaceable Unit
1.11 Advanced Monitoring (Grafana)
2 Using the CLI - Leveraging Service Tools
2.2 Check Directory Tables (DTs)
2.3 Service Restarts
2.4 Replication Status and TSO
2.5 Capacity
2.6 Space Reclamation/Garbage Collection
2.7 Networking
2.8 Alerts
2.9 Log collection
3 xDoctor
3.1 sudo xdoctor -h
3.2 Search for xDoctor rpm on Dell support site
3.3 Download latest version (direct link) (v68 as of Nov 2020):
3.4 Upgrade to the version in questions via:
3.5 How do I configure xDoctor to send xDoctor Reports to Customers via Email?
4 How to configure ECS to send required information to syslog
5 Monitoring ECS from SLACK
6 ECS Test Drive
7 ECS repository on GITHUB
7.1 ECSSync
7.2 Code Samples
7.3 Mongoose
7.4 Tools
8 Real World Examples
8.1 Performance Related Scenarios
8.2 Object Read/write Related Scenarios
8.3 Bucket Related Scenarios
8.4 Metering Related Scenarios
8.5 RPO/Replication Related Scenarios
8.6 UI Related Scenarios
8.7 Object Lifecycle Related Scenarios
8.8 Certificated Related Scenarios
8.9 DellEMC Data Domain/ECS Related Scenarios
8.10 Base url Related Scenarios
8.11 Erasure Coding Related Scenarios
9 Additional Information


Disclaimer
This is not a replacement for Dell EMC Customer Service and/or Engineering. Please open a support case
when experiencing any issues with ECS. Troubleshoot at your own risk.

Summary
This document will assist in basic troubleshooting steps for ECS. It will walk through what to look for in the UI
initially (not all inclusive), as well as some basic CLI read-only commands.

It also covers the Advanced Monitoring (Grafana) UI that was introduced in ECS 3.4.0.0.

Pre-Requisites
Login credentials for the ECS UI, and SSH access to the ECS nodes for CLI work.

Using the UI
Below is a list of what to look for when users report that there may be issues with ECS.

Note: Depending on the version of ECS, you can also launch an Advanced Monitoring (Grafana) UI (ECS
3.4.0.0 and above).

1.1 View Current Alerts


Click on Monitor | Events | Alerts tab.

In order to understand what each alert means, reference the latest Monitoring Guide.

Items to look for include node failure, disk failure, RPO lag time, and failover events.

Events

1.2 Node and Disk Health


Click on Monitor | System Health | Hardware Health | All Nodes and Disks (drill down)

Make sure the health of every node and disk is “Good”. Look for status keywords such as “Bad”,
“Missing”, “Removed”, or “Suspect”.


System Health

1.3 Capacity Utilization


Click on Monitor | Capacity Utilization

Check that ECS is not approaching its capacity thresholds (the system becomes read-only at 90%).

Drill down on the VDC to investigate capacity on individual nodes and disks, keeping in mind that the
load should be evenly distributed. Review the trend and forecast data as well.
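
The threshold check and a rough forecast can be sketched by hand as follows. This is a minimal illustration only: the function names, the capacity figures, and the constant daily growth rate are assumptions, not part of ECS, and the forecasting in the UI may use a different method.

```python
# Threshold quoted in the text above; everything else here is hypothetical.
READ_ONLY_THRESHOLD_PCT = 90.0


def used_percent(used_tb: float, total_tb: float) -> float:
    """Return used capacity as a percentage of total capacity."""
    if total_tb <= 0:
        raise ValueError("total_tb must be positive")
    return 100.0 * used_tb / total_tb


def days_until_threshold(used_tb, total_tb, daily_growth_tb,
                         threshold_pct=READ_ONLY_THRESHOLD_PCT):
    """Estimate days until the threshold is reached, assuming linear growth.

    Returns None when usage is flat or shrinking, and 0.0 when the
    threshold has already been reached.
    """
    if daily_growth_tb <= 0:
        return None
    remaining_tb = total_tb * threshold_pct / 100.0 - used_tb
    return max(0.0, remaining_tb / daily_growth_tb)


if __name__ == "__main__":
    print(used_percent(700, 1000))
    print(days_until_threshold(700, 1000, 10))
```

A linear estimate like this is only a first approximation; the trend graphs in the UI remain the reference.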

Capacity Utilization

1.4 Garbage Collection


If there is a concern about the capacity used on an ECS VDC, we can verify that Space Reclamation (SR)
is working properly.


The ECS UI has a section that reports useful details about GC/SR, shown below.

Capacity Utilization

User data GC is called Repo SR. System metadata SR is a combination of Btree SR and Journal SR.

If the capacity pending reclamation is high and the unreclaimable garbage is comparatively low, the first
step is to run svc_gc, a CLI tool (details in the CLI section below), for basic troubleshooting.

Unreclaimable garbage is garbage that has been detected in the system but is not eligible for reclamation.

ECS has a concept of partial SR. If two thirds of a chunk is garbage data (the default, which is
configurable), the chunk is eligible for reclamation (see the ECS Technical FAQ for information about
chunks). ECS internally moves the remaining one third of valid data to a new chunk and reclaims the
garbage chunk.

Chunks and garbage that do not meet this criterion are reported here as unreclaimable garbage.
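
The partial SR rule above can be illustrated with a short sketch. It is not ECS code: the function names and the chunk sizes are hypothetical, and treating a chunk at exactly two thirds garbage as eligible is an assumption.

```python
from fractions import Fraction

# Default from the text: a chunk qualifies once two thirds of it is garbage.
# The guide notes that this value is configurable.
DEFAULT_GARBAGE_THRESHOLD = Fraction(2, 3)


def is_reclaimable(garbage_bytes: int, chunk_bytes: int,
                   threshold: Fraction = DEFAULT_GARBAGE_THRESHOLD) -> bool:
    """Return True if the garbage share of the chunk meets the threshold."""
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    # Assumption: a chunk exactly at the threshold counts as eligible.
    return Fraction(garbage_bytes, chunk_bytes) >= threshold


def split_garbage(chunks):
    """Sum garbage into (reclaimable, unreclaimable) totals.

    `chunks` is an iterable of (garbage_bytes, chunk_bytes) pairs.
    """
    reclaimable = unreclaimable = 0
    for garbage_bytes, chunk_bytes in chunks:
        if is_reclaimable(garbage_bytes, chunk_bytes):
            reclaimable += garbage_bytes
        else:
            unreclaimable += garbage_bytes
    return reclaimable, unreclaimable


if __name__ == "__main__":
    # Hypothetical chunks: 75%, 50%, and exactly two thirds garbage.
    print(split_garbage([(90, 120), (60, 120), (80, 120)]))
```

Using `Fraction` avoids floating-point rounding at the exact two-thirds boundary.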

1.5 Check Requests


Click on Monitor | Transactions | Requests tab

This shows the response codes for the various head services (S3, CAS, etc.).

Understanding the HTTP code numbers for S3:

400-series = Application error (user failures)

500-series = System error (system failures)

Look for a high number of error codes. If 500 errors are consistently high, look at the Directory Table (DT)
status (discussed later in this document). If 400 errors are consistently high, work with the application
teams to check items such as permissions, certificates, and networking.
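
The triage rule above can be sketched as follows. This is an illustration under stated assumptions: the function names and the 1% "high" ratio are hypothetical, and the status codes would have to be collected separately (for example from the Requests tab).

```python
from collections import Counter


def classify_status(code: int) -> str:
    """Map an HTTP status code to the categories used in the text."""
    if 400 <= code <= 499:
        return "client_error"   # application/user failures
    if 500 <= code <= 599:
        return "server_error"   # system failures
    return "other"


def summarize(codes):
    """Count status codes per category."""
    return Counter(classify_status(code) for code in codes)


def suggest_next_step(codes, high_ratio=0.01):
    """Suggest a follow-up; high_ratio is an arbitrary illustrative cutoff."""
    codes = list(codes)
    if not codes:
        return "no data"
    summary = summarize(codes)
    total = len(codes)
    if summary["server_error"] / total >= high_ratio:
        return "check DT status"
    if summary["client_error"] / total >= high_ratio:
        return "review with application team"
    return "no action"


if __name__ == "__main__":
    sample = [200] * 98 + [500] * 2
    print(summarize(sample), suggest_next_step(sample))
```

What counts as "consistently high" depends on the environment, so the cutoff should come from a baseline rather than from this sketch.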


Transactions

1.6 Performance
Click on Monitor | Transactions | Performance tab

Review the latency, bandwidth, and TPS metrics provided. Look for prolonged spikes as well as any
sustained increases. If there are sustained increases, check DT status.

Also, drill down on the VDC to make sure the nodes are being utilized evenly and there isn’t a potential issue
with load balancing.

Transactions

1.7 Process Health


Click on Monitor | System Health | Process Health

This allows you to check the health and status of CPU, memory, and NIC performance.


System Health

Keep in mind that memory utilization typically runs relatively high, so a sustained high level is not
necessarily a cause for concern.

Drill down into each node to view the individual services, and look for any recurring spikes or high
utilization percentages. Some of the more critical services are blobsvc (data operations header), cm (chunk
manager), sr (space reclamation), and objcontrolsvc.

Also, when looking at each node, review the restart counts to make sure that no service is continuously
restarting.

1.8 Recovery Status


Click on Monitor | Recovery Status

Recovery is the process of rebuilding data after any local condition that results in bad data (for example,
bad chunks). It is good to ensure that there is not a significant backlog here.

Recovery Status


1.9 RPO Status


Click on Monitor | Geo Replication

Review pending replication and make sure RPO (Recovery Point Objective) is up-to-date or close to it. If
there is significant lag it could be indicative of an ECS and/or network issue that needs to be investigated.

Geo Replication

1.10 Disk as a Customer Replaceable Unit


Click on Manage | Maintenance

Starting with ECS 3.5, a new feature makes disks Customer Replaceable Units (CRUs). Before 3.5, disks
were Field Replaceable Units (FRUs), replaceable only by Dell EMC field personnel. This feature makes it
simple for end users to replace HDD and SSDr (read cache) disks themselves through the UI with one click
of a button.

Replacement drives are ordered automatically and shipped to the customer if the customer site has Remote
Services (SRS) configured. Supported hardware configurations: all Gen3 (EX300, EX500, EX3000) and
Gen2 U-Series.

Under Manage, a new Maintenance page is provided to manage the CRU feature, as shown in the
screenshot below.

Before the Failure

Maintenance


Maintenance

After Failure: ONE DISK SUSPECT, ONE FAILED

System Health

As shown below, the two disks are reported in yellow. Click on the yellow disk count icon to see the
status of disk recovery.

Maintenance


See the disk recovery status below.

Maintenance

After disk recovery completes automatically, the disks are ready to replace and alerts like the ones below
are raised.

Events

The disks now show as ready to replace in the Maintenance tab, as shown below.

Click on the Replace button and follow the on-screen instructions to complete the process. The LED of the
disk to be replaced is also lit for easy identification.

Maintenance

1.11 Advanced Monitoring (Grafana)


Click on the Advanced Monitoring section in the UI. It redirects to a new page with Grafana dashboards.


Dashboards

The list of available dashboards can be viewed by clicking the dashboard name at the top. Recently
accessed dashboards appear in the “Recent” folder. On ECS versions earlier than 3.5, the OE dashboards
are not visible by default. They need to be enabled using the SC command below:

service-console run Configure_Grafana_Dashboards --enable-oe-dashboards true --target-node 169.X.X.X

(where 169.X.X.X is the private.4 address of the ECS node). To disable the dashboards again, set the value
to “false” in the above command.


Dashboards

Note: To view “OE Dashboards”, you need to login using the “emcservice” account.

Note: GC/SR related dashboards are available from ECS 3.6 version onwards.

The dashboards provide an overview of system status in various areas. By default, most reports show data
for the last 24 hours. The range can be changed by clicking on the selected time range (“Last 24 Hours”)
here:


Quick Ranges

Some frequently accessed dashboards are discussed below:


1.11.1 Data Access Performance – Overview

Data Access Performance

This shows a summary of user requests in the selected time range (top right). For example, the above shows
that the system has:

1. Number of successful requests = 174,845,496
2. Failures due to server-side issues = 818
3. Failures due to client-side issues = 200
4. Failures % = 0.001

It also provides a summary of latency. For example, the above shows that:

1. Read requests have p50 = 10.86 ms
2. Read requests have p90 = 2.57 s

Note: The above value of p50 means that 50% of all read requests took less than 10.86 ms. A p90 value of
2.57 seconds means that 90% of read requests took less than 2.57 seconds. It does not mean that a typical
request took 2.57 seconds; it simply means that 90% of requests completed in less than 2.57 seconds.
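
The figures above can be reproduced with a short sketch. It assumes that the failure percentage is failures divided by all requests (successes plus failures) and uses the nearest-rank percentile definition; the dashboard may compute both differently, and the latency samples are made up for illustration.

```python
import math


def failure_percent(successes: int, server_failures: int, client_failures: int) -> float:
    """Failures as a percentage of all requests (assumed definition)."""
    failures = server_failures + client_failures
    total = successes + failures
    if total == 0:
        raise ValueError("no requests")
    return 100.0 * failures / total


def percentile(samples, p: float):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Request counts taken from the dashboard example in the text.
    print(round(failure_percent(174_845_496, 818, 200), 3))
    # Hypothetical read latencies in milliseconds.
    latencies_ms = [5, 7, 9, 10, 11, 12, 15, 20, 2500, 2600]
    print(percentile(latencies_ms, 50), percentile(latencies_ms, 90))
```

In the made-up sample the p90 is far above the p50 because a couple of slow requests dominate the tail, which is the same pattern as the 10.86 ms versus 2.57 s example.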

The graph is plotted using 5-minute intervals, so the bandwidth graph shows the read/write request size
for every 5 minutes. The legend also provides a summary of the max/min/avg values for the selected time
range.

In the same dashboard, we can drill down further by request type. For example, the views below categorize
successful/failed requests by method (GET/PUT/HEAD etc.), protocol (S3/CAS etc.), or error code
(500/404 etc.).


Successful Requests Drill Down

Successful requests/s by Protocol

This “Data Access Performance” dashboard is also available with namespaces, nodes and protocols category
(separate dashboard for each).


1.11.2 Process Health – Overview

Process Health Overview

This shows a summary of resource utilization at the VDC level in the selected time range (top right). For
example, the above shows that the system has:

1. Avg. CPU Utilization = 5.19%
2. Avg. Memory Usage = 43.39 GiB
3. Relative Memory = 70.66%

The graphs below these values show the trend of each metric over the selected time range.

The “Process Health” dashboard is also available with process and nodes category (separate dashboard for
each).

Like “Data Access Performance” and “Process Health”, there are dashboards available to monitor disks’
health as well: “Disk Bandwidth – Overview" and “Disk Bandwidth – By Nodes”.

1.11.3 Tech Refresh


From ECS version 3.5 onwards, a new feature allows end-of-life (EOL) hardware to be retired and its data
migrated to new ECS hardware. Please contact Professional Services or your account representative if you
have such a requirement.

This process can be monitored via the Grafana Dashboard.

1. In the ECS UI, go to the Grafana Dashboard: Go to Advanced Monitoring


2. Select Tech Refresh: Data Migration from the pulldown menu at the top of the page

Tech Refresh:Data Migration

The dashboard provides various migration details (see the screenshot below), such as the amount of data
migrated from the source nodes (the nodes to be retired) to the new nodes, the migration speed, and the
time to completion.


This Grafana dashboard is useful for basic troubleshooting, for example to see whether the data migration
is progressing and whether it is happening for all nodes.

E.g.

Tech Refresh:Data Migration

1.11.4 SSDr Read Cache


Metadata in ECS is stored in directory tables (DTs) as key-value pairs. Internally, ECS uses chunks to store
both data and metadata, and the chunks reside on hard drives. A metadata read therefore has to fetch from
hard drives, and because the chunks are distributed across every node in the cluster, it also incurs network
latency. From 3.5 onwards, a new feature called SSD read cache leverages SSD drives to improve
performance, mainly for metadata read/write.

New SSD drives need to be ordered to activate this feature. SSD read cache is supported only in a VDC
where all the nodes are one of the following hardware types:

• Gen3 EX-Series
• Gen2 U-Series

Please contact Professional services or your account representative if you have such a requirement.

Once the new SSD drives are inserted and feature is enabled, we can monitor critical parameters from
Grafana Dashboard.

1. In the ECS UI, go to the Grafana Dashboard: Go to Advanced Monitoring


2. Select SSD Read Cache

SSD Read Cache


Latency Numbers:

Latency

Disk Usage & Capacity:

Disk Usage

1.11.5 OE Dashboards
These dashboards are only available with the emcservice/emcmonitor account. They provide further insight
into ECS that can help in troubleshooting:


OE Dashboards

Some OE Dashboards are discussed below:

1.11.5.1 (OE) DT status


This dashboard provides the status of Directory Tables (DTs).

(OE) DT Status


DT Status per DT type

The first graph shows the status of all DTs over the selected time range. The second graph shows the type
and count of any unready/unknown DTs.

If a user/client complains of access or latency issues, the first thing to check is the DT status during that
time period, which can be found quickly using the above graphs. The output is similar to that of “svc_dt
check”, discussed in the CLI section later in the document. Unready DTs can also be caused by service
restarts. You can compare the DT unready times with the service restart times to see whether the events
are related.
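
The comparison of DT unready times with service restarts can be sketched as a simple time-window check. The function name, the five-minute margin, and the timestamps are hypothetical; in practice the times would be read off the two dashboards.

```python
from datetime import datetime, timedelta


def restarts_near(unready_start: datetime, unready_end: datetime,
                  restart_times, margin: timedelta = timedelta(minutes=5)):
    """Return the restarts that fall within the DT unready window.

    The window is widened by `margin` on both sides, since a restart
    slightly before the unready period may still be the trigger.
    """
    window_start = unready_start - margin
    window_end = unready_end + margin
    return [t for t in restart_times if window_start <= t <= window_end]


if __name__ == "__main__":
    # Hypothetical times read from the dashboards.
    unready_start = datetime(2021, 7, 1, 10, 0)
    unready_end = datetime(2021, 7, 1, 10, 10)
    restarts = [datetime(2021, 7, 1, 9, 57), datetime(2021, 7, 1, 11, 0)]
    print(restarts_near(unready_start, unready_end, restarts))
```

A match only shows that the events are close in time; it does not prove that the restart caused the unready DTs.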

The dashboard also shows DT distribution at node level:

DT Distribution

1.11.5.2 (OE) Service Restarts


As the name suggests, this dashboard provides an overview of service restarts happening on cluster.

(OE) Service Restarts


The above graph shows that there were a few service restarts in the selected time range. The legend below
the graph shows each service name and its total restart count in the time range. The hostnames on which
the restarts happened can also be found in the legend.

When troubleshooting any performance issue, always check first for service restarts and DT status during
that time period; the above graph helps with that.

At the top left, there are also dropdowns for hostnames and service names, which can be used to monitor
restarts on a specific host or for a specific service.

1.11.5.3 (OE) Processes on host


This dashboard shows the resource utilization by ECS services on each host:

(OE) Processes on Host

The above graph shows the memory and swap utilization of each service. It also shows the number of open
file descriptors (fds) held by each service.

In the same dashboard, a few other graphs show the thread count, CPU utilization, and disk IO of each
service in the selected time range.

1.11.5.4 (OE) Node system metrics


This is like “Processes on host” but it provides the resource utilization at node level rather than at process
level. We can monitor memory, swap, fds and disk space usage at node level:

(OE) Node System Metrics


2 Using the CLI - Leveraging Service Tools


The service tools are installed by default and can be run from any directory.

2.1.1 svc_version
The svc_version script can be run to check the version of ECS and of other components, as shown in the
screenshot below.

svc_version


2.1.1.1 svc_version -h

svc_version -h

2.1.2 KPI Script


The KPI script provides metrics for key performance indicators within ECS, such as number of requests,
latency, and MB/s, among others. It has various options that can be set to produce different outputs. Every
environment is different, so it is important to run these commands frequently in order to establish a
baseline of normal behavior.
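
One simple way to turn repeated runs into a baseline is sketched below. This is an illustration only: it assumes the latency values have already been copied out of the script output by hand, and the function names and the three-standard-deviation rule are arbitrary choices, not something the KPI script provides.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Return (average, standard deviation) of previously observed values."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to build a baseline")
    return mean(samples), stdev(samples)


def is_abnormal(value, baseline, k: float = 3.0) -> bool:
    """Flag a value more than k standard deviations above the average."""
    average, deviation = baseline
    return value > average + k * deviation


if __name__ == "__main__":
    # Hypothetical overall request latencies (ms) noted from earlier runs.
    history_ms = [10, 12, 11, 9, 13]
    baseline = build_baseline(history_ms)
    print(baseline, is_abnormal(30, baseline), is_abnormal(12, baseline))
```

The more runs the baseline covers (different times of day, different workloads), the more meaningful the comparison becomes.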

The output of this command includes the following. Look for high latency or poor performance that may
be having an impact:

• Overall Request Latency (ms)
• Request Latency Distribution (number of requests in each range)
• Request Sizes
• GET Latency Distribution (per request size)
• PUT Latency Distribution (per request size)
• Rate Statistics (per node)
• GET Extended stats (per request size)
• PUT Extended stats (per request size)
• Ingest Statistics (per node)

Typically, during initial troubleshooting, the option -min is set to “x minutes ago” or n for “x hours ago”.
Another common option is -s, which gives a shortened summary output shown below.

View these by running help:

2.1.2.1 kpi.sh -h

kpi.sh -h

Without specifying any options, the default output is based on the past 60 minutes and displays the long form
output.


2.1.2.2 kpi.sh -s -min 30

kpi.sh -s -min 30

Here it is important to check that the load is balanced across nodes and whether there is a large number
of 500 errors. Typically, a DT issue will impact all nodes.

The command below combines a few options to gather errors (403 errors, for example) over a specific time
period, here the past two days:

2.1.2.3 kpi.sh -s -start '2 days ago' -end 'now' -errs

kpi.sh -s -start '2 days ago' -end 'now' -errs


You can also run this command against a particular bucket if you know what the application is using.

2.1.2.4 kpi.sh -s

kpi.sh -s

2.1.2.5 kpi.sh -s -cas (applicable for customers that use CAS)

kpi.sh -s -cas

2.1.2.6 ECS CAS error codes


http://doc.isilon.com/ECS/3.5/DataAccessGuide/GUID-E6C318F6-E2FB-438E-AF96-016EC52D9048.html?hl=ecs%2Ccas%2Cerror%2Ccodes

Value | Error Name | Description
10020 | FP_NO_POOL_ERR | It was not possible to establish a connection with a cluster. The server could not be located. This means that none of the IP addresses could be used to open a connection to the server or that no cluster could be found that has the required capability. Verify your LAN connections, server settings, and try again.
10021 | FP_CLIP_NOT_FOUND_ERR | If FPClip_Open() returns this error, it means the CDF could not be found on the server. Verify that the original data was correctly stored and try again.
10036 | FP_BLOBIDMISMATCH_ERR | The blob is corrupt: a BlobID mismatch occurred between the client and server. The Content Address calculation on the client and the server has returned different results. The blob is corrupt. If FPClip_Open() returns this error, it means the blob data or metadata of the C-Clip is corrupt and cannot be decoded.
10101 | FP_SOCKET_ERR | An error on the network socket occurred. Verify the network.
10153 | FP_AUTHENTICATION_FAILED_ERR | Authentication to get access to the server failed. Check the profile name and secret.
10201 | FP_OPERATION_REQUIRES_MARK | The application requires marker support, but the stream does not provide that.
10204 | FP_OPERATION_NOT_ALLOWED | The use of this operation is restricted, or this operation is not allowed because the server capability is false.

2.1.2.7 ECS S3 error codes


http://doc.isilon.com/ECS/3.2/DataAccessGuide/ecs_r_s3_error_codes.html

Error Code | HTTP Status Code | Generic Error Code | Description
AccessDenied | 403 | AccessDenied | Access Denied
BadDigest | 400 | BadDigest | The Content-MD5 you specified did not match that received.
BucketAlreadyExists | 409 | BucketAlreadyExists | The requested bucket name is not available. The bucket namespace is shared by all users of the system. Please select a different name and try again.
BucketNotEmpty | 409 | BucketNotEmpty | The bucket you tried to delete is not empty.
ContentMD5Empty | 400 | InvalidDigest | The Content-MD5 you specified was invalid.
ContentMD5Missing | 400 | InvalidRequest | The required Content-MD5 header for this request is missing.
EntityTooSmall | 400 | EntityTooSmall | The proposed upload is smaller than the minimum allowed object size.
EntityTooLarge | 400 | EntityTooLarge | The proposed upload exceeds the maximum allowed object size.

2.1.3 SVC_REQUEST
svc_request -h

svc_request -h


2.1.3.1 svc_request -s 404 error summary


We can filter the errors based on the type of HTTP request with the -t option in the above command. The
example below shows the error summary for the different types of HTTP operations that returned a 404 error.

svc_request -s 404 error summary

2.1.3.2 svc_request -on 465abb83-5804-4fb5u97ee-0c5f0a9b9395 summary


If we know the name of the object in question, we can search for the transactions related to that object
with the -on (object name) option (see below). This object was uploaded using multipart upload, so we can see all
the transaction details for this object.

svc_request -on 465abb83-5804-4fb5u97ee-0c5f0a9b9395 summary

2.2 Check Directory Tables (DTs)


One of the most common items to check when experiencing issues on ECS is the status of the Directory Tables (DTs).

ECS stores the metadata of important artifacts such as buckets, namespaces, and objects in the form of "Directory
Tables" (DTs). The DTs play a role comparable to that of a database in a traditional application. There are
several Directory Tables in ECS, each storing a specific type of data:


• OB - Object table. The object related information.


• LS - Listing table. The listing entry related information. For example, all keys under one bucket have one
entry in the LS table. S3 bucket listing requests will go to the LS table.
• CT - Chunk table.
• BR - Btree Reference table.
• SS - Storage Space table. Maintains the disk block usage (allocation/free) information.
• PR - Partition Record table. Stores the DT record information.
• RT - Resource table. This is a special system table for the system related information, such as replication
group, namespace, and bucket.
• ET - Event table. This table is used to store system events like AUDITs and ALERTs.
• MA - Metering Aggregate table. Saves the aggregated metering information.
• MR - Metering Raw table. Saves the raw metering information that is later aggregated in MA table.
• RR - Repo chunk reference table. Contains Repo chunk (Object chunk) reference information.

2.2.1 svc_dt -h

svc_dt -h

Look to see if any DTs are unknown or in an unready state. If you see any that are down or have not been
checked recently, open a case and inform support. Occasionally a DT may be unready for a short time;
however, if the condition is sustained over multiple checks, open a case and inform support. Note that eight or
more unknown or unready DTs trigger an alert, which is sent to Dell EMC.
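
The decision rule in the paragraph above can be written down as a small sketch. It assumes the number of unknown or unready DTs has been noted by hand after each manual check; the function name and the returned labels are hypothetical, and only the threshold of eight comes from the text.

```python
ALERT_THRESHOLD = 8  # from the text: eight or more unknown/unready DTs raise an alert

def assess_dt_checks(unready_counts):
    """Classify a series of DT checks, oldest first.

    unready_counts holds the number of unknown or unready DTs seen in each check.
    """
    if not unready_counts:
        return "no data"
    latest = unready_counts[-1]
    if latest >= ALERT_THRESHOLD:
        return "alert threshold reached"
    # Unready DTs in every one of several consecutive checks count as sustained.
    if len(unready_counts) > 1 and all(count > 0 for count in unready_counts):
        return "sustained: open a case"
    if latest > 0:
        return "transient: check again"
    return "healthy"

print(assess_dt_checks([1, 2, 1]))
```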

2.2.2 svc_dt check


The svc_dt check -f option can be used to query the DT status manually. As you can see below, without the -f
option the status from the automatic DT check is reported. Note the timestamp on the left: the latest
DT status is reported at the top.


svc_dt check

Another useful DT command is “svc_dt dist” which shows how balanced the DTs are across all the nodes in
the VDC (ECS cluster). Note that the output should be “well” balanced based on the number of nodes in your
VDC. A node with very low or no DTs assigned is an indication of a node problem.
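
As a rough illustration of what "well balanced" means here, the sketch below compares each node's DT count with an even split. The per-node counts are hypothetical values typed in by hand, and the 50% cut-off is an assumption chosen for the example, not an ECS rule.

```python
def underloaded_nodes(dt_counts, min_share=0.5):
    """Return the nodes holding fewer DTs than min_share of an even split."""
    if not dt_counts:
        return []
    expected = sum(dt_counts.values()) / len(dt_counts)
    return sorted(node for node, count in dt_counts.items()
                  if count < expected * min_share)

# Hypothetical distribution over five nodes; node4 owns no DTs at all.
counts = {"node1": 260, "node2": 250, "node3": 255, "node4": 0, "node5": 259}
print(underloaded_nodes(counts))
```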

2.2.3 svc_dt dist (use -f for a real-time check)

svc_dt dist

2.3 Service Restarts


Continuous service restarts can also have an impact to ECS health, so it is important to see if any are getting
restarted. Some of the ones to focus on include the following:

• blobsvc – Manages the following tables: Object (OB), Listing (LS), and Repo Chunk Reference (RR)
• cm - Manages the following tables: Chunk (CT), Btree Reference (BR). Provides the logic to handle
various events based on the chunk's current state and decide which state to transition to next.
• objcontrolsvc - Provides REST APIs for configuring the ECS cluster, managing ECS resources, and
monitoring the system.


• vnest - Provides distributed synchronization and group services. A subset of data nodes will be group
members responsible for serving the key/value requests. VNest services running on other nodes will
listen for configuration updates and be ready to be added to the group.

2.3.1 svc_node -h

svc_node -h

If these are getting restarted repeatedly, then there will be an impact to I/O and an SR should be opened.


2.3.2 svc_node services -show restarts

svc_node services -show restarts

2.4 Replication Status and TSO


For statistics (outside the UI) that help determine potential issues with replication, there are a couple of commands
that provide additional detail. Typically, these commands are used after an RPO alert has been triggered and the
UI shows that something may be stuck. Common issues that affect replication are WAN outages or WAN
saturation.

2.4.1 svc_replicate -h

svc_replicate -h

When running a summary, it is important to look at current rates (per node), TSO, and what is pending
(typically, pending chunks never reach zero since there is always something replicating).


2.4.2 svc_replicate summary

svc_replicate summary

In order to check the current Temporary Site Outage (TSO) state, the command below and its options provide
insight to the TSO state along with heartbeat and task status:


2.4.3 svc_tso -h

svc_tso -h

2.4.4 svc_tso summary

svc_tso summary


2.5 Capacity

2.5.1 svc_vdc capacity

svc_vdc capacity

2.5.2 svc_vdc trend

svc_vdc trend

2.6 Space Reclamation/Garbage Collection


To check further detail around garbage collection statistics (outside the UI), there is a command that can
provide a break down. This includes numbers around the two different types of garbage collection that run on
the ECS, repo (user data) and btree (metadata/index).

2.6.1 svc_gc -h

svc_gc -h

If there is concern about the rate at which deleted data is reclaimed, the rates reclaim option below will
display the daily reclaim rate for repo and btree data. For example, if your applications are deleting 1TB per
day and the reclaim rate is only 1GB per day, open an SR to investigate further.
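
The comparison in that example can be made explicit with a short sketch. The two daily figures are assumed to be read manually from application statistics and from the reclaim-rate output; the function names and the 0.5 cut-off are hypothetical and only illustrate the idea.

```python
def reclaim_ratio(deleted_gb_per_day, reclaimed_gb_per_day):
    """Return reclaimed/deleted; values well below 1.0 mean garbage is piling up."""
    if deleted_gb_per_day <= 0:
        raise ValueError("deleted_gb_per_day must be positive")
    return reclaimed_gb_per_day / deleted_gb_per_day

def needs_investigation(deleted_gb_per_day, reclaimed_gb_per_day, min_ratio=0.5):
    """True when reclaim keeps up with less than min_ratio of the deletions."""
    return reclaim_ratio(deleted_gb_per_day, reclaimed_gb_per_day) < min_ratio

# The example from the text: 1 TB (taken here as 1024 GB) deleted, 1 GB reclaimed per day.
print(needs_investigation(1024, 1))
```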


2.6.2 svc_gc rates reclaim

svc_gc rates reclaim

When looking at repo, the stats command provides two sections of output.

The first covers statistics broken down by full and partial garbage in terms of capacity. The second does the
same in terms of chunks.

Keep in mind that full garbage is when an entire chunk (128MB in size) is marked for 100% deletion. Partial
garbage is when less than 100% of a chunk is marked for deletion. For example, you can have a chunk
that is 1/4 marked for deletion or 1/2 marked for deletion.

Furthermore, there are two types of partial garbage referred to as eligible and ineligible. Partial eligible is
when a chunk has been marked for at least 2/3 deletion. In this case, ECS will take the remaining 1/3 and
move it to another chunk which frees up 100% of the original chunk. Partial ineligible is when the chunk is
marked for less than 2/3rds deletion, in which case it will remain on the system until it meets the defined
threshold.
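
The garbage categories described above can be summarized in a few lines. This is only a sketch of the classification as the text states it; the function name and labels are hypothetical, and the 2/3 threshold is the one given in the text rather than a value read from an ECS system.

```python
from fractions import Fraction

ELIGIBLE_THRESHOLD = Fraction(2, 3)  # from the text: at least 2/3 marked for deletion

def classify_chunk(garbage_fraction):
    """Classify a chunk by the fraction of it that is marked for deletion."""
    fraction = Fraction(garbage_fraction)
    if not 0 <= fraction <= 1:
        raise ValueError("garbage fraction must be between 0 and 1")
    if fraction == 1:
        return "full"
    if fraction >= ELIGIBLE_THRESHOLD:
        return "partial eligible"    # remaining data is moved, chunk is freed
    if fraction > 0:
        return "partial ineligible"  # stays until it reaches the threshold
    return "no garbage"

for value in (Fraction(1, 4), Fraction(1, 2), Fraction(2, 3), 1):
    print(value, "->", classify_chunk(value))
```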

It is important to notice whether you have a large amount of garbage stuck in reclaim (especially if it continues to
increase rather than decrease). This information will help support understand whether something may be stuck or
whether various parameters should be changed.

2.6.3 svc_gc stats repo

svc_gc stats repo

Stats can also be run for btree:


2.6.4 svc_gc stats btree

svc_gc stats btree

2.7 Networking
Although there are some networking statistics in the UI, failures are not one of them. However, there are
various statistics that can be pulled using the CLI (xDoctor also has alerts).

2.7.1 svc_network -h

svc_network -h

To check if a NIC is down or unavailable, run the following command. The screen shot below is for one node,
but all nodes are displayed when running the command.


2.7.2 svc_network show int

svc_network show int

2.7.3 svc_network show int

svc_network show int

2.7.4 svc_network check all


This is the network check within the local VDC from which svc_network was triggered. In the subsequent screenshot,
you can see the network connectivity status between the VDCs.

svc_network check all


svc_network check all

2.7.5 svc_network summary

svc_network summary

2.8 Alerts
The UI has an Alerts tab, but managing alerts from the UI can be cumbersome, so there is a CLI tool that can
mail some of these alerts to users.


2.8.1 svc_alert
Tool to display, filter, clear, and send email notifications for system alerts.

Update xDoctor (xDr) to version 4.8-74 or above to get the latest changes and enhancements to this tool, which
are discussed below.

If you are using this tool for the first time, you need to update the alert_conf.json file. A skeleton
alert_conf.json file is created with the command below (see the screenshot).

2.8.2 svc_alerts mail --init

svc_alerts mail --init

The alert_conf.json file needs to be updated with the recipient mail ID, alert type and severity, SMTP sender mail ID,
and IP address.

E.g.


2.8.3 svc_alerts list


It lists all alerts from the last month (see below).

svc_alerts list

2.8.4 svc_alerts summary


It prints a summary of the alerts generated in the last month.

svc_alerts summary

2.8.5 svc_alerts mail


This option queries the alerts generated in the last hour and emails them to the recipients configured
in the alert_conf.json file discussed previously. To query the alerts for a longer
period, use the option shown in the screenshot below.

svc_alerts mail


The sample alert mail will look like below:

Sample alert mail

2.8.6 svc_alerts mail_kpi


This option sends the KPI details (number of PUT/GET/POST/DELETE/HEAD requests) and any
HTTP errors (500, 404, 400, 403, etc.) reported for the queried time. By default, the query covers the last hour.
The start and stop times can be specified as shown below.

svc_alerts mail_kpi


The sample alert mail will look like below

Sample alert mail

2.9 Log collection


Since the logs in ECS rotate out, it is important to collect and preserve the logs for the time when an incident
occurred, as they help in determining the root cause of the problem. The svc_collect tool collects and
preserves logs, configuration, and the output of various commands from the ECS system.

Log collection in a distributed system like ECS is a very tedious task. svc_collect makes it easier to collect
the logs for a specific duration and store them in a compressed format.

2.9.1 Best Practices:


• Keep the log collection duration as short as possible and specify the time period of the issue when collecting.
• Use an external system for uncompressing the log files.
• Choose one of the last non-vnest nodes, preferably one with more free root disk space.

2.9.2 svc_collect collect -h


We can specify the start and stop times for which to collect the logs and configs; by default, the tool collects
them for the last 20 minutes. We can use the sr (services) option to get the logs for a specific service, such as
dataheadsvc.


If the requirement is to collect only the logs and not the configs and command output, the -nocfg and -nocmd options can
be used.

This tool needs to be run on a non-vnest member. Run svc_vnest members to identify the non-vnest
member nodes on which to run the tool.

The logs that match the criteria are gathered from all the nodes, zipped, and stored under /tmp/.

Filename: /tmp/svc_collect-SystemTest-20210526_090935.ziz in the screenshot below.

/tmp/svc_collect-SystemTest-20210526_090935.ziz


3 xDoctor
xDoctor is a tool used by Dell Customer Support to monitor, report on, and troubleshoot the health of your
ECS Appliance. Keeping xDoctor updated to the most current version enables Dell EMC Customer Support
to more quickly detect and resolve issues with your ECS Appliance.

The latest version is always available using the "xdoctor --upgrade --auto --now" option if the customer's ECS
system can establish a connection to ftp.emc.com. If not, the latest version can be downloaded via
dell.com/support (ECS Appliance / Drivers & Downloads / Category=Product Tool).

3.1 sudo xdoctor -h

sudo xdoctor -h

3.2 Search for xDoctor rpm on Dell support site


https://www.dell.com/support/home/en-us/product-support/product/ecs-appliance-software-with-encryption/drivers


Find a download for your ECS Appliance Software with Encryption

3.3 Download latest version (direct link) (v68 as of Nov 2020):


https://dl.dell.com/downloads/DL97688_xDoctor4ECS-4.8-68.rpm

Contact Dell Customer Service if you cannot access the above link.

3.4 Upgrade to the version in question via:


sudo xdoctor --upgrade --local=/home/admin/xDoctor4ECS-4.8-68.noarch.rpm

sudo xdoctor --upgrade --local=/home/admin/xDoctor4ECS-4.8-68.noarch.rpm

3.4.1 sudo xdoctor -s (this checks the version)

sudo xdoctor -s


3.4.2 sudo xdoctor (this runs a standard health check on the rack in question)

sudo xdoctor

sudo xdoctor


3.4.3 sudo <Session Report> -CEW (this prints the Critical/Error/Warning messages
of the report in question)

sudo <Session Report> -CEW

3.5 How do I configure xDoctor to send xDoctor Reports to Customers via Email?
Please follow the steps below.

admin@provo-yellow:~> sudo xdoctor --config
┌────────────────────────────┐
│ xDoctor Configuration Menu │
└───┬────────────────────────┘
┌───┼──────────┐
│ 1 │ Overview │
└───┼──────────┘
┌───┼────────────────────┐
│ 2 │ Reports and Events │
└───┼────────────────────┘
┌───┼─────────────┐
│ 3 │ Auto Update │
└───┼─────────────┘
┌───┼────────────────┐
│ 4 │ Data Scrubbing │
└───┼────────────────┘
┌───┼─────────────────────┐
│ 5 │ ECS API Credentials │
└───┼─────────────────────┘
┌───┼───────────────┐
│ 6 │ IPMI Analysis │
└───┼───────────────┘
┌───┼──────┐
│ 0 │ Exit │
└───┴──────┘

Please make a choice: 2

┌────────────────────────────┐
│ xDoctor Reports and Events │
└───┬────────────────────────┘
┌───┼───────────────────────────────┐
│ 1 │ Reports and Events to DellEMC │
└───┼───────────────────────────────┘
    │ Status = Enabled
    │ Channel = SMTP via SRS
    └┬─
     │ SRS 1 ID = e7ec9fbb-d0ae-4e09-a192-06b9aa8ce2d8
     │ SRS 1 Host = IP_ADDRESS
     │ SRS 1 Port = 9443
     │ SRS 1 State = CONNECTED
     │ SRS 1 Msg = Communication with srs succeeds
     │ SRS 1 S/N = SERIAL_NUMBER
┌───┬┴───────────────────┐
│ 2 │ Events to Customer │
└───┼────────────────────┘
    │ Status = Disabled
    └┐
┌───┬┴──────────┐
│ 0 │ Main Menu │
└───┴───────────┘

Please make a choice: 2

Send xDoctor Events to Customer? [No]: Yes
Email Recipient (single) []: [email protected] <- Enter customer's email address or mailing list here
Add another Recipient? [No]:
Recipient (1): [email protected]
Dedicated SMTP Server [Server_name or IP_address:port] []: earth.sol.galaxy:25 <- Enter customer's SMTP server here
Email From [[email protected]]:
Enable TLS? [No]:
Do you want to use a fixed subject? [No]:
Do you want to use a subject prefix? [No]:
Do you want to use a subject suffix? [No]:

Send xDoctor Events to Customer = Yes
|- Recipients = [email protected]
|- SMTP Server = earth.sol.galaxy:25
|- TLS = False
|- Email From = [email protected]

> Issue new Settings? [No]: Yes


• Contact Dell EMC Customer Service (i.e., create an SR) for any “Critical” or “Error” messages
that cannot be resolved or that require more in-depth investigation. “Warning” messages do not typically
need any attention.
• xDoctor Release Notes (version 68):
• https://dl.dell.com/content/docu97687_xDoctor_ReleaseNotes_4.8-68.pdf?language=en_US


4 How to configure ECS to send required information to syslog


Overview:

ECS Syslog (as a fabric application container) supports forwarding of alerts and audit messages to one or
more remote syslog servers.

Alerts and audit messages are sent from the system (host OS), Agent, Lifecycle, Registry, Zookeeper, and Object
services to the fabric-syslog container via a UDP socket (9514).

The rsyslog server must be configured to forward messages to the predefined localhost port (UDP 9514). No
extra configuration step is required for ECS OS (ECS appliance, ECS certified SD). For ECS custom SD
(DIY), the customer is responsible for configuring a syslog service on the node.

Customers are responsible for configuring their syslog servers to receive alerts from ECS. Refer to the
customer-viewable KB in the References section below, which has a sample setup from one of the internal Dell
EMC labs (Article Number: 000012004).

This document shows how to use different configurations on ECS to send either the complete log file, part
of the log file based on some condition, or a summary produced by a script.

Script can be downloaded from the below link.

https://object.ecstestdrive.com/ecstsguide/checkResponseCode.pl?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=132657591476211228%40ecstestdrive.emc.com%2F20210630%2FNone%2Fs3%2Faws4_request&X-Amz-Date=20210630T065024Z&X-Amz-Expires=99999&X-Amz-SignedHeaders=host&X-Amz-Signature=473865e2d853d04420e47d5a81953031d2b1609cf946571135de212996ea2bac

At a high level it involves 6 steps:

1. Get syslog details

a. Get SYSLOG Server IP/Port from config-syslog.conf OR
b. Get SYSLOG Server IP/Port/protocol from fabric

2. Prepare the configuration file.

a. Send complete dataheadsvc-access.log file
b. Send just 500 errors from dataheadsvc-access.log file
c. Forward only “Connection Limit(1000) reached” from dataheadsvc.log file
d. Send summary of response code (200,403,404,500,503) every xx minutes

3. Distribute the configuration file to all nodes.
4. Restart rsyslog on all nodes for the changes to take effect.
5. Update the MOTD on all nodes to include the config files (this is needed to restore the config files when a
node replacement (NR) is performed).
6. Verify that the syslog server is receiving the logs/summary.

IMPORTANT NOTE: When a NODE REPLACEMENT is performed, review the MOTD and copy these files
back in place after the node replacement procedure.


Detailed steps:

1. Get syslog details

a. Get SYSLOG Server IP/Port from config-syslog.conf

Command# cat /opt/emc/caspian/fabric/agent/services/fabric/syslog/host/files/config-syslog.conf

Example:
admin@orem-malachite:~> cat /opt/emc/caspian/fabric/agent/services/fabric/syslog/host/files/config-syslog.conf
$ModLoad imudp
$UDPServerAddress 127.0.0.1
$UDPServerRun 9514

*.info @10.247.200.80:514
*.info @10.247.200.85:514
admin@orem-malachite:~>

b. Get SYSLOG Server IP/Port/protocol from fabric

Command# /opt/emc/caspian/fabric/cli/bin/fcli lifecycle alert.getremotesyslogserverslist

Example:

admin@orem-malachite:~> /opt/emc/caspian/fabric/cli/bin/fcli lifecycle alert.getremotesyslogserverslist
{
  "remote_syslog_server_map" : [
    {
      "id" : "4fd10e15-9ce9-4dbe-a7a4-0e78ddbad557",
      "remote_syslog_server" : {
        "protocol" : "UDP",
        "port" : 514,
        "severity" : "info",
        "server" : "10.247.200.80"
      }
    },
    {
      "id" : "9e8f6ad3-167f-4c73-8004-be485992caec",
      "remote_syslog_server" : {
        "protocol" : "UDP",
        "port" : 514,
        "severity" : "info",
        "server" : "10.247.200.85"
      }
    }
  ],
  "status" : "OK",
  "etag" : 5410
}
admin@orem-malachite:~>
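
If the fcli output needs to be processed rather than read by eye, a minimal Python sketch such as the following could extract the target servers. It assumes the JSON portion of the output has been saved to a string exactly as shown in the example above; the function name is hypothetical, and the sample is shortened to one server.

```python
import json

def list_syslog_servers(fcli_output):
    """Return (server, port, protocol, severity) tuples from the fcli JSON output."""
    data = json.loads(fcli_output)
    servers = []
    for entry in data.get("remote_syslog_server_map", []):
        cfg = entry["remote_syslog_server"]
        servers.append((cfg["server"], cfg["port"], cfg["protocol"], cfg["severity"]))
    return servers

# Shortened copy of the example output above.
sample_output = """
{
  "remote_syslog_server_map" : [
    {
      "id" : "4fd10e15-9ce9-4dbe-a7a4-0e78ddbad557",
      "remote_syslog_server" : {
        "protocol" : "UDP",
        "port" : 514,
        "severity" : "info",
        "server" : "10.247.200.80"
      }
    }
  ],
  "status" : "OK",
  "etag" : 5410
}
"""
print(list_syslog_servers(sample_output))
```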

2. Prepare the configuration file.

a. Send complete dataheadsvc-access.log file (i.e., all response codes, including 200 OK). This can also
be achieved using svc_request (refer to Article Number: 000020726 in the References section).

admin@orem-malachite:~> cat /etc/rsyslog.d/push-dataheadsvc-access-log.conf
#$DebugFile /home/admin/rsyslog.debug
#$DebugLevel 2

module(load="imfile" mode="polling" PollingInterval="10")

ruleset(name="ecss3accesslogs") {
    action(type="omfwd" Target="10.247.200.80" Port="514" Protocol="udp")
    stop
}
input(type="imfile" ruleset="ecss3accesslogs"
      File="/opt/emc/caspian/fabric/agent/services/object/main/log/dataheadsvc-access.log"
      Tag="ecss3"
      Severity="info"
      Facility="local7")
admin@orem-malachite:~>

b. Send just 500 errors from dataheadsvc-access.log file

admin@provo-malachite:~> cat /etc/rsyslog.d/report-500.conf
#$DebugFile /home/admin/rsyslog.debug
#$DebugLevel 2

module(load="imfile" PollingInterval="5") #needs to be done just once

ruleset(name="ecsaccesslogs500errors") {
    # This example matches 404 for testing purposes; match "HTTP/1.1 500" to forward 500 errors.
    if ( $msg contains "HTTP/1.1 404" ) then {
        action(type="omfwd" Target="10.247.200.80" Port="514" Protocol="udp")
        stop
    }
}
input(type="imfile" ruleset="ecsaccesslogs500errors"
      File="/opt/emc/caspian/fabric/agent/services/object/main/log/dataheadsvc-access.log"
      Tag="ecs"
      Severity="info"
      Facility="local7"
      StateFile="ecs500tosyslog")
admin@provo-malachite:~>

c. Forward only “Connection Limit(1000) reached” from dataheadsvc.log file

admin@sandy-malachite:~> cat /etc/rsyslog.d/report-connlimitreached.conf
#$DebugFile /home/admin/rsyslog.debug
#$DebugLevel 2

module(load="imfile" PollingInterval="5") #needs to be done just once

ruleset(name="ecsconnectionlimit") {
    if ( $msg contains "Connection Limit(1000) reached" ) then {
        action(type="omfwd" Target="10.247.200.80" Port="514" Protocol="udp")
        stop
    }
}
input(type="imfile" ruleset="ecsconnectionlimit"
      File="/opt/emc/caspian/fabric/agent/services/object/main/log/dataheadsvc.log"
      Tag="ecs"
      Severity="info"
      Facility="local7"
      StateFile="ecsconnlimittosyslog")
admin@sandy-malachite:~>

d. Send summary of response code (200,403,404,500,503) every xx minutes

admin@sandy-malachite:~> cat /etc/rsyslog.d/report-errorcode-summary.conf
#$DebugFile /home/admin/rsyslog.debug
#$DebugLevel 2

module(load="imfile" PollingInterval="1") #needs to be done just once

ruleset(name="ecserrorcode_summary") {
    action(type="omfwd" Target="10.247.200.80" Port="514" Protocol="udp")
    stop
}
input(type="imfile" ruleset="ecserrorcode_summary"
      File="/opt/emc/caspian/fabric/agent/services/object/main/log/dh_responsecode_summary.15minout"
      Tag="ecs"
      Severity="info"
      Facility="local7"
      StateFile="ecserrorcodesummary")
admin@sandy-malachite:~>

The file dh_responsecode_summary.15minout is generated by a script.

3. Distribute the configuration file to all nodes.

Command# viprscp -f ~/MACHINES /etc/rsyslog.d/push-dataheadsvc-access-log.conf /etc/rsyslog.d/push-dataheadsvc-access-log.conf

Note: Please ensure ~/MACHINES has all node private.4 IPs from all Racks of the VDC.


4. Restart rsyslog on all nodes for the changes to take effect.

Command# viprexec -f ~/MACHINES -i "systemctl restart rsyslog; sleep 5; systemctl status rsyslog | grep Active; sleep 60"

Note: Please ensure ~/MACHINES has all node private.4 IPs from all Racks of the VDC.

5. Update the MOTD on all nodes to include the config files and how to restore them (this is needed to restore
the config files when an NR is performed).

IMPORTANT NOTE: When a NODE REPLACEMENT is performed, review the MOTD and copy these files
back in place after the node replacement procedure.

admin@orem-malachite:~> viprexec -i 'echo -e "\nCustom /etc/rsyslog.d/xxx.conf is configured to send log to Customer syslog server.\nPlease restore and restart rsyslog in case of NR(Node Replacement)\nReference SR#/JIRA#\n" >> /etc/motd';
admin@orem-malachite:~> viprexec -i 'cat /etc/motd'
*
Output from host : 192.168.219.4
Custom /etc/rsyslog.d/xxx.conf is configured to send log to Customer syslog server.
Please restore and restart rsyslog in case of NR(Node Replacement)
Reference SR#/JIRA#
admin@orem-malachite:~>

6. Verify that the syslog server is receiving the logs/summary.

a. Monitor complete dataheadsvc-access.log file (i.e., all response codes, including 200 OK)

nile1-vm59:/var/log/10.249.231.37 # tailf syslog.log
May 21 02:28:48 orem-malachite ecss3 2021-05-21 02:28:41,856 0af9e725:1797ac943f5:3088:1 10.249.231.37:9020 10.249.231.35:34446 user1 curl/7.60.0 GET ns1 buck1 buck1 - HTTP/1.1 200 386 - 564 90 - - -
May 21 02:29:48 orem-malachite ecss3 2021-05-21 02:29:45,642 0af9e725:1797ac943f5:308c:1 10.249.231.37:9020 10.249.231.35:34610 user1 curl/7.60.0 GET ns1 buck1 1.pea - HTTP/1.1 200 265 - 1660 156 - - -
May 21 02:29:58 orem-malachite ecss3 2021-05-21 02:29:49,526 0af9e725:1797ac943f5:308f:1 10.249.231.37:9020 10.249.231.35:34616 user1 curl/7.60.0 GET ns1 buck1 2.pea - HTTP/1.1 404 17 - 171 9 - - -

b. Monitor 500 errors from dataheadsvc-access.log file (a 404 error was used for testing purposes)

nile1-vm59:/var/log/10.249.231.35 # tail -1 syslog.log
May 21 02:44:03 provo-malachite ecs 2021-05-21 02:44:03,637 0af9e723:1797aca85f8:3125:1 10.249.231.35:9020 10.249.231.35:52798 user1 curl/7.60.0 GET ns1 buck1 2.pea - HTTP/1.1 404 16 - 171 6 - - -
nile1-vm59:/var/log/10.249.231.35 #

c. Monitor only “Connection Limit(1000) reached” from dataheadsvc.log file

nile1-vm59:/var/log/10.249.231.36 # grep 'Connection Limit' syslog.log |tail -1
May 21 02:31:48 sandy-malachite ecs 2021-05-21T02:32:57,193 [qtp1578673831-1015] INFO ConnectionLimit.java (line 186) Connection Limit(1000) reached for [TrafficMetricsNetworkTrafficServerConnector@420d123d\{HTTP/1.1,[http/1.1]}{172.18.73.81:9020}, TrafficMetricsNetworkTrafficServerConnector@5ab2c5d8\{SSL,[ssl, http/1.1]}{172.18.73.81:9021}]
nile1-vm59:/var/log/10.249.231.36 #

d. Monitor summary of response code (200,403,404,500,503) every xx minutes:

nile1-vm59:/var/log/10.249.231.36 # tailf syslog.log
May 21 02:31:04 sandy-malachite ecs 2021-05-21 02:31:02,000 ECS_VDC_NAME:vdc#033[0m INFO Count of 200 response code in the last 2 day ago : 17
May 21 02:31:04 sandy-malachite ecs 2021-05-21 02:31:02,000 ECS_VDC_NAME:vdc#033[0m CRITICAL Count of 403 response code in the last 2 day ago : 5
May 21 02:31:04 sandy-malachite ecs 2021-05-21 02:31:02,000 ECS_VDC_NAME:vdc#033[0m CRITICAL Count of 404 response code in the last 2 day ago : 6
May 21 02:31:04 sandy-malachite ecs 2021-05-21 02:31:02,000 ECS_VDC_NAME:vdc#033[0m INFO Count of 500 response code in the last 2 day ago : 0
May 21 02:31:04 sandy-malachite ecs 2021-05-21 02:31:02,000 ECS_VDC_NAME:vdc#033[0m INFO Count of 503 response code in the last 2 day ago : 0
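
The kind of per-code summary shown above can be approximated with a short Python sketch. This is not the checkResponseCode.pl script referenced earlier; it is an illustration that assumes the status code directly follows the protocol token (for example "HTTP/1.1 404"), as in the sample dataheadsvc-access.log lines in this section. The function name is hypothetical.

```python
import re
from collections import Counter

# Assumption: the status code follows the protocol token, as in the sample lines.
STATUS_RE = re.compile(r"HTTP/\d\.\d (\d{3})\b")
CODES = ("200", "403", "404", "500", "503")

def summarize(lines):
    """Count the response codes of interest in access-log lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return {code: counts[code] for code in CODES}

# Access-log records taken from the example output in this section.
sample_lines = [
    "2021-05-21 02:28:41,856 0af9e725:1797ac943f5:3088:1 10.249.231.37:9020 "
    "10.249.231.35:34446 user1 curl/7.60.0 GET ns1 buck1 buck1 - HTTP/1.1 200 386 - 564 90 - - -",
    "2021-05-21 02:29:45,642 0af9e725:1797ac943f5:308c:1 10.249.231.37:9020 "
    "10.249.231.35:34610 user1 curl/7.60.0 GET ns1 buck1 1.pea - HTTP/1.1 200 265 - 1660 156 - - -",
    "2021-05-21 02:29:49,526 0af9e725:1797ac943f5:308f:1 10.249.231.37:9020 "
    "10.249.231.35:34616 user1 curl/7.60.0 GET ns1 buck1 2.pea - HTTP/1.1 404 17 - 171 9 - - -",
]
print(summarize(sample_lines))
```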


5 Monitoring ECS from Slack


This is applicable to customers who use Slack and are interested in getting basic ECS health alerts in a Slack
channel.

The prerequisite is a custom Slack app. Visit the Slack help center for more
information on how a Slack app can be created.

High level steps to create Slack App with incoming webhooks

1. Create a new Slack app in the workspace where you want to post messages.
2. From the Features page, toggle Activate incoming webhooks on.
3. Click Add new webhook to workspace.
4. Pick a channel that the app will post to, then click Authorize.
5. Use your incoming webhook URL to post a message to Slack.

Below is the list of checks that are performed in the sample script.

1. ECS version
2. xDr version
3. VDC capacity
4. Directory Table Status
5. KPI summary
6. Any errors/warning reported by xDr.

There are six steps in this whole process.

1. SSH to any ECS node in the VDC which needs to be monitored as admin user.

#ssh admin@$IP – where $IP is the ECS node IP address.

2. Download the ECSGetVDCStats.py and upload it the ECS node

https://object.ecstestdrive.com/ecstsguide/ECSGetVDCStats.py?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=132657591476211228%40ecstestdrive.emc.com%2F20210630%2FNone%2Fs3%2Faws4_request&X-Amz-Date=20210630T064321Z&X-Amz-Expires=99999&X-Amz-SignedHeaders=host&X-Amz-Signature=9f7b8dc039086dd371659766f669229308e3cd58a7a736bad46b7ab61c46ae30

Ecstsguide

3. Set up the ECSGetVDCStats.py script as a cron job. In the screenshot below, the ECSGetVDCStats.py script
is set to run every 2 hours and post the result.

# crontab -e


4. The script runs as a cron job and collects the data according to the schedule configured in the cron entry (see
the previous step).

5. The script consolidates and formats the collected data.


6. The results are posted to the desired Slack channel. To post the results, the script uses the Slack Python API
method chat.postMessage. API reference for chat.postMessage:
https://api.slack.com/methods/chat.postMessage

Details of some of the arguments used in the script:

'token': Slack token used to display the ECS data in the channel

Steps to get or generate the Slack token:

All members of the Slack workspace except guests have access to this feature.

This is available on all subscriptions.

More details on Slack app tokens: https://slack.com/intl/en-in/help/articles/215770388-Create-and-regenerate-API-tokens

'channel': Destination Slack channel where the data is displayed
'type': Style of the data to be displayed
'text': Data to be displayed in Slack
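For orientation, a chat.postMessage call can be sketched with the standard library alone. This is not the code from ECSGetVDCStats.py. It is an illustration that assumes the token is sent as a Bearer token and the body as JSON, which the Slack Web API accepts. The token and channel values are placeholders, build_post_message_request is a hypothetical helper name, and the 'type' argument is left out because it belongs to the message formatting rather than to the top-level call.

```python
import json
import urllib.request

SLACK_POST_MESSAGE_URL = "https://slack.com/api/chat.postMessage"


def build_post_message_request(token, channel, text):
    """Build (but do not send) a chat.postMessage request."""
    body = json.dumps({"channel": channel, "text": text}).encode("utf-8")
    return urllib.request.Request(
        SLACK_POST_MESSAGE_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )


# Placeholder token and channel for illustration only.
post_request = build_post_message_request(
    "xoxb-placeholder-token", "#ecs-health", "Stats for Vdc: vdc1"
)
# To actually post the message:
# urllib.request.urlopen(post_request, timeout=10)
```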

The sample output posted in the slack channel:

Stats for Vdc: vdc1


6 ECS Test Drive


A one-stop shop for testing ECS capabilities without owning ECS. Take our product for a test drive! Quickly
create a cloud storage service account and upload your content. Visit https://portal.ecstestdrive.com/
for more information about ECS Test Drive.

Registration Process

1. Visit https://portal.ecstestdrive.com/
2. Click the button shown below to get started.

portal.ecstestdrive.com

3. Create a new account by filling the form in the next screen.

ECS Test Drive

4. After successful registration, you will receive an email containing a link. Clicking the link completes
the process.
5. You will be presented with the EULA. Review the EULA, select the check box to indicate
acceptance, and click the submit button.
6. At this point all the provisioning is done: your namespace, namespace management user, and object
users are created and credentials are generated.


7 ECS repository on GitHub


https://github.com/EMCECS

7.1 ECSSync
ecs-sync is an open-source tool designed to migrate large amounts of data in parallel. This data can originate
from many different sources.

There are many reasons why you may need to migrate data. Tech refreshes, switching vendors, evacuating
EOL racks. Maybe your application team is starting to embrace the object paradigm and wants existing files to
become objects. Or perhaps you need to move sensitive data out of a public cloud. No matter the reason,
ecs-sync can probably help. It was written specifically to move large amounts of data across the network
while maintaining app association and metadata. With ecs-sync, you can copy an NFS export into an S3
bucket. You can migrate clips from Centera to ECS. You can even zip up an Atmos namespace folder into a
local archive. There are many use-cases it supports.

Using a set of plug-ins that can speak native protocols (file, S3, Atmos and CAS), ecs-sync queries the
source system for objects using CLI-, XML- or UI-configured parameters. It then streams these objects and
their metadata in parallel across the network, transforming/logging them through filters, and writes them to the
target system, updating app/DB references on success. There are many configuration parameters that affect
how it searches for objects and logs/transforms/updates references. See the Full CLI Syntax for more details
on what options are available.

A Note on Support
ecs-sync is an open-source tool. As such, there is no commercial support for its use (any support provided on
github is best-effort and community-based). If you plan on migrating your production data, you should
consider a Dell Professional Services migration package. The Dell PS team have extensive knowledge of
ecs-sync and a migration package comes with the full commercial support of Dell EMC engineering.

7.2 Code Samples


Please visit https://github.com/EMCECS/ecs-samples to explore various code samples for working with ECS
in Python, Java, .NET and many other programming languages.

7.3 Mongoose
The Mongoose 3.x.x documentation is available at https://github.com/emc-mongoose/mongoose/wiki

Mongoose is a tool that was initially intended to test ECS performance. It is designed to be used for:

• Load Testing
• Stress Testing
• Soak/Longevity/Endurance Testing
• Volume Testing
• Smoke/Sanity Testing

Mongoose can sustain millions of concurrent connections and millions of operations per second.

Please refer to the deployment page for the details.


7.4 Tools
• smart-client-java

o https://github.com/EMCECS/smart-client-java

• python-ecsclient

o https://github.com/EMCECS/python-ecsclient


8 Real World Examples

8.1 Performance Related Scenarios

8.1.1 Customer complained about timeouts when reading/writing during a given time interval
8.1.1.1 Things to check/do:
First, check for any DT down events or service restarts in the reported time frame.

• Check the DT status using the “(OE) DT Status” dashboard in the “Advanced Monitoring” section. Make sure to
cover the time range reported by the user.
• Check for any service restarts in the given time range using the “(OE) Service Restarts” dashboard in the
“Advanced Monitoring” section.

In most cases, performance issues are caused by DT-related events or service restarts. When a service
restarts, certain DTs also go down for some time while the service comes up. If a service restarted
(mainly dataheadsvc, blobsvc, or cm), that would explain the latency/timeouts experienced by the user at that
time. You can tell the user that a service restart occurred and caused performance issues during that time.
Please contact Dell EMC Support for further help.

8.1.2 Customer complaining of latency issue


8.1.2.1 Things to check/do:
Latency issues are mostly due to memory pressure on ECS object services. In addition to the checks in the
first scenario, verify the following:

• Using the “Data Access Performance - Overview” dashboard, verify whether there was a sudden spike in the
number of requests at that time. A sudden increase in the number of requests may cause memory pressure and
lead to latency issues. Check whether the spike is expected and confirm this from the application end.
• Verify whether requests are balanced across nodes, i.e., whether all nodes are receiving a similar number
of requests.
• Using the “(OE) Processes on Host” dashboard, verify that resource usage is within normal limits.
• Check for any service restarts.
• Open a ticket with Dell EMC Support for further help.

8.1.3 Customer noticed the average write latency has gone up in the last 2 hours.
8.1.3.1 Things to check/do:
Note that only the write latency has increased, not the read latency. If large files are being
uploaded, the time taken to upload them is expected to increase. Check the transactions for the
last 2 hours using svc_request -start "2 hours ago" -stop "now" summary and check whether the objects being
uploaded are unusually large. Please see the screenshot below for more details.


svc_request

8.2 Object Read/write Related Scenarios

8.2.1 Customer is not able to write and getting HTTP 403, Access Denied error code.
8.2.1.1 Things to check/do:
HTTP error code 403 means “Access Denied” in most cases. The 403 errors can be verified using the command
kpi.sh -s -start "X mins ago" shown in the CLI section. There can be multiple causes, but the main things to
verify are:

• Check that the user has the correct permissions and is using the correct credentials. Check permissions in
UI->Manage->Buckets: select the namespace/bucket, edit the bucket, edit the ACL, and review the user ACL.
• Check that the time on the client side is in sync with the time on the ECS nodes.
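The time check can be approximated from the client by comparing the Date header of any ECS HTTP response with the local clock. The sketch below only does the comparison on values passed in. The 15-minute tolerance is the window Amazon S3 documents for signed requests and is an assumption here, so confirm the value that applies to your ECS release. clock_skew is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Assumed tolerance (Amazon S3 uses 15 minutes); verify the value for ECS.
MAX_SKEW = timedelta(minutes=15)


def clock_skew(server_date_header, client_now):
    """Return the absolute difference between the server Date header and the client clock."""
    server_time = parsedate_to_datetime(server_date_header)
    return abs(client_now - server_time)


# Example values: a Date header from a response and a client clock reading.
skew = clock_skew(
    "Tue, 25 May 2021 11:03:52 GMT",
    datetime(2021, 5, 25, 11, 25, 0, tzinfo=timezone.utc),
)
skew_too_large = skew > MAX_SKEW
```

In practice client_now would be datetime.now(timezone.utc), captured at the moment the response is received.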

8.2.2 Customer is not able to write and getting HTTP 403, Method Forbidden error code
8.2.2.1 Things to check/do:
HTTP error code 403 may also indicate a “Method Forbidden” error. The 403 errors can be verified using the
command kpi.sh -s -start "X mins ago" shown in the CLI section. This is mostly due to the quota limit being
exceeded for the bucket. Verify the following:

• From the UI, check the quota limit set for the bucket (UI->Manage->Buckets).
• From the UI, check the quota limit set for the namespace (UI->Manage->Namespace).
• Check the current capacity utilization of the bucket using Metering (UI->Monitor->Metering) or using
svc_bucket info <bucket_name>.
• Increase the quota limit if needed, or inform the client of the usage limit.
• Open a case with Dell EMC Support if the limit is not reached but a user is still getting a quota limit
reached error.

Bucket Management


8.2.3 Customer is not able to read few objects and getting HTTP 404 return code.
8.2.3.1 Things to check/do:
HTTP error code 404 means the object was not found on ECS. Verify the following:

• Run svc_request -on $OBJECTNAME summary for the object in question and confirm that 404 is returned for the
GET operation on this object.
• Check whether the object was ever written to ECS, using the application logs.
• Check whether the last update on the object has a dmarker (if dmarker is true, it is a deleted object and
404 is expected).

svc_request -on $OBJECTNAME

8.2.4 Customer is not able to delete an object; HTTP 409 error code was returned.
8.2.4.1 Things to check/do:
A 409 error when trying to delete an object means that the object is under a retention period
and cannot be deleted.

• Verify the bucket retention period using the ECS REST API: GET /object/bucket/{bucketName}/retention
• Check with the bucket owner and modify the policy if needed.

Note: Retention can be set at the namespace, bucket, and object level. The maximum retention value is
enforced, so check the retention setting at all three levels.
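The "maximum value wins" rule can be illustrated with a few lines of Python. This is a sketch of the rule as stated in the note, not ECS code. It assumes retention periods expressed in seconds, with 0 meaning no retention at that level, and it ignores special values such as infinite retention. Both function names are hypothetical.

```python
def effective_retention_seconds(namespace=0, bucket=0, obj=0):
    """Return the retention that is enforced: the largest of the three levels."""
    return max(namespace, bucket, obj)


def can_delete(object_age_seconds, namespace=0, bucket=0, obj=0):
    """True when the object is older than the enforced retention period."""
    return object_age_seconds >= effective_retention_seconds(namespace, bucket, obj)


# Example: bucket retention of 1 day, namespace retention of 7 days, none on the object.
ONE_DAY = 24 * 60 * 60
enforced = effective_retention_seconds(namespace=7 * ONE_DAY, bucket=ONE_DAY)
deletable_after_two_days = can_delete(2 * ONE_DAY, namespace=7 * ONE_DAY, bucket=ONE_DAY)
```

In this example a delete two days after creation would still fail, even though the bucket-level retention has already expired, because the namespace-level retention is longer.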

8.2.5 Customer was unable to write to ECS.
8.2.5.1 Things to check/do:
The DT status was checked using the svc_dt check tool, and all the DTs were ready. The error report was then
checked using kpi.sh -s -start "5 mins ago", and only writes (PUT and POST) were hitting errors, not reads
(GET). The service status was also checked using the svc_node tool, and none of the services were restarting.

Since reads were fine, the DTs were ready, and there were no service restarts, capacity was checked using the
svc_vdc capacity tool. There was no free space left, which is why writes were failing.

If the overall used capacity is at 90%, writes are not allowed. Note that a successful write requires a
minimum of 3 nodes whose used capacity is below 90%.
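The capacity rule above can be expressed as a simple check over per-node utilization figures. The sketch below is only an illustration of the rule as written: the node names and percentages are made up, the helper names are hypothetical, and real figures would come from the capacity tooling.

```python
WRITE_THRESHOLD_PERCENT = 90.0   # writes are blocked at this used capacity
MIN_NODES_BELOW_THRESHOLD = 3    # nodes below the threshold needed for a write


def nodes_below_threshold(node_used_percent):
    """Return the names of nodes whose used capacity is below the threshold."""
    return [
        node for node, used in node_used_percent.items()
        if used < WRITE_THRESHOLD_PERCENT
    ]


def writes_possible(node_used_percent):
    """True when enough nodes still have room for a successful write."""
    return len(nodes_below_threshold(node_used_percent)) >= MIN_NODES_BELOW_THRESHOLD


# Hypothetical utilization figures for a five-node VDC.
usage = {"node1": 95.2, "node2": 91.0, "node3": 89.5, "node4": 88.0, "node5": 93.4}
eligible = nodes_below_threshold(usage)
ok_to_write = writes_possible(usage)
```

With these figures only two nodes are below 90%, so writes would be expected to fail until capacity is freed or added.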

8.2.6 Application reports 501 errors.


8.2.6.1 Things to check/do:
Running the command kpi.sh -s -start '6 hours ago' reports the 501 errors in the summary report.

In this instance, dataheadsvc.log showed that the application was requesting the logging, requestPayment,
tagging, and website operations, which are not supported/implemented, so ECS returned a 501 error.

This behavior is expected when the requested functionality is not implemented, and it is documented on the
error code page. The application should be updated to stop calling those APIs or to expect a 501 error code
from ECS.

Refer to the “Unsupported S3 API” section in the data access guide:


http://doc.isilon.com/ECS/3.5/DataAccessGuide/GUID-CA0B1CAA-35BA-433D-8EB3-304DB47BE3CC.html

kpi.sh -s -start

8.2.7 Customer reported 500 errors.


8.2.7.1 Things to check/do:
The customer had 5 nodes, and a capacity expansion was done because of a capacity issue. Soon after the node
expansion was complete, the customer started seeing 500 errors. kpi.sh -s -start "5 mins ago" was run to
confirm that 500 errors were actively being logged.

The expansion node was on the same ECS software version as the others, and there were no service restarts or
DT unready issues.

svc_network check all and the latest version of xdoctor were run and detected a duplicate IP address in
the network that was causing the issue. The customer shut down the VM that had been assigned the same IP
address, and after that 500 errors were no longer reported.


8.2.8 Customer reported 500 errors.


8.2.8.1 Things to check/do:
kpi.sh -s -start "5 mins ago" was executed to check the error status and showed an error rate of 17% (the
kpi.sh tool reports the error rate as well).

The DT status was checked using the svc_dt check tool, and all the DTs were ready.

svc_network check all and svc_tso heartbeat reported connection issues to the remote VDC. If the
connection/heartbeat between the VDCs in a federation is not working for 15 minutes (the default, which is
configurable), a Temporary Site Outage (TSO) is triggered. There was an issue with a switch on the customer
side, and the vendor was engaged to resolve the network issue between the 2 VDCs.

svc_tso summary was run to check the TSO status and confirmed a TSO condition. Once the network issue was
resolved, the system came out of TSO and the 500 errors were no longer reported.

8.2.9 One of the customer applications is not able to write to ECS.


8.2.9.1 Things to check:
kpi.sh -s -start "5 mins ago" was executed and showed that all requests were successful. The end user was
asked to provide the name of one object that they were not able to write and the bucket to which it
belongs.

svc_request -on $OBJECTNAME summary was run and showed no request for this object, so kpi.sh -s -bucket
$bucketname was run and showed that there were no transactions at all for this bucket.

Further investigation on the load balancer side revealed a network issue at the load balancer,
which was causing the problem.

The issue was resolved once the network problem at the load balancer was fixed.

8.3 Bucket Related Scenarios

8.3.1 Customer is not able to delete bucket from ECS UI


8.3.1.1 Things to check/do:
A few important things to verify before deleting a bucket:

• Make sure the bucket is empty. If it is not, use S3 Browser (for an S3 bucket), or any other tool, to delete
the bucket contents first.
• Check that the user has sufficient permission to delete the bucket.

8.3.2 Customer wants to know which bucket is highest on capacity/objects


8.3.2.1 Things to check/do:
Check the “Top Buckets” dashboard in Advanced Monitoring. It shows a list of buckets sorted by capacity. The
capacity shown for each bucket is at the VDC level, i.e., how much data was written to this bucket on this VDC.


Top Buckets by Size

You can also view count of objects in each bucket:

Top Buckets by Object Count

You can get similar info using svc_bucket info <bucket_name>, but that’s federation level data, as opposed to
vdc level data in dashboard above.


8.4 Metering Related Scenarios

8.4.1 End user complaining about a discrepancy in bucket utilization


8.4.1.1 Things to check/do:
Using svc_bucket info <bucket_name>, get the current object size and object count. Alternatively, the same
information is available from the UI.

• Log in to ECS UI-->Monitor-->Metering page.

Metering Page

Verify whether the size and object count reported by the end user match what ECS is reporting. If they do not,
there is a metering discrepancy, which is generally due to one of the following reasons.

• Incomplete MPU
• High number of non-current object versions.
• Compression of the data at the chunk layer.

Please contact Dell EMC Support to investigate the discrepancy and take the necessary action to correct
it.

8.5 RPO/Replication Related Scenarios

8.5.1 ECS UI shows RPO not up to date


8.5.1.1 Things to check/do:
A few important things to verify when RPO is not up to date:

• If RPO is a few seconds/minutes, a large amount of data may have been ingested recently. Wait for
some time for the data to be copied, and check RPO again.
• The screenshot below from the ECS UI shows that RPO is NOT up to date.


Geo Monitoring

• If RPO doesn’t come down and continues to increase, verify the replication network bandwidth between VDCs.
• Using svc_replicate summary, check whether tasks in the geo replication queue are moving. If any node doesn’t
show any activity, it may have a problem.
• Open an SR with Dell EMC Support if RPO continues to show lag.

8.6 UI Related Scenarios

8.6.1 Customer logged in to UI and found a node offline. Also unable to ssh to the
node in question
8.6.1.1 Things to check/do:
• Check the System Event Log (SEL) for any CATERR (catastrophic error) or Processor IERR (Internal Error).
Run the ipmitool command from a healthy node against the BMC IP/private IPMI IP of the problematic node.

getrackinfo -v (run this first to get the BMC/private IPMI IP of the node in question)


getrackinfo -v

sudo ipmitool -I lanplus -H <BMC IP/private IPMI> -U root -P passwd sel elist

If the node cannot be brought back online, please open an SR with Dell EMC.

8.6.2 End user complaining that bucket utilization in UI is not reducing after deleting the objects.
8.6.2.1 Things to check/do:
As a first step, check whether the user data and system metadata GC processes are enabled. This can be checked
from the ECS UI.

• Log in to ECS UI--> Monitor--> Capacity Utilization--->Garbage Collection (tab)


Capacity Utilization

The high-level delete workflow in ECS is described below.

Delete Work Flow

The garbage reclaim rate can be checked using svc_gc rates reclaim.

Using svc_gc stats repo and svc_gc stats btree, check whether the reclaimable garbage is high (in TBs).

8.6.3 Customer is not able to ssh to the ECS node.
8.6.3.1 Things to check
Ping the ECS nodes from your workstation and verify that network connectivity is fine.

If it is, verify that you can log in to the ECS UI and check whether any of the nodes are reported as
offline in the UI (Monitor--> System Health-->Offline Nodes).

If no nodes are reported as offline, navigate to Settings--> Platform Locking and verify whether the nodes are
locked. If the nodes are locked from the platform, you will not be able to ssh to the ECS nodes.


Platform Locking

If all the nodes are unlocked, verify whether you can ssh from another workstation.

8.6.4 Customer reported that ECS is not dialing home.
8.6.4.1 Things to check/do:
Log in to the ECS UI and, in the Settings tab, verify that the ESRS server is reported as connected. If it is
not, verify the network connectivity between ECS and the ESRS server. If it is showing as connected, fire a
test dial home alert.

EMC Secure Remote Services Management

If the dial home alert is still not received, please contact the Dell EMC Support team.

8.7 Object Lifecycle Related Scenarios

8.7.1 Objects not expiring even after setting a life-cycle policy for a bucket.
8.7.1.1 Things to check/do:
Using svc_bucket info <bucket_name>, verify that the bucket policy is properly applied: the name of the policy
that was set up should match the policy applied to the bucket in question, and the expiry date in the bucket
policy should be correct.

If the policy is correct but the objects are still not expiring, please contact Dell EMC Support for further
investigation.


8.8 Certificate Related Scenarios

8.8.1 Not able to open ECS UI after uploading the certificate to the Mgmt. interface
8.8.1.1 Things to check/do:
If the uploaded certificate is malformed because of a stray new line or space, the nginx service will fail
to restart after the upload. Upload the corrected certificate again using the procedure in the ECS admin
guide; that should fix the problem.

The same issue can occur if the certificate chain is broken, so have the certificate validated by the
signing authority to make sure the certificate you are uploading to ECS is valid.

The procedure for validating a certificate using OpenSSL can be used for checking/troubleshooting
certificate-related problems.

8.9 DellEMC Data Domain/ECS Related Scenarios

8.9.1 Not able to Read/Write from DD to ECS


From the DD server that is unable to read from or write to ECS, verify the cloud tier status and check whether
the cloud tier is showing active. In the example below, ECS_DD_new, which was configured with cloud profile
ecs_dd_new, is reporting ACTIVE. If so, retry the read/write operation from the cloud tier and it should succeed.

Use the kpi.sh script (details in the CLI section) to see whether there were any 500 errors at the time when
ECS got disconnected from DD. If so, these 500 errors may be the reason for the disconnection. Contact the
support team for further investigation into the RCA.

The same applies during DD cloud tier cleanup: if there are 500 errors on ECS during that time, for whatever
reason, DD will disconnect from ECS. Contact the support team for further investigation into the errors before
retrying the DD cloud tier cleanup.

8.10 Base URL Related Scenarios

8.10.1 New Application reporting 500 error


An example of the 500 error is shown below.

169.254.154.1 2021-05-25 11:03:52,630 0af69879:17980be40c1:72d6:3f 10.246.152.121:9020 10.231.82.31:39378 devuser - POST dev cta_10.231.82.31_1621940696_15703_1 cta_10.231.82.31_1621940696_15703_1 uploads HTTP/1.1 500 23 - - 15 - - -


The new application is sending requests with host-style addressing. Because the appropriate Base URL was not
set up, ECS interpreted them as path-style requests, which led to the 500 error.

Refer to the admin guide for how the Base URL should be pre-configured, based on how the application is going
to send requests.

URL Format

Host Style: http://bucketname.ns1.emc.com/

Path Style: http://ns1.emc.com/bucketname

The Base URL used in a host-style URL should be pre-configured using the ECS Management API or the ECS
Portal (for example, emc.com in URL: bucketname.ns1.emc.com)
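The two URL formats can be made concrete with a small sketch that builds both styles and shows how a host name is split once the Base URL is known. It mirrors only the example above (bucket, then namespace, then Base URL) and is not how ECS itself parses requests. All three function names are hypothetical.

```python
def host_style_url(bucket, namespace, base_url, scheme="http"):
    """Bucket and namespace are both part of the host name."""
    return f"{scheme}://{bucket}.{namespace}.{base_url}/"


def path_style_url(bucket, namespace, base_url, scheme="http"):
    """Only the namespace is in the host name; the bucket is in the path."""
    return f"{scheme}://{namespace}.{base_url}/{bucket}"


def bucket_from_host(host, base_url):
    """Return the bucket for a host-style host name, or None if it is not host style.

    Assumes the layout used in the example: <bucket>.<namespace>.<base_url>.
    """
    suffix = "." + base_url
    if not host.endswith(suffix):
        return None
    labels = host[: -len(suffix)].split(".")
    return labels[0] if len(labels) == 2 else None


host_url = host_style_url("bucketname", "ns1", "emc.com")
path_url = path_style_url("bucketname", "ns1", "emc.com")
bucket = bucket_from_host("bucketname.ns1.emc.com", "emc.com")
```

Without a configured Base URL there is no suffix to strip, so the bucket cannot be recovered from the host name; this is the situation described in the scenario above.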

8.11 Erasure Coding Related Scenarios

8.11.1 The storage efficiency is high


Erasure coding (12 data + 4 code) is the protection mechanism used to protect the data. It runs as a
background process, so if a lot of data still needs to be converted from mirror copies to EC protection,
this can affect the overall storage efficiency.
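The effect of pending EC conversion on storage efficiency can be estimated with simple arithmetic. In the sketch below the 12 + 4 scheme comes from the text, while the number of mirror copies (3) is an assumption made for illustration. The function names and the byte counts in the example are hypothetical.

```python
def ec_overhead(data_fragments=12, coding_fragments=4):
    """Raw capacity consumed per unit of user data once it is erasure coded."""
    return (data_fragments + coding_fragments) / data_fragments


def blended_overhead(ec_bytes, pending_bytes, mirror_copies=3):
    """Overall overhead when some data is still mirrored and waiting for EC.

    mirror_copies=3 is an assumption for this illustration.
    """
    raw = ec_bytes * ec_overhead() + pending_bytes * mirror_copies
    return raw / (ec_bytes + pending_bytes)


# Example: 90 TB already erasure coded, 10 TB still waiting for conversion.
fully_converted = ec_overhead()
with_backlog = blended_overhead(ec_bytes=90, pending_bytes=10)
```

Under these assumptions, a 10% backlog raises the overall overhead from about 1.33x to about 1.5x, which is why a large amount of pending data shows up in the storage efficiency figure.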

Log in to the ECS UI. The Dashboard reports the amount of data pending conversion to EC. A busy ECS system
always has some data waiting to be erasure coded, but if a large amount of data is pending, please contact
support for further investigation.

Storage Efficiency


9 Additional Information

ECS Product Support: https://www.dell.com/support/home/en-us/product-support/product/ecs-appliance-/docs

It includes Knowledge Base articles, manuals and documents.

Dell Technologies Confidential Information version: 2.3.6.91
