IBM - IBM Storage Scale 5.1.9 Problem Determination Guide (2024)
5.1.9
IBM
SC28-3476-02
Note
Before using this information and the product it supports, read the information in “Notices” on page
925.
This edition applies to Version 5 release 1 modification 9 of the following products, and to all subsequent releases and
modifications until otherwise indicated in new editions:
• IBM Storage Scale Data Management Edition ordered through Passport Advantage® (product number 5737-F34)
• IBM Storage Scale Data Access Edition ordered through Passport Advantage (product number 5737-I39)
• IBM Storage Scale Erasure Code Edition ordered through Passport Advantage (product number 5737-J34)
• IBM Storage Scale Data Management Edition ordered through AAS (product numbers 5641-DM1, DM3, DM5)
• IBM Storage Scale Data Access Edition ordered through AAS (product numbers 5641-DA1, DA3, DA5)
• IBM Storage Scale Data Management Edition for IBM® ESS (product number 5765-DME)
• IBM Storage Scale Data Access Edition for IBM ESS (product number 5765-DAE)
• IBM Storage Scale Backup ordered through Passport Advantage® (product number 5900-AXJ)
• IBM Storage Scale Backup ordered through AAS (product numbers 5641-BU1, BU3, BU5)
• IBM Storage Scale Backup for IBM® Storage Scale System (product number 5765-BU1)
Significant changes or additions to the text and illustrations are indicated by a vertical line (|) to the left of the change.
IBM welcomes your comments; see the topic “How to send your comments” on page xlii. When you send information
to IBM, you grant IBM a nonexclusive right to use or distribute the information in any way it believes appropriate without
incurring any obligation to you.
© Copyright International Business Machines Corporation 2015, 2024.
US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with
IBM Corp.
Contents
Tables
Summary of changes
Other information about mmpmon output
Using the performance monitoring tool
Configuring the performance monitoring tool
Starting and stopping the performance monitoring tool
Restarting the performance monitoring tool
Configuring the metrics to collect performance data
Removing non-detectable resource identifiers from the performance monitoring tool database
Measurements
Viewing and analyzing the performance data
Performance monitoring using IBM Storage Scale GUI
Viewing performance data with mmperfmon
Using IBM Storage Scale performance monitoring bridge with Grafana
File system performance information
Storage pool information
Disk status information
Disk configuration information
Disk performance information
Net-SNMP traps
Chapter 11. Monitoring the IBM Storage Scale system by using call home
Uploading custom files using call home
Time stamp in GPFS log entries
Logs
Setting up core dumps on a client RHEL or SLES system
Configuration changes required on protocol nodes to collect core dump data
Setting up an Ubuntu system to capture crash files
Trace facility
Collecting diagnostic data through GUI
CLI commands for collecting issue details
Using the gpfs.snap command
mmdumpperfdata command
mmfsadm command
Commands for GPFS cluster state information
GPFS file system and disk information commands
Collecting details of the issues from performance monitoring tools
Other problem determination tools
Installation of gpfs.gpfsbin reports an error
GPFS application calls
Error numbers specific to GPFS application calls
GPFS modules cannot be loaded on Linux
GPFS daemon issues
GPFS daemon does not come up
GPFS daemon went down
GPFS commands are unsuccessful
GPFS error messages for unsuccessful GPFS commands
Quorum loss
CES configuration issues
Application program errors
GPFS error messages for application program errors
Windows issues
Home and .ssh directory ownership and permissions
Problems running as Administrator
GPFS Windows and SMB2 protocol (CIFS serving)
The remote cluster name does not match the cluster name supplied by the mmremotecluster command
Contact nodes down or GPFS down on contact nodes
GPFS is not running on the local node
The NSD disk does not have an NSD server specified, and the mounting cluster does not have direct access to the disks
The cipherList option has not been set properly
Remote mounts fail with the "permission denied" error message
Unable to determine whether a file system is mounted
GPFS error messages for file system mount status
Multiple file system manager failures
GPFS error messages for multiple file system manager failures
Error numbers specific to GPFS application calls when file system manager appointment fails
Discrepancy between GPFS configuration data and the on-disk data for a file system
Errors associated with storage pools, filesets and policies
A NO_SPACE error occurs when a file system is known to have adequate free space
Negative values occur in the 'predicted pool utilizations', when some files are 'ill-placed'
Policies - usage errors
Errors encountered with policies
Filesets - usage errors
Errors encountered with filesets
Storage pools - usage errors
Errors encountered with storage pools
Snapshot problems
Problems with locating a snapshot
Problems not directly related to snapshots
Snapshot usage errors
Snapshot status errors
Snapshot directory name conflicts
Errors encountered when restoring a snapshot
Failures using the mmbackup command
GPFS error messages for mmbackup errors
IBM Storage Protect error messages
Data integrity
Error numbers specific to GPFS application calls when data integrity may be corrupted
Messages requeuing in AFM
NFSv4 ACL problems
Strict replication
No replication
GPFS error messages for disk media failures
Error numbers specific to GPFS application calls when disk failure occurs
Persistent Reserve errors
Understanding Persistent Reserve
Checking Persistent Reserve
Clearing a leftover Persistent Reserve reservation
Manually enabling or disabling Persistent Reserve
GPFS is not using the underlying multipath device
Kernel panics with the message "GPFS deadman switch timer has expired and there are still outstanding I/O requests"
Authenticating the object service
Authenticating or using the object service
Accessing resources
Connecting to the object services
Creating a path
Constraints for creating objects and containers
The Bind password is used when the object authentication configuration has expired
The password used for running the keystone command has expired or is incorrect
The LDAP server is not reachable
The TLS certificate has expired
The TLS CACERT certificate has expired
The TLS certificate on the LDAP server has expired
The SSL certificate has expired
Users are not listed in the OpenStack user list
The error code signature does not match when using the S3 protocol
The swift-object-info output does not display
Swift PUT returns the 202 error and S3 PUT returns the 500 error due to the missing time synchronization
Unable to generate the accurate container listing by performing a GET operation for unified file and object access container
Fatal error in object configuration during deployment
Object authentication configuration fatal error during deployment
Unrecoverable error in object authentication during deployment
Tracing the mmpmon command
Recovering cluster configuration by using CCR
Recovering from a single quorum or non-quorum node failure
Recovering from the loss of a majority of quorum nodes
Recovering from damage or loss of the CCR on all quorum nodes
Recovering from an existing CCR backup
Repair of cluster configuration information when no CCR backup is available
Repair of cluster configuration information when no CCR backup information is available: mmsdrrestore command
IBM and accessibility
Notices
Trademarks
Terms and conditions for product documentation
Glossary
Index
Tables
2. Conventions
3. System health monitoring options that are available in IBM Storage Scale GUI
4. Notification levels
5. Notification levels
13. Keywords and values for the mmpmon nlist add response
14. Keywords and values for the mmpmon nlist del response
15. Keywords and values for the mmpmon nlist new response
21. Keywords and values for the mmpmon rhist off response
24. Keywords and values for the mmpmon rhist reset response
28. Keywords and values for the mmpmon rpc_s size response
31. Resource types and the sensors responsible for them
41. AFM to cloud object storage states and their description
49. gpfsDiskStatusTable: Disk status information
67. Common questions in AFM to cloud object storage with their resolution
74. Events for the Callhome component
99. Events for the SMB component
About this information
This edition applies to IBM Storage Scale version 5.1.9 for AIX®, Linux®, and Windows.
IBM Storage Scale is a file management infrastructure, based on IBM General Parallel File System (GPFS)
technology, which provides unmatched performance and reliability with scalable access to critical file
data.
To find out which version of IBM Storage Scale is running on a particular AIX node, enter:
lslpp -l gpfs\*
To find out which version of IBM Storage Scale is running on a particular Linux node, enter:
rpm -qa | grep gpfs (for SLES and Red Hat Enterprise Linux)
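For example, on a Red Hat Enterprise Linux node, the query might return package names similar to the following. The package set and version strings shown here are illustrative only:
rpm -qa | grep gpfs
gpfs.base-5.1.9-0.x86_64
gpfs.gpl-5.1.9-0.noarch
gpfs.docs-5.1.9-0.noarch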
To find out which version of IBM Storage Scale is running on a particular Windows node, open Programs
and Features in the control panel. The IBM Storage Scale installed program name includes the version
number.
Which IBM Storage Scale information unit provides the information you need?
The IBM Storage Scale library consists of the information units listed in Table 1 on page xxii.
To use these information units effectively, you must be familiar with IBM Storage Scale and the AIX,
Linux, or Windows operating system, or all of them, depending on which operating systems are in use at
your installation. Where necessary, these information units provide some background information relating
to AIX, Linux, or Windows. However, more commonly they refer to the appropriate operating system
documentation.
Note: Throughout this documentation, the term "Linux" refers to all supported distributions of Linux,
unless otherwise specified.
• mmcrnsd command
• mmcrsnapshot command
• mmdefedquota command
• mmdefquotaoff command
• mmdefquotaon command
• mmdefragfs command
• mmdelacl command
• mmdelcallback command
• mmdeldisk command
• mmdelfileset command
• mmdelfs command
• mmdelnode command
• mmdelnodeclass command
• mmdelnsd command
• mmdelsnapshot command
• mmdf command
• mmdiag command
• mmdsh command
• mmeditacl command
• mmedquota command
• mmexportfs command
• mmfsck command
• mmfsckx command
• mmfsctl command
• mmgetacl command
• mmgetstate command
• mmhadoopctl command
• mmhdfs command
• mmhealth command
• mmimgbackup command
• mmimgrestore command
• mmimportfs command
• mmkeyserv command
• mmlsfileset command
• mmlsfs command
• mmlslicense command
• mmlsmgr command
• mmlsmount command
• mmlsnodeclass command
• mmlsnsd command
• mmlspolicy command
• mmlspool command
• mmlsqos command
• mmlsquota command
• mmlssnapshot command
• mmmigratefs command
• mmmount command
• mmnetverify command
• mmnfs command
• mmnsddiscover command
• mmobj command
• mmperfmon command
• mmpmon command
• mmprotocoltrace command
• mmpsnap command
• mmputacl command
• mmqos command
• mmquotaoff command
• mmquotaon command
• mmreclaimspace command
• mmremotecluster command
• mmremotefs command
• mmrepquota command
• mmrestoreconfig command
• mmrestorefs command
• mmrestrictedctl command
• mmrestripefile command
• mmsnapdir command
• mmstartup command
• mmstartpolicy command
• mmtracectl command
• mmumount command
• mmunlinkfileset command
• mmuserauth command
• mmwatch command
• mmwinservctl command
• mmxcp command
• spectrumscale command
Programming reference
• IBM Storage Scale Data
Management API for GPFS
information
• GPFS programming interfaces
• GPFS user exits
• IBM Storage Scale management
API endpoints
• Considerations for GPFS
applications
IBM Storage Scale: Big Data and Analytics Guide
Cloudera HDP 3.X
• Planning
• Installation
• Upgrading and uninstallation
• Configuration
• Administration
• Limitations
• Problem determination
Open Source Apache Hadoop
• Open Source Apache Hadoop without CES HDFS
• Open Source Apache Hadoop with CES HDFS
Intended users:
• System administrators of IBM Storage Scale systems
• Application programmers who are experienced with IBM Storage Scale systems and familiar with the terminology and concepts in the XDSM standard
IBM Storage Scale Data Access Service
This guide provides the following information:
• Overview
• Architecture
• Security
• Planning
• Installing and configuring
• Upgrading
• Administering
• Monitoring
• Collecting data for support
• Troubleshooting
• The mmdas command
• REST APIs
Intended users:
• System administrators of IBM Storage Scale systems
• Application programmers who are experienced with IBM Storage Scale systems and familiar with the terminology and concepts in the XDSM standard
Table 2. Conventions
Convention Usage
bold Bold words or characters represent system elements that you must use literally,
such as commands, flags, values, and selected menu options.
Depending on the context, bold typeface sometimes represents path names,
directories, or file names.
bold underlined Bold underlined keywords are defaults. These take effect if you do not specify a
 different keyword.
italic Italic words or characters represent variable values that you must supply.
Italics are also used for information unit titles, for the first use of a glossary term,
and for general emphasis in text.
<key> Angle brackets (less-than and greater-than) enclose the name of a key on the
keyboard. For example, <Enter> refers to the key on your terminal or workstation
that is labeled with the word Enter.
\ In command examples, a backslash indicates that the command or coding example
continues on the next line.
{item} Braces enclose a list from which you must choose an item in format and syntax
descriptions.
[item] Brackets enclose optional items in format and syntax descriptions.
<Ctrl-x> The notation <Ctrl-x> indicates a control character sequence. For example,
<Ctrl-c> means that you hold down the control key while pressing <c>.
item... Ellipses indicate that you can repeat the preceding item one or more times.
| In synopsis statements, vertical lines separate a list of choices. In other words, a
vertical line means Or.
In the left margin of the document, vertical lines indicate technical changes to the
information.
Note: CLI options that accept a list of option values delimit the values with a comma and no space between
them. As an example, to display the state on three nodes, use mmgetstate -N NodeA,NodeB,NodeC.
Exceptions to this syntax are listed specifically within the command.
Summary of changes
for IBM Storage Scale 5.1.9
as updated, February 2024
This release of the IBM Storage Scale licensed program and the IBM Storage Scale library includes the
following improvements. All improvements are available after an upgrade, unless otherwise specified.
• Commands, data types, and programming APIs
• Messages
• Stabilized, deprecated, and discontinued features
AFM and AFM DR-related changes
• AFM DR is supported in a remote destination routing (RDR) environment.
• Added support for the getOutbandList option for out-of-band metadata population for a GPFS
backend. For more information, see the mmafmctl command in the IBM Storage Scale: Command
and Programming Reference Guide.
• AFM online dependent fileset can be created and linked in the AFM DR secondary fileset without
stopping the fileset by using the afmOnlineDepFset parameter. For more information, see the
mmchconfig command, in the IBM Storage Scale: Command and Programming Reference Guide and
the Online creation and linking of a dependent fileset in AFM DR section in the IBM Storage Scale:
Concepts, Planning, and Installation Guide.
• Added sample tools for the AFM external caching to S3 servers in a sample directory.
/usr/lpp/mmfs/samples/pcache/
drwxr-xr-x 3 root root 129 Oct 8 11:45 afm-s3-tests
drwxr-xr-x 2 root root 86 Oct 8 11:45 mmafmtransfer-s3-tests
Cloudkit changes
• Cloudkit adds support for Google Cloud Platform (GCP).
• Cloudkit enhancement to support AWS cluster upgrades.
• Cloudkit enhancement to support scale-out of AWS cluster instances.
Discontinuation of the CES Swift Object protocol feature
• CES Swift Object protocol feature is not supported from IBM Storage Scale 5.1.9 onwards.
• IBM Storage Scale 5.1.8 is the last release that has CES Swift Object protocol.
• IBM Storage Scale 5.1.9 will tolerate the update of a CES node from IBM Storage Scale 5.1.8.
– Tolerate means:
- The CES node will be updated to 5.1.9.
- Swift Object support will not be updated as part of the 5.1.9 update.
- You may continue to use the version of Swift Object protocol that was provided in IBM Storage
Scale 5.1.8 on the CES 5.1.9 node.
- IBM will provide usage and known defect support for the version of Swift Object that was
provided in IBM Storage Scale 5.1.8 until you migrate to a supported object solution that IBM
Storage Scale provides.
• Please contact IBM for further details and migration planning.
File system core improvements
• The dynamic pagepool feature is now available in IBM Storage Scale. The feature adjusts the size of
the pagepool memory dynamically. For more information, see the Dynamic pagepool section in IBM
Storage Scale: Concepts, Planning, and Installation Guide.
• The GPFSBufMgr sensor has been added to the performance monitoring tool. Issue the mmperfmon
config add command to add the sensor to IBM Storage Scale 5.1.9; a sketch follows this list. For more
information, see GPFSBufMgr in the GPFS metrics section, in the IBM Storage Scale: Problem Determination Guide.
• Enhanced node expel logic has been added in IBM Storage Scale. The expel logic addresses the
issue of a single node experiencing communication issues resulting in other nodes being expelled
from the cluster.
• The mmxcp command has been updated:
– The enable option:
- A new parameter, --hardlinks, has been added that executes an additional pass through the
source files, searching for and copying hardlinked files as a single batch.
- Two new attributes for the copy-attrs parameter, appendonly and immutable, have been
added, which copy the appendonly and immutable attributes, if present.
– The verify option:
- Two new attributes for the check option, appendonly and immutable, have been added that
compare the appendonly and immutable attributes, if present.
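As an illustration of the GPFSBufMgr sensor registration step that is mentioned in this list, a minimal sketch follows. The sensor file path and the period value are assumptions; verify them against the mmperfmon documentation for your release. First, create a sensor definition file, for example /tmp/GPFSBufMgr.cfg, with contents such as:

sensors = {
    name = "GPFSBufMgr"
    period = 10
}

Then register the sensor with the performance monitoring configuration:

mmperfmon config add --sensors /tmp/GPFSBufMgr.cfg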
Table 3. System health monitoring options that are available in IBM Storage Scale GUI
Home
Provides overall system health of the IBM Storage Scale system.
System overview widget in the Monitoring > Dashboard page
Displays the number of events that are reported against each component.
System health events widget in the Monitoring > Dashboard page
Provides an overview of the events that are reported in the system.
Timeline widget in the Monitoring > Dashboard page
Displays the events that are reported in a particular timeframe on the selected performance chart.
Filesets with the largest growth rate last week widget in the Monitoring > Dashboard page
Displays the filesets with the highest growth rate in the last week.
File system capacity by fileset widget in the Monitoring > Dashboard page
Displays the capacity reported per fileset in a file system. The per-fileset capacity data requires quota enablement at the file system level.
Monitoring > Events
Lists the events that are reported in the system. You can monitor and troubleshoot errors on your system from the Events page.
Monitoring > Tips
Lists the tips that are reported in the system and allows users to hide or show tips. The tip events give recommendations to help you avoid certain issues that might occur in the future.
Monitoring > Thresholds
Lists the events that are raised when certain thresholds are reached for the data that is collected through performance monitoring sensors. For more information, see “Monitoring thresholds by using GUI” on page 7.
Monitoring > Event Notifications
Enables you to configure event notifications to notify users about significant event changes that occur in the system.
Nodes
Lists the events that are reported at the node level.
Files > File Systems
Lists the events that are reported at the file system level.
Note: The alerts and tips icons on the IBM Storage Scale GUI header display the number of tips and
alerts that are received. They specify the number and age of the events that are triggered. The notifications
disappear when the alert or tip is resolved.
Note: A separate event type with severity "Tip" is also available. Tips are recommendations that are
given to help you avoid certain issues that might occur in the future. The tip events are monitored
separately in the Monitoring > Tips page of the GUI.
Resolving Event
Some issues can be resolved manually. To resolve events created for such issues, select the event and
then click the Resolve Event option that is available under the Actions menu. On selecting the option,
the mmhealth event resolve command is run to resolve the specific event. You can also right-click
an event and select the Resolve Event option from the drop-down menu that appears. On completion of the
task, the status appears in the task window. The complete event thread can be viewed in the detailed
view, which you can access by using the View Details option.
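For reference, a minimal sketch of the operation that the GUI runs on your behalf; the event name is a placeholder, and some events also take an entity identifier as an additional argument:

mmhealth event resolve <event_name>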
Traps for core IBM Storage Scale and their trap objects are not included in the SNMP notifications
that are configured through the IBM Storage Scale management GUI. For more information on SNMP
traps from core IBM Storage Scale, see Chapter 10, “GPFS SNMP support,” on page 211.
The following example shows the SNMP event notification that is sent for an SNMP test message:
SNMP MIBs
The SNMP Management Information Base (MIB) is a collection of definitions that define the properties of
the managed objects.
The IBM Storage Scale GUI MIB OID range starts with 1.3.6.1.4.1.2.6.212.10. The OID
range 1.3.6.1.4.1.2.6.212.10.0.1 denotes IBM Storage Scale GUI event notification (trap) and
1.3.6.1.4.1.2.6.212.10.1.x denotes IBM Storage Scale GUI event notification parameters (objects).
Defining thresholds
Use the Create Thresholds option to define user-defined thresholds or to modify the predefined
thresholds. You can use the Use as Template option that is available in the Actions menu to use an
already defined threshold as the template to create a threshold. You can specify the following details in a
threshold rule:
• Metric category: Lists all performance monitoring sensors that are enabled in the system and
thresholds that are derived by performing certain calculations on certain performance metrics. These
derived thresholds are referred to as measurements. The measurements category provides the flexibility to
edit certain predefined threshold rules. The following measurements are available for selection:
DataPool_capUtil
Datapool capacity utilization, which is calculated as:
(sum(gpfs_pool_total_dataKB)-sum(gpfs_pool_free_dataKB))/
sum(gpfs_pool_total_dataKB)
DiskIoLatency_read
Average time in milliseconds spent for a read operation on the physical disk. Calculated as:
disk_read_time/disk_read_ios
DiskIoLatency_write
Average time in milliseconds spent for a write operation on the physical disk. Calculated as:
disk_write_time/disk_write_ios
Fileset_inode
Inode capacity utilization at the fileset level. This is calculated as:
(sum(gpfs_fset_allocInodes)-sum(gpfs_fset_freeInodes))/
sum(gpfs_fset_maxInodes)
FsLatency_diskWaitRd
File system latency for the read operations. Average disk wait time per read operation on the IBM
Storage Scale client.
sum(gpfs_fs_tot_disk_wait_rd)/sum(gpfs_fs_read_ops)
FsLatency_diskWaitWr
File system latency for the write operations. Average disk wait time per write operation on the IBM
Storage Scale client.
sum(gpfs_fs_tot_disk_wait_wr)/sum(gpfs_fs_write_ops)
MemoryAvailable_percent
Estimated available memory percentage. Calculated as:
– For the nodes that have less than 40 GB total memory allocation:
(mem_memfree+mem_buffers+mem_cached)/mem_memtotal
– For the nodes that have equal to or greater than 40 GB memory allocation:
(mem_memfree+mem_buffers+mem_cached)/40000000
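As a worked example with hypothetical numbers, consider the DataPool_capUtil measurement for a data pool with gpfs_pool_total_dataKB = 16777216 (16 GiB) and gpfs_pool_free_dataKB = 4194304 (4 GiB):

(16777216 - 4194304) / 16777216 = 0.75

The pool is therefore at 75% capacity utilization, which is still below the default 80% warning level that the DataCapUtil_Rule threshold uses.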
Filter by (default: Cluster)
The values are filtered at the cluster level.
Downsampling (default: None)
Specifies how the tested value is computed from all the available samples in the selected monitoring
interval, if the monitoring interval is greater than the sensor period:
• None: The values are averaged.
• Sum: The sum of all values is computed.
• Minimum: The minimum value is selected.
• Maximum: The maximum value is selected.
Sensitivity (default: 24 hours)
The threshold value is monitored once a day.
Direction (default: Low)
When the monitored value drops below the threshold limit, the system raises an event.
Prerequisites
The following criteria must be met to use the health monitoring function on your GPFS cluster:
• Only Linux and AIX nodes are supported.
• All operating systems that are running IBM Storage Scale 5.1.x, including AIX nodes, must have Python
3.6 or later installed.
• All operating systems that are running IBM Storage Scale 5.0.x, including AIX nodes, must have Python
2.7 installed.
• CCR must be enabled.
• The cluster must be able to use the mmhealth cluster show command.
Known limitations
The mmhealth command has the following limitations:
• Only GPFS monitoring is supported on AIX.
• The mmhealth command does not fully monitor Omni-Path connections.
Related concepts
“Monitoring system health by using IBM Storage Scale GUI” on page 1
The IBM Storage Scale system provides background monitoring capabilities to check the health of a
cluster and each node of the cluster, including all the services that are hosted on a node. You can view
the system health states or corresponding events for the selected health state on the individual pages,
widgets or panels of the IBM Storage Scale GUI. You can also view system health details by issuing the
mmhealth command options like mmhealth cluster show, mmhealth node show, or other similar
options.
General
1. CALLHOME
GNR
1. NVMe
• Node role: Node must be either an Elastic Storage Server (ESS) node or an ECE node that is
connected to an NVMe device.
• Task: Monitors the health state of the NVMe devices.
Interface
1. AFM
• Node role: The AFM monitoring service is active if the node is a gateway node.
Note: Users can now create and raise custom events. For more information, see “Creating, raising, and
finding custom defined events” on page 19.
For a list of all the available events, see “Events” on page 559.
Important: The script is started synchronously within the monitoring cycle, therefore it must be
lightweight and return a value quickly. The recommended runtime is less than 1 second. Long running
scripts are detected, logged and killed. The script has a hard timeout of 60 seconds.
{
    "event_name_1":{
        "cause":"",
        "user_action":"",
        "scope":"NODE",
        "code":"cu_xyz",
        "description":"",
        "event_type":"INFO",
        "message":"",
        "severity": ["INFO" | "WARNING" | "ERROR"]
    },
    "event_name_2":{
        […]
    }
}
2. Restart the health daemon by using the systemctl restart mmsysmon.service command.
The daemon does not load the custom.json file if any event codes are duplicated. The
daemon status can be checked by using the systemctl status mmsysmon.service command.
3. Run the mmhealth event show <event_name> command, where <event_name> is the name of
the custom event.
The system gives output similar to the following:
If the custom.json file was loaded successfully, the command returns the event's information. If the
custom.json file was not loaded successfully, an error message is displayed stating that this event is
not known to the system.
4. Repeat steps 1-3 on all nodes.
5. Restart the GUI node using the systemctl restart gpfsgui.service command to make the
GUI aware of the new events.
6. Run the following command to raise the custom event:
Note: You can raise a new custom event only after you restart the gpfsgui and mmsysmon daemons.
The <arguments> value needs to be a comma-separated list enclosed in double quotation marks, for
example, "arg1,arg2,…,argN".
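A minimal sketch of the raise step, assuming the mmsysmonc custom-event interface; the event name and arguments are placeholders, and the exact syntax should be verified in the command reference for your release:

mmsysmonc event custom event_name_1 "arg1,arg2"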
You can use the mmhealth node eventlog command to display the log of when an event was
raised. You can also configure emails to notify users when a custom event is raised. For more
information, see “Event notifications” on page 4.
Predefined thresholds
In a cluster, the following three types of thresholds are predefined and enabled automatically:
• Thresholds monitoring the file system capacity usage
• Thresholds monitoring the memory usage
• Thresholds monitoring the number of SMB connections
MemFree_Rule
The MemFree_Rule is a predefined threshold rule that monitors the free memory usage. The
MemFree_Rule rule observes the memory-free usage on each cluster node and prevents the device
from becoming unresponsive when memory is no longer available.
The memory-free usage rule is evaluated for each node in the cluster. The evaluation status is included in
the node health status of each particular node. For the memory usage rule, the warn level is set to 100 MB,
and the error level to 50 MB.
By default, MemFree_Rule evaluates the estimated available memory in relation to the total
memory allocation. For more information, see the MemoryAvailable_percent measurement definition
in the mmhealth command section in the IBM Storage Scale: Command and Programming Reference Guide.
For the new MemFree_Rule, only a WARNING threshold level is defined. The node is tagged with a
WARNING status if the Memfree_util value falls below 5%.
For nodes that have 40 GB or more of total memory allocation, the available memory
percentage is evaluated against a fixed value of 40 GB. This evaluation prevents nodes that have more
than 2 GB (5% of 40 GB) of free memory from sending warning messages.
Note: For IBM Storage Scale 5.0.4, the default MemFree_Rule is replaced automatically. The customer-
created rules remain unchanged.
AFMInQueue_Rule
The AFMInQueue_Rule is a predefined threshold rule that monitors the AFM gateway in-queue memory
usage. The AFMInQueue_Rule value must be set to 40-50% of the available memory on the gateway
node, which is considered to be a dedicated gateway node. If the value of the AFMInQueue_Rule rule is
not defined, then its default value is set to 8 GiB.
The warning level of the AFMInQueue_Rule memory usage rule is set at 80% of the assigned memory, and
the error level is set at 90% of the assigned memory. When either of these levels is reached or exceeded,
an mmhealth event is raised. The mmhealth event can be viewed in the IBM Storage Scale GUI or on the
CLI by using the mmhealth command.
If the mmhealth events are raised, then a user can take the following steps to resolve the issue:
User-defined thresholds
You can create individual thresholds for all metrics that are collected through the performance monitoring
sensors. You can use the mmhealth thresholds add command to create a new threshold rule.
If multiple thresholds rules have overlapping entities for the same metrics, then only one of the
concurrent rules is made actively eligible. All rules get a priority rank number. The highest possible
rank number is one. This rank is based on a metric's maximum number of filtering levels and the filter
granularity that is specified in the rule. As a result, a rule that monitors a specific entity or a set of entities
becomes high priority. This high-priority rule performs entity thresholds evaluation and status update for
a particular entity or a set of entities. This implies that a less specific rule, like the one that is valid for
all entities, is disabled for this particular entity or set of entities. For example, a threshold rule that is
applicable to a single file system takes precedence over a rule that is applicable to several or all the file
systems. For more information, see “Use case 4: Create threshold rules for specific filesets” on page 36.
active_thresholds_monitor: g5160-12d.localnet.com
3. To view the health status of all the nodes, issue the command:
4. To view the detailed health status of the component and its sub-component, issue the command:
5. To view the health status of only unhealthy components, issue the command:
6. To view the health status of sub-components of a node's component, issue the command:
7. To view the eventlog history of the node for the last hour, issue the command:
8. To view the eventlog history of the node for the last hour in verbose mode, issue the command:
9. To view the detailed description of an event, issue the mmhealth event show command. This is an
example for quorum_down event:
2016-09-27 11:31:52.520002 CEST move_cesip_from INFO Address 192.168.3.27 was moved from this node to node 3
2016-09-27 11:32:40.576867 CEST nfs_dbus_ok INFO NFS check via DBus successful
2016-09-27 11:33:36.483188 CEST pmsensors_down ERROR pmsensors service should be started and is stopped
2016-09-27 11:34:06.188747 CEST pmsensors_up INFO pmsensors service as expected, state is started
2016-09-27 11:31:52.520002 CEST cesnetwork move_cesip_from 999244 INFO Address 192.168.3.27 was moved from this node to node 3
2016-09-27 11:32:40.576867 CEST nfs nfs_dbus_ok 999239 INFO NFS check via DBus successful
2016-09-27 11:33:36.483188 CEST perfmon pmsensors_down 999342 ERROR pmsensors service should be started and is stopped
2016-09-27 11:34:06.188747 CEST perfmon pmsensors_up 999341 INFO pmsensors service as expected, state is started
10. To view the detailed description of the cluster, issue the command:
Note: The cluster must have the minimum release level as 4.2.2.0 or higher to use mmhealth
cluster show command. Also, this command does not support Windows operating system.
11. To view more information of the cluster health status, issue the command:
12. To view the state of the file system, issue the command:
mmhealth node show filesystem -v
Node name: ibmnode1.ibm.com
Component Status Status Change Reasons
--------------------------------------------------------------------------------------------------------
FILESYSTEM HEALTHY 2019-01-30 14:32:24 fs_maintenance_mode(gpfs0), unmounted_fs_check(gpfs0)
gpfs0 SUSPENDED 2019-01-30 14:32:22 fs_maintenance_mode(gpfs0), unmounted_fs_check(gpfs0)
objfs HEALTHY 2019-01-30 14:32:22 -
Use case 1: Create a threshold rule and use the mmhealth command to observe the change in the HEALTH status
This section describes the threshold use case to create a threshold rule and use the mmhealth
commands to observe the change in the HEALTH status.
1. To monitor the memory_free usage on each node, create a new thresholds rule with the following
settings:
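A minimal sketch of such a rule follows. The error level of 1000000 and the rule name myTest_memfree match the event log output later in this use case; the warning level shown is an illustrative assumption:

mmhealth thresholds add mem_memfree --errorlevel 1000000 --warnlevel 1500000 --name myTest_memfree --groupby node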
2. To view the list of all threshold rules defined for the system, run the following command:
3. To show the THRESHOLD status of the current node, run the following command:
4. To view the event log history of the node, run the following command on each node:
# mmhealth node eventlog
2017-03-17 11:52:33.063550 CET thresholds_error ERROR The value of mem_memfree for the component(s)
myTest_memfree/gpfsgui-14.novalocal exceeded
threshold error level 1000000 defined in
myTest_memfree.
5. You can view the actual metric values and compare with the rule boundaries by running the metric
query against the pmcollector node. The following example shows the mem_memfree metric
query command and metric values for each node in the output:
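A sketch of such a query, run against the pmcollector node; the bucket-size and number-of-buckets options shown are illustrative:

mmperfmon query mem_memfree -b 300 -n 1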
6. To view the THRESHOLD status of all the nodes, run the following command:
7. To view the details of the raised event, run the following command:
8. To get an overview about the node that is reporting an unhealthy status, check the event log for this
node by running the following command:
2017-03-16 12:01:21.389392 CET thresholds_error ERROR The value of mem_memfree for the component(s)
myTest_memfree/gpfsgui-13.novalocal exceeded
threshold error level 1000000 defined in
myTest_memfree.
9. To check the last THRESHOLD event update for this node, run the following command:
10. To review the status of all services for this node, run the following command:
Use case 2: Observe the file system capacity usage by using default threshold
rules
This use case demonstrates the use of mmhealth threshold list command for monitoring a file
system capacity event by using default threshold rules.
The file system capacity-related thresholds, such as DataCapUtil_Rule,
MetaDataCapUtil_Rule, and InodeCapUtil_Rule, are not node-specific. These thresholds are
reported on the node that has the active threshold monitor role.
1. Issue the following command to view the node that has active threshold monitor role
and the predefined threshold rules: DataCapUtil_Rule, MetaDataCapUtil_Rule, and
InodeCapUtil_Rule enabled in a cluster.
The preceding command shows output similar to the following:
active_thresholds_monitor: scale-12.vmlocal
2. Use ssh to switch to the node that has the active threshold monitor role.
The preceding command gives output similar to the following:
As you can see in the preceding file system example output, everything looks correct except the
"pool-metadata_high_warn" event.
4. Issue the following command to get the "pool-metadata_high_warn" warning details:
The preceding command shows warning details similar to the following:
Tip: See File system events to get complete list of all the possible file system events.
5. Compare the metadata capacity values reported by MetaDataCapUtil_Rule of the system pool from
localFS file system with mmlspool command output.
The preceding command shows the storage pools in the file system at '/gpfs/localFS', similar to
the following:
Name     Id  BlkSize  Data  Meta  Total Data in (KB)  Free Data in (KB)  Total Meta in (KB)  Free Meta in (KB)
system    0  4 MB     yes   yes   16777216            13320192 ( 79%)    16777216            2515582 ( 15%)
In the preceding output, you can see that the system pool has only 15% available space for
metadata.
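To check the event against the raw numbers: the system pool reports 2515582 KB of free metadata out of 16777216 KB total, which is about 15% free, so MetaDataPool_capUtil is about 85%. This value is above the 80% warning level but below the 90% error level of MetaDataCapUtil_Rule, which is consistent with the "pool-metadata_high_warn" event.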
Use case 3: Observe the health status changes for a particular component
based on the specified threshold rules
This use case shows the usage of the mmhealth command to observe the health status changes for a
particular node based on the specified threshold rules.
Run the following command to view the threshold rules that are predefined and enabled automatically in
a cluster:
The default MemFree_Rule rule monitors the estimated available memory in relation to the total memory
allocation on all cluster nodes. A WARNING event is sent for a node if the MemoryAvailable_percent
value falls below the 5% warning level, as the following rule listing shows:
mmhealth_thresholds:THRESHOLD_RULE:HEADER:version:reserved:MemFree_Rule:attribute:value:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:rule_name:MemFree_Rule:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:frequency:300:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:tags:thresholds:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:user_action_warn:The estimated available memory is less than 5%, calculated to the total RAM or 40
GB, whichever is lower.:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:user_action_error::
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:priority:2:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:downsamplOp:min:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:type:measurement:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:metric:MemoryAvailable_percent:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:metricOp:noOperation:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:bucket_size:1:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:computation:value:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:duration:n/a:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:filterBy::
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:groupBy:node:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:error:None:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:warn:5.0:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:direction:low:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:hysteresis:5.0:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:sensitivity:300-min:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:state:active:
Note: The MemFree_Rule rule has the same evaluation priority for all nodes.
Run the following command on a node to view the health state of all the threshold rules that are defined
for that node:
In a production environment, in certain cases, the memory availability observation settings need to be
defined for a particular host separately. Follow these steps to set the memory availability for a particular
node:
1. Run the following command to create a new rule, node11_mem_available, to set the
MemoryAvailable_percent threshold value for the node RHEL77-11.novalocal:
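A sketch of the rule creation; the warning level (50.0), error level (5.0), filter, and rule name are taken from the rule listing shown later in this step:

mmhealth thresholds add MemoryAvailable_percent --warnlevel 50.0 --errorlevel 5.0 --filterby node=rhel77-11.novalocal --name node11_mem_available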
2. Run the following command to view all the defined rules on a cluster:
Note:
The node11_mem_available rule has the priority one for the RHEL77-11.novalocal node:
[root@rhel77-11 ~]# mmhealth thresholds list -v -Y | grep node11_mem_available
mmhealth_thresholds:THRESHOLD_RULE:HEADER:version:reserved:node11_mem_available:attribute:value:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:rule_name:node11_mem_available:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:frequency:300:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:tags:thresholds:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:user_action_warn:None:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:user_action_error:None:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:priority:1:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:downsamplOp:None:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:type:measurement:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:metric:MemoryAvailable_percent:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:metricOp:noOperation:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:bucket_size:300:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:computation:None:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:duration:None:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:filterBy:node=rhel77-11.novalocal:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:groupBy:node:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:error:5.0:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:warn:50.0:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:direction:None:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:hysteresis:0.0:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:sensitivity:300:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:state:active:
All the MemFree_Rule events are removed for RHEL77-11.novalocal because the
node11_mem_available rule has a higher priority for this node:
[root@rhel77-11 ~]# mmhealth node show threshold -v
Because the warning boundary of the node11_mem_available rule is higher than that of the
MemFree_Rule rule, the WARNING event might appear sooner than before for this node.
[root@rhel77-11 ~]# mmhealth node show threshold -v
You can also review the event history by viewing the whole event log as shown:
[root@rhel77-11 ~]# mmhealth node eventlog
Node name: RHEL77-11.novalocal
Timestamp Event Name Severity Details
2020-04-27 11:59:06.532239 CEST monitor_started INFO The IBM Storage Scale monitoring service has been started
2020-04-27 11:59:07.410614 CEST service_running INFO The service clusterstate is running on node RHEL77-11.novalocal
2020-04-27 11:59:07.784565 CEST service_running INFO The service network is running on node RHEL77-11.novalocal
2020-04-27 11:59:09.965934 CEST gpfs_down ERROR The Storage Scale service process not running on this node.
Normal operation cannot be done
2020-04-27 11:59:10.102891 CEST quorum_down ERROR The node is not able to reach enough quorum nodes/disks to work properly.
By default, the following rules are defined and enabled in the cluster:
[root@gpfsgui-11 ~]# mmhealth thresholds list
### Threshold Rules ###
rule_name             metric                error  warn    direction  filterBy  groupBy                                             sensitivity
------------------------------------------------------------------------------------------------------------------------------------------------
InodeCapUtil_Rule     Fileset_inode         90     80      None                 gpfs_cluster_name,gpfs_fs_name,gpfs_fset_name       300
DataCapUtil_Rule      DataPool_capUtil      90.0   80.0    high                 gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name   300
MemFree_Rule          mem_memfree           50000  100000  low                  node                                                300
MetaDataCapUtil_Rule  MetaDataPool_capUtil  90.0   80.0    high                 gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name   300
1. Create new inode capacity usage rules for the specific filesets.
a. To create a threshold rule for all filesets in an individual file system, use the following command:
b. To create a threshold rule for an individual fileset, use the following command:
Note: In this case, for the nfs_shareFILESET fileset, you specify both the file system name and
the fileset name in the filter.
The mmhealth thresholds add command gives an output similar to the following:
2. Run the mmhealth thresholds list command to list the individual rules' priorities. In this
example, the rule_SingleFset_inFS rule has the highest priority for the nfs_shareFILESET
fileset. The rule_ForAllFsets_inFS rule has the highest priority for the other filesets that belong
to the nfs_shareFS file system, and the InodeCapUtil_Rule rule is valid for all the remaining
filesets.
active_thresholds_monitor: gpfsgui-22.novalocal
### Threshold Rules ###
rule_name             metric                error  warn    direction  filterBy  groupBy                                             sensitivity
------------------------------------------------------------------------------------------------------------------------------------------------
InodeCapUtil_Rule     Fileset_inode         90.0   80.0    high                 gpfs_cluster_name,gpfs_fs_name,gpfs_fset_name       300
DataCapUtil_Rule      DataPool_capUtil      90.0   80.0    high                 gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name   300
MemFree_Rule          mem_memfree           50000  100000  low                  node                                                300
SMBConnPerNode_Rule   connect_count         3000   None    high                 node                                                300
SMBConnTotal_Rule     connect_count         20000  None    high                                                                     300
MetaDataCapUtil_Rule  MetaDataPool_capUtil  90.0   80.0    high                 gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name   300
The information about the ACTIVE PERFORMANCE MONITOR node is also included in the THRESHOLD
service health state.
The health status of the active_threshold_monitor for the nodes that have the ACTIVE
PERFORMANCE MONITOR role is shown as a subprocess of the THRESHOLD service.
There are no active error events for the component THRESHOLD on this node
(gpfsgui-21.novalocal).
There are no active error events for the component THRESHOLD on this node
(gpfsgui-22.novalocal).
There are no active error events for the component THRESHOLD on this node
(gpfsgui-23.novalocal).
There are no active error events for the component THRESHOLD on this node
(gpfsgui-24.novalocal).
If the ACTIVE PERFORMANCE MONITOR node loses the connection or is unresponsive, another
pmcollector node takes over the role of the ACTIVE PERFORMANCE MONITOR node. After a new
pmcollector takes over the ACTIVE PERFORMANCE MONITOR role, the status of all the cluster-wide
thresholds is also reported by the new ACTIVE PERFORMANCE MONITOR node.
active_thresholds_monitor: gpfsgui-21.novalocal
### Threshold Rules ###
rule_name             metric                error  warn    direction  filterBy  groupBy                                             sensitivity
------------------------------------------------------------------------------------------------------------------------------------------------
InodeCapUtil_Rule     Fileset_inode         90.0   80.0    high                 gpfs_cluster_name,gpfs_fs_name,gpfs_fset_name       300
DataCapUtil_Rule      DataPool_capUtil      90.0   80.0    high                 gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name   300
MemFree_Rule          mem_memfree           50000  100000  low                  node                                                300
SMBConnPerNode_Rule   connect_count         3000   None    high                 node                                                300
SMBConnTotal_Rule     connect_count         20000  None    high                                                                     300
MetaDataCapUtil_Rule  MetaDataPool_capUtil  90.0   80.0    high                 gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name   300
There are no active error events for the component THRESHOLD on this node (gpfsgui-21.novalocal).
There are no active error events for the component THRESHOLD on this node (gpfsgui-22.novalocal).
There are no active error events for the component THRESHOLD on this node (gpfsgui-23.novalocal).
There are no active error events for the component THRESHOLD on this node (gpfsgui-24.novalocal).
There are no active error events for the component THRESHOLD on this node (gpfsgui-25.novalocal).
The ACTIVE PERFORMANCE MONITOR switchover triggers a new event entry in the system health event log:
active_thresholds_monitor: fscc-p8-23-c.mainz.de.ibm.com
### Threshold Rules ###
rule_name                metric                   error  warn  direction  filterBy  groupBy                                             sensitivity
-----------------------------------------------------------------------------------------------------------------------------------------------------
InodeCapUtil_Rule        Fileset_inode            90.0   80.0  high                 gpfs_cluster_name,gpfs_fs_name,gpfs_fset_name       300
DataCapUtil_Rule         DataPool_capUtil         90.0   80.0  high                 gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name   300
MemFree_Rule             MemoryAvailable_percent  None   5.0   low                  node                                                300-min
diskIOreadresponseTime   DiskIoLatency_read       250    100   None                 node,diskdev_name                                   300
SMBConnPerNode_Rule      connect_count            3000   None  high                 node                                                300
diskIOwriteresponseTime  DiskIoLatency_write      250    100   None                 node,diskdev_name                                   300
SMBConnTotal_Rule        connect_count            20000  None  high                                                                     300
MetaDataCapUtil_Rule     MetaDataPool_capUtil     90.0   80.0  high                 gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name   300
All threshold events that are raised until now can also be reviewed by running the following command:
...
2019-08-30 12:40:49.102217 CEST thresh_monitor_set_active INFO The thresholds monitoring process is
running in ACTIVE state on the local node
2019-08-30 12:41:04.092083 CEST thresholds_new_rule INFO Rule diskIOreadresponseTime was added
2019-08-30 12:41:04.127695 CEST thresholds_new_rule INFO Rule SMBConnTotal_Rule was added
2019-08-30 12:41:04.147223 CEST thresholds_new_rule INFO Rule diskIOwriteresponseTime was added
2019-08-30 12:41:19.117875 CEST thresholds_new_rule INFO Rule MemFree_Rule was added
2019-08-30 13:16:04.804887 CEST thresholds_normal INFO The value of DiskIoLatency_read defined in
diskIOreadresponseTime for component
diskIOreadresponseTime/fscc-p8-23-c/sda1
reached a normal level.
2019-08-30 13:16:04.831206 CEST thresholds_normal INFO The value of DiskIoLatency_read defined in
diskIOreadresponseTime for component
diskIOreadresponseTime/fscc-p8-23-c/sda2
reached a normal level.
2019-08-30 13:21:05.203115 CEST thresholds_normal INFO The value of DiskIoLatency_read defined in
diskIOreadresponseTime for component
diskIOreadresponseTime/fscc-p8-23-c/sdc
reached a normal level.
2019-08-30 13:21:05.227137 CEST thresholds_normal INFO The value of DiskIoLatency_read defined in
diskIOreadresponseTime for component
diskIOreadresponseTime/fscc-p8-23-c/sdd
reached a normal level.
2019-08-30 13:21:05.242787 CEST thresholds_normal INFO The value of DiskIoLatency_read defined in
diskIOreadresponseTime for component
diskIOreadresponseTime/fscc-p8-23-c/sde
reached a normal level.
2019-08-30 13:41:06.809589 CEST thresholds_removed INFO The value of DiskIoLatency_read for the component(s)
diskIOreadresponseTime/fscc-p8-23-c/sda1
defined in diskIOreadresponseTime was removed.
2019-08-30 13:41:06.902566 CEST thresholds_removed INFO The value of DiskIoLatency_read for the component(s)
diskIOreadresponseTime/fscc-p8-23-c/sda2
defined in diskIOreadresponseTime was removed.
2019-08-30 15:24:43.224013 CEST thresholds_warn WARNING The value of MemoryAvailable_percent for the
component(s)
MemFree_Rule/fscc-p8-23-c exceeded threshold warning
level
6.0 defined in MemFree_Rule.
2019-08-30 15:24:58.243273 CEST thresholds_normal INFO The value of DiskIoLatency_write defined in
diskIOwriteresponseTime for component
diskIOwriteresponseTime/fscc-p8-23-c/sda3
reached a normal level.
2019-08-30 15:24:58.289469 CEST thresholds_normal INFO The value of DiskIoLatency_write defined in
diskIOwriteresponseTime for component
diskIOwriteresponseTime/fscc-p8-23-c/sda
reached a normal level.
2019-08-30 15:29:43.648830 CEST thresholds_normal INFO The value of MemoryAvailable_percent defined
in MemFree_Rule for component MemFree_Rule/fscc-
p8-23-c
reached a normal level.
...
You can view the mmsysmonitor log located in the /var/adm/ras directory for more specific details
about the events that are raised.
If the mmsysmonitor log level is set to DEBUG, and the buffering option for the debug messages is turned
off, the log file includes all the messages about the threshold rule evaluation process.
The estimate of the available memory on the node is based on the free buffers and the cached memory
values. These values are returned by the performance monitoring tool, which derives them from
the /proc/meminfo file. The following queries show how the available memory percentage value depends
on the sample interval. The larger the bucket_size, the more the metric values are smoothed.
[root@fscc-p8-23-c ~]# date; echo "get metrics mem_memfree,mem_buffers,mem_cached,
mem_memtotal from node=fscc-p8-23-c last 600 bucket_size 1 " | /opt/IBM/zimon/zc localhost
Fri 30 Aug 15:27:02 CEST 2019
1: fscc-p8-23-c|Memory|mem_memfree
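For reference, the estimate described above can be approximated directly from /proc/meminfo. The
following Python sketch is illustrative only; it assumes that the available memory is taken as
MemFree + Buffers + Cached, as the preceding description suggests, and the monitor's exact internal
formula might differ:
#!/usr/bin/env python3
# Illustrative only: approximate MemoryAvailable_percent from /proc/meminfo,
# assuming available = MemFree + Buffers + Cached (see the text above).
def meminfo():
    values = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            values[key] = int(rest.split()[0])  # sizes are reported in kB
    return values

m = meminfo()
available_kb = m["MemFree"] + m["Buffers"] + m["Cached"]
print(f"MemoryAvailable_percent ~ {100.0 * available_kb / m['MemTotal']:.1f}")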
The suffix -min in the rule sensitivity parameter prevents the averaging of the metric values. Of all the
data points returned by a metrics sensor for a specified sensitivity interval, the smallest value is involved
in the threshold evaluation process. Use the following command to get all parameter settings of the
default MemFree_Rule:
------------------------------------------------------------------------------
rule_name      MemFree_Rule
frequency      300
tags           thresholds
priority
downsamplOp    None
type           measurement
metric         MemoryAvailable_percent
metricOp       noOperation
bucket_size    300
computation    None
duration       None
filterBy
groupBy        node
error          None
warn           6.0
direction      low
hysteresis     0.0
sensitivity    300-min
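The effect of the -min suffix can be pictured with a few hypothetical data points; plain averaging
smooths a short dip away, while min-downsampling preserves it:
# Hypothetical MemoryAvailable_percent samples within one 300-second
# sensitivity interval.
samples = [12.0, 11.5, 4.8, 12.2]
averaged = sum(samples) / len(samples)  # sensitivity "300"     -> 10.125
smallest = min(samples)                 # sensitivity "300-min" ->  4.8
print(averaged, smallest)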
Use case 7: Observe the running state of the defined threshold rules
This section describes the threshold use case to observe the running state of the defined threshold rules.
1. To view the list of all the threshold rules that are defined for the system and their running state, run the
following command:
#![allow(non_camel_case_types)]
#![allow(non_snake_case)]
use serde::Deserialize;
use std::env;

#[derive(Deserialize)]
struct HealthEvent {
    cause: String,
    code: String,
    component: String,
    container_restart: bool,
    container_unready: bool,
    description: String,
    entity_name: String,
    entity_type: String,
    event: String,
    event_type: String,
    ftdc_scope: String,
    identifier: String,
    internalComponent: String,
    is_resolvable: bool,
    message: String,
    node: String,
    priority: u64,
    remedy: Option<String>,
    requireUnique: bool,
    scope: String,
    // The listing is truncated here in this excerpt; the remaining fields
    // below follow the JSON payload that is shown later in this topic.
    severity: String,
    state: String,
    time: String,
    TZONE: String,
    user_action: String,
    full_identifier: String,
}

#[derive(Deserialize)]
struct PostMsg {
    version: String,
    reportingController: String,
    reportingInstance: String,
    events: Vec<HealthEvent>,
}

pub fn main() {
    let args: Vec<String> = env::args().collect();
    if args.len() != 2 {
        println!("Usage: {} IPAddr:Port", args[0]);
        return;
    }
    // The rest of the Rust example (binding an HTTP listener on args[1] and
    // deserializing PostMsg from each request body) is not reproduced in
    // this excerpt.
}
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"os"
	"time"
)

// webhook handles the HTTP POST requests that the mmhealth framework sends.
// The original handler body is not shown in this excerpt; this minimal
// sketch decodes the posted JSON and acknowledges it.
func webhook(w http.ResponseWriter, r *http.Request) {
	var msg map[string]interface{}
	if err := json.NewDecoder(r.Body).Decode(&msg); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	fmt.Printf("%s: received events from %v\n",
		time.Now().Format(time.RFC3339), msg["reportingInstance"])
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/webhook", webhook)
	if len(os.Args) != 2 {
		log.Fatal(fmt.Errorf("usage: %s IPAddr:Port", os.Args[0]))
	}
	fmt.Printf("Starting server listening on %s ...\n", os.Args[1])
	if err := http.ListenAndServe(os.Args[1], nil); err != nil {
		log.Fatal(err)
	}
}
The following example shows how to start the Go program, which is compiled to a binary named webhook.
You must provide the IP address and port number that the HTTP server can use.
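For instance, if the server is to listen on the address and port that are used later in this topic, the
start command might look like the following (the binary name and address are taken from this example):
./webhook 192.0.2.48:9000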
#!/usr/bin/env python3
import argparse
from collections import Counter
import cherrypy
import json

class DataView(object):
    exposed = True

    @cherrypy.tools.accept(media='application/json')
    def POST(self):
        rawData = cherrypy.request.body.read(
            int(cherrypy.request.headers['Content-Length']))
        b = json.loads(rawData)
        eventCounts = Counter([e['severity'] for e in b['events']])
        print(f"{b['reportingInstance']}: {json.dumps(eventCounts)}")
        if dump_json:
            print(json.dumps(b, indent=4))
        return "OK"

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--json', action='store_true')
    parser.add_argument('IP', type=str)
    parser.add_argument('PORT', type=int)
    args = parser.parse_args()
    conf = {
        '/': {'request.dispatch': cherrypy.dispatch.MethodDispatcher()}
    }
    dump_json = args.json
    cherrypy.server.socket_host = args.IP
    # The listing is truncated here in this excerpt; binding the port and
    # starting the CherryPy server completes the example.
    cherrypy.server.socket_port = args.PORT
    cherrypy.quickstart(DataView(), '/', conf)
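The Python server takes the same address information as positional arguments; a possible invocation
(the script name is assumed) is:
python3 webhook.py --json 192.0.2.48 9000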
Configuring a webhook
In an IBM Storage Scale cluster, you can configure the mmhealth framework to interact with the webhook
server by using the mmhealth config webhook add command. For more information, see mmhealth
command section in IBM Storage Scale: Command and Programming Reference Guide.
For example, run the mmhealth command by using the IP address 192.0.2.48 and port 9000 that was
used when starting up the webhook server.
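A possible invocation might look like the following; the URL path /webhook matches the route in the Go
example, and the exact URL form that is required should be verified in the command reference:
mmhealth config webhook add http://192.0.2.48:9000/webhook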
To list all webhooks that are currently configured, use the command:
You can add the -Y option for extended information and to make the output easier to process by other
programs and scripts:
The -Y output shows the UUID value that is associated with this webhook. The UUID value is set in the
HTTP POST header when the mmhealth command posts to the webhook.
When the webhook is configured in the mmhealth framework, the webhook server starts to receive health
events.
Note: Events are sent only when a health event is triggered in IBM Storage Scale.
Important: If the mmhealth webhook framework faces issues with the configured webhook URL, then the
URL is disabled over time, with a default setting of 24 hours. The -Y output shows the enabled or
disabled status of each webhook URL. If a webhook URL gets disabled, rerun the mmhealth config
webhook command to add the URL again.
{
"version": "2",
"reportingController": "spectrum-scale",
"reportingInstance": "gpfs-14.localnet.com",
"events": [
{
"cause": "A file system was unmounted.",
"code": "999305",
"component": "filesystem",
"container_restart": false,
"container_unready": false,
"description": "A file system was unmounted.",
"entity_name": "t123fs",
"entity_type": "FILESYSTEM",
"event": "fs_unmount_info",
"event_type": "INFO_EXTERNAL",
"ftdc_scope": "",
"identifier": "t123fs",
"internalComponent": "",
"is_resolvable": false,
"message": "The file system t123fs was unmounted normal.",
"node": "6",
"priority": 99,
"remedy": null,
"requireUnique": true,
"scope": "NODE",
"severity": "INFO",
"state": "UNKNOWN",
"time": "2023-05-03T16:44:56+02:00",
"TZONE": "CEST",
"user_action": "N/A",
"full_identifier": "317908494475311923/6/filesystem//t123fs"
}
]
}
[clusterstate]
...
# true = allow CSM to override NFS/SMB missing export events on the CES nodes (set to FAILED)
# false = CSM does not override NFS/SMB missing export events on the CES nodes
csmsetmissingexportsfailed = true
3. Close the editor and restart the system health monitor using the following command:
mmsysmoncontrol restart
4. Run this procedure on all the nodes or copy the modified files to all nodes and restart the system
health monitor on all nodes.
Important: During the restart of a node, some internal checks are done by the system health monitor for
a file system's availability if NFS or SMB is enabled. These checks detect if all the required file systems
for the declared exports are available. There might be cases where file systems are not available or are
unmounted at the time of the check. This might be a timing issue, or because some file systems are not
automatically mounted. In such cases, the NFS service is not started and remains in a STOPPED state
even if all relevant file systems are available at a later point in time.
This feature can be configured as follows:
1. Make a backup copy of the current mmsysmonitor.conf file.
2. Open the file with a text editor, and search for the nfs section to set the value of
preventnfsstartuponmissingfs to true or false:
# NFS settings
#
[nfs]
...
# prevent NFS startup after reboot/mmstartup if not all required filesystems for exports are
available
# true = prevent startup / false = allow startup
preventnfsstartuponmissingfs = true
3. Close the editor and restart the system health monitor using the following command:
mmsysmoncontrol restart
4. Run this procedure on all the nodes or copy the modified files to all nodes and restart the system
health monitor on all nodes.
Note: Ensure that in this case, the issue described by this health event is resolved. Otherwise, it would
reappear in the mmhealth command output.
If a user intentionally does not want to solve the reported issue and no longer wants to be warned, then
use the following command to disable this particular check:
Only the HEALTHCHECK events, which are received from the IBM health check service, can be hidden.
These event names start with hc_.
Overview of mmpmon
The mmpmon command allows the system administrator to collect I/O statistics from the point of view of
GPFS servicing application I/O requests.
The collected data can be used for the following purposes:
• Track I/O demand over longer periods of time - weeks or months.
• Record I/O patterns over time (when peak usage occurs, and so forth).
• Determine whether some nodes service more application demand than others.
• Monitor the I/O patterns of a single application, which is spread across multiple nodes.
• Record application I/O request service times.
Figure 1 on page 59 shows the software layers in a typical system with GPFS. The mmpmon command is
built into GPFS.
Related concepts
Overview of mmpmon
The mmpmon command allows the system administrator to collect I/O statistics from the point of view of
GPFS servicing application I/O requests.
Understanding the node list facility
The node list facility can be used to invoke the mmpmon command on multiple nodes and gather data from
other nodes in the cluster. The following table describes the nlist requests for the mmpmon command.
Understanding the request histogram facility
Use the mmpmon rhist command requests to control the request histogram facility.
Understanding the Remote Procedure Call (RPC) facility
The mmpmon requests that start with rpc_s display an aggregation of the execution time taken by RPCs
for a time unit, for example the last 10 seconds. The statistics displayed are the average, minimum, and
maximum of RPC execution time over the last 60 seconds, 60 minutes, 24 hours, and 30 days.
Related tasks
Display I/O statistics per mounted file system
Related tasks
Specifying input to the mmpmon command
The input requests to the mmpmon command allow the system administrator to collect I/O statistics per
mounted file system (fs_io_s) or for the entire node (io_s).
Display I/O statistics for the entire node
The io_s input request to the mmpmon command allows the system administrator to collect I/O statistics
for the entire node.
Reset statistics to zero
The reset request resets the statistics that are displayed with fs_io_s and io_s requests. The reset
request does not reset the histogram data, which is controlled and displayed with rhist requests.
Displaying mmpmon version
fs_io_s
mmpmon -p -i commandFile
_fs_io_s_ _n_ 199.18.1.8 _nn_ node1 _rc_ 0 _t_ 1066660148 _tu_ 407431 _cl_ myCluster.xxx.com
_fs_ gpfs2 _d_ 2 _br_ 6291456 _bw_ 314572800 _oc_ 10 _cc_ 16 _rdc_ 101 _wc_ 300 _dir_ 7 _iu_ 2
_fs_io_s_ _n_ 199.18.1.8 _nn_ node1 _rc_ 0 _t_ 1066660148 _tu_ 407455 _cl_ myCluster.xxx.com
_fs_ gpfs1 _d_ 3 _br_ 5431636 _bw_ 173342800 _oc_ 6 _cc_ 8 _rdc_ 54 _wc_ 156 _dir_ 3 _iu_ 6
The output consists of one string per mounted file system. In this example, there are two mounted file
systems, gpfs1 and gpfs2.
If the -p flag is not specified, then the output is similar to:
When no file systems are mounted, the responses are similar to:
_fs_io_s_ _n_ 199.18.1.8 _nn_ node1 _rc_ 1 _t_ 1066660148 _tu_ 407431 _cl_ - _fs_ -
For information on interpreting mmpmon output results, see “Other information about mmpmon output” on
page 103.
Table 11. Keywords and values for the mmpmon io_s response
Keyword Description
_n_ IP address of the node responding. This is the address by which GPFS knows the node.
_nn_ The hostname that corresponds to the IP address (the _n_ value).
_rc_ Indicates the status of the operation.
_t_ Indicates the current time of day in seconds (absolute seconds since Epoch (1970)).
_tu_ Microseconds part of the current time of day.
_br_ Total number of bytes that are read from both disk and cache.
_bw_ Total number of bytes that are written to both disk and cache.
_oc_ Count of open() call requests that are serviced by GPFS. The open count also
includes creat() call counts.
_cc_ Number of close() call requests that are serviced by GPFS.
_rdc_ Number of application read requests that are serviced by GPFS.
_wc_ Number of application write requests that are serviced by GPFS.
_dir_ Number of readdir() call requests that are serviced by GPFS.
_iu_ Number of inode updates to disk, which includes inodes flushed to disk because of
access time updates.
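Because every -p response line is a record type followed by alternating _keyword_ value tokens, the
output is easy to post-process. The following Python sketch is illustrative, not part of the product;
it converts one such line into a dictionary:
# Sketch: turn an mmpmon -p response line into {keyword: value}.
def parse_mmpmon_line(line):
    fields = line.split()
    record = {"type": fields[0].strip("_")}
    # After the record type, tokens alternate: _keyword_ value.
    for key, value in zip(fields[1::2], fields[2::2]):
        record[key.strip("_")] = value
    return record

sample = ("_io_s_ _n_ 199.18.1.8 _nn_ node1 _rc_ 0 _t_ 1066660148 _tu_ 407431 "
          "_br_ 6291456 _bw_ 314572800 _oc_ 10 _cc_ 16 _rdc_ 101 _wc_ 300 "
          "_dir_ 7 _iu_ 2")
print(parse_mmpmon_line(sample)["bw"])  # prints 314572800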
Related concepts
Overview of mmpmon
The mmpmon command allows the system administrator to collect I/O statistics from the point of view of
GPFS servicing application I/O requests.
Understanding the node list facility
The node list facility can be used to invoke the mmpmon command on multiple nodes and gather data from
other nodes in the cluster. The following table describes the nlist requests for the mmpmon command.
Understanding the request histogram facility
io_s
mmpmon -p -i commandFile
_io_s_ _n_ 199.18.1.8 _nn_ node1 _rc_ 0 _t_ 1066660148 _tu_ 407431 _br_ 6291456
_bw_ 314572800 _oc_ 10 _cc_ 16 _rdc_ 101 _wc_ 300 _dir_ 7 _iu_ 2
Table 13. Keywords and values for the mmpmon nlist add response
Keyword Description
_n_ IP address of the node that is processing the node list. This is the address by which
GPFS knows the node.
_nn_ The hostname that corresponds to the IP address (the _n_ value).
_req_ The action requested. In this case, the value is add.
_rc_ Indicates the status of the operation.
_t_ Indicates the current time of day in seconds (absolute seconds since Epoch (1970)).
_tu_ Microseconds part of the current time of day.
_c_ The number of nodes in the user supplied list.
_ni_ Node name input. A user-supplied node name from the offered list of names.
_nx_ Node name conversion. The preferred GPFS name for the node.
_nxip_ Node name converted IP address. The preferred GPFS IP address for the node.
_did_ The number of node names that were considered valid and processed by the request.
_nlc_ The number of nodes in the node list now (after all processing).
If the nlist add request is issued when no node list exists, it is handled as if it were an nlist new
request.
mmpmon -p -i commandFile
Note in this example that an alias name n2 was used for node2, and an IP address was used for node1.
Notice how the values for _ni_ and _nx_ differ in these cases.
The output is similar to this:
_nlist_ _n_ 199.18.1.2 _nn_ node1 _req_ add _rc_ 0 _t_ 1121955894 _tu_ 261881 _c_ 2
_nlist_ _n_ 199.18.1.2 _nn_ node1 _req_ add _rc_ 0 _t_ 1121955894 _tu_ 261881 _ni_ n2 _nx_
node2 _nxip_ 199.18.1.5
_nlist_ _n_ 199.18.1.2 _nn_ node1 _req_ add _rc_ 0 _t_ 1121955894 _tu_ 261881 _ni_
199.18.1.2 _nx_ node1 _nxip_ 199.18.1.2
_nlist_ _n_ 199.18.1.2 _nn_ node1 _req_ add _rc_ 0 _t_ 1121955894 _tu_ 261881 _did_ 2 _nlc_
2
The requests nlist add and nlist sub behave in a similar way and use the same keyword and
response format.
These requests are rejected if issued while quorum has been lost.
Table 14. Keywords and values for the mmpmon nlist del response
Keyword Description
_n_ IP address of the node responding. This is the address by which GPFS knows the node.
_nn_ The hostname that corresponds to the IP address (the _n_ value).
_req_ The action requested. In this case, the value is del.
_rc_ Indicates the status of the operation.
_t_ Indicates the current time of day in seconds (absolute seconds since Epoch (1970)).
_tu_ Microseconds part of the current time of day.
nlist del
mmpmon -p -i commandFile
_nlist_ _n_ 199.18.1.2 _nn_ node1 _req_ del _rc_ 0 _t_ 1121956817 _tu_ 46050
mmpmon node 199.18.1.2 name node1 nlist del status OK timestamp 1121956908/396381
Table 15. Keywords and values for the mmpmon nlist new response
Keyword Description
_n_ IP address of the node that is responding. This is the address by which GPFS knows
the node.
_nn_ The hostname that corresponds to the IP address (the _n_ value).
_req_ The action requested. In this case, the value is new.
_rc_ Indicates the status of the operation.
_t_ Indicates the current time of day in seconds (absolute seconds since Epoch (1970)).
_tu_ Microseconds part of the current time of day.
Table 16. Keywords and values for the mmpmon nlist s response
Keyword Description
_n_ IP address of the node that is processing the request. This is the address by which
GPFS knows the node.
_nn_ The hostname that corresponds to the IP address (the _n_ value).
_req_ The action requested. In this case, the value is s.
_rc_ Indicates the status of the operation.
_t_ Indicates the current time of day in seconds (absolute seconds since Epoch (1970)).
_tu_ Microseconds part of the current time of day.
_c_ Number of nodes in the node list.
_mbr_ GPFS preferred node name for the list member.
_ip_ GPFS preferred IP address for the list member.
nlist s
mmpmon -p -i commandFile
_nlist_ _n_ 199.18.1.2 _nn_ node1 _req_ s _rc_ 0 _t_ 1121956950 _tu_ 863292 _c_ 2
_nlist_ _n_ 199.18.1.2 _nn_ node1 _req_ s _rc_ 0 _t_ 1121956950 _tu_ 863292 _mbr_ node1
_ip_ 199.18.1.2
_nlist_ _n_ 199.18.1.2 _nn_ node1 _req_ s _rc_ 0 _t_ 1121956950 _tu_ 863292 _mbr_
node2 _ip_ 199.18.1.5
_nlist_ _n_ 199.18.1.2 _nn_ node1 _req_ s _rc_ 0 _t_ 1121957395 _tu_ 910440 _c_ 0
The nlist s request is rejected if issued while quorum has been lost. Only one response line is
presented.
_failed_ _n_ 199.18.1.8 _nn_ node2 _rc_ 668 _t_ 1121957395 _tu_ 910440
mmpmon node 199.18.1.8 name node2: failure status 668 timestamp 1121957395/910440
lost quorum
mmpmon -p -i command_file
_fs_io_s_ _n_ 199.18.1.2 _nn_ node1 _rc_ 0 _t_ 1121974197 _tu_ 278619 _cl_
xxx.localdomain _fs_ gpfs2 _d_ 2 _br_ 0 _bw_ 0 _oc_ 0 _cc_ 0 _rdc_ 0 _wc_ 0
_dir_ 0 _iu_ 0
_fs_io_s_ _n_ 199.18.1.2 _nn_ node1 _rc_ 0 _t_ 1121974197 _tu_ 278619 _cl_
xxx.localdomain _fs_ gpfs1 _d_ 1 _br_ 0 _bw_ 0 _oc_ 0 _cc_ 0 _rdc_ 0 _wc_ 0
_dir_ 0 _iu_ 0
_fs_io_s_ _n_ 199.18.1.5 _nn_ node2 _rc_ 0 _t_ 1121974167 _tu_ 116443 _cl_
cl1.xxx.com _fs_ fs3 _d_ 3 _br_ 0 _bw_ 0 _oc_ 0 _cc_ 0 _rdc_ 0 _wc_ 0 _dir_ 0
_iu_ 3
_fs_io_s_ _n_ 199.18.1.5 _nn_ node2 _rc_ 0 _t_ 1121974167 _tu_ 116443 _cl_
cl1.xxx.com _fs_ fs2 _d_ 2 _br_ 0 _bw_ 0 _oc_ 0 _cc_ 0 _rdc_ 0 _wc_ 0 _dir_ 0
_iu_ 0
_fs_io_s_ _n_ 199.18.1.5 _nn_ node2 _rc_ 0 _t_ 1121974167 _tu_ 116443 _cl_
xxx.localdomain _fs_ gpfs2 _d_ 2 _br_ 0 _bw_ 0 _oc_ 0 _cc_ 0 _rdc_ 0 _wc_ 0
_dir_ 0 _iu_ 0
The responses from a propagated request are the same as if the requests were issued on each node
separately.
If the -p flag is not specified, the output is similar to:
For information on interpreting mmpmon output results, see “Other information about mmpmon output” on
page 103.
_failed_ _n_ 199.18.1.5 _nn_ node2 _fn_ 199.18.1.2 _fnn_ node1 _rc_ 233
_t_ 1121974459 _tu_ 602231
_fs_io_s_ _n_ 199.18.1.5 _nn_ node2 _rc_ 0 _t_ 1121974459 _tu_ 616867 _cl_
cl1.xxx.com _fs_ fs2 _d_ 2 _br_ 0 _bw_ 0 _oc_ 0 _cc_ 0 _rdc_ 0 _wc_ 0 _dir_ 0
_iu_ 0
_fs_io_s_ _n_ 199.18.1.5 _nn_ node2 _rc_ 0 _t_ 1121974459 _tu_ 616867 _cl_
cl1.xxx.com _fs_ fs3 _d_ 3 _br_ 0 _bw_ 0 _oc_ 0 _cc_ 0 _rdc_ 0 _wc_ 0 _dir_ 0
_iu_ 0
_fs_io_s_ _n_ 199.18.1.5 _nn_ node2 _rc_ 0 _t_ 1121974459 _tu_ 616867 _cl_
node1.localdomain _fs_ gpfs2 _d_ 2 _br_ 0 _bw_ 0 _oc_ 0 _cc_ 0 _rdc_ 0 _wc_ 0
_failed_ _n_ 199.18.1.2 _nn_ node1 _rc_ 668 _t_ 1121974459 _tu_ 616867
mmpmon node 199.18.1.2 name node1: failure status 668 timestamp 1121974459/616867
lost quorum
In this scenario there can be a window where node2 is down and node1 has not yet lost quorum. When
quorum loss occurs, the mmpmon command does not attempt to communicate with any nodes in the node
list. The goal with failure handling is to accurately maintain the node list across node failures, so that
when nodes come back up they again contribute to the aggregated responses.
Table 17. Keywords and values for the mmpmon nlist failures
Keyword Description
_n_ IP address of the node processing the node list. This is the address by which GPFS
knows the node.
_nn_ The hostname that corresponds to the IP address (the _n_ value).
_fn_ IP address of the node that is no longer responding to mmpmon requests.
_fnn_ The name by which GPFS knows the node that is no longer responding to mmpmon
requests.
_rc_ Indicates the status of the operation. See “Return codes from mmpmon” on page 104.
_t_ Indicates the current time of day in seconds (absolute seconds since Epoch (1970)).
_tu_ Microseconds part of the current time of day.
Table 18. Keywords and values for the mmpmon reset response
Keyword Description
_n_ IP address of the node that is responding. This is the address by which GPFS knows
the node.
_nn_ The hostname that corresponds to the IP address (the _n_ value).
_rc_ Indicates the status of the operation.
_t_ Indicates the current time of day in seconds (absolute seconds since Epoch (1970)).
_tu_ Microseconds part of the current time of day.
Related concepts
Overview of mmpmon
The mmpmon command allows the system administrator to collect I/O statistics from the point of view of
GPFS servicing application I/O requests.
Understanding the node list facility
The node list facility can be used to invoke the mmpmon command on multiple nodes and gather data from
other nodes in the cluster. The following table describes the nlist requests for the mmpmon command.
Understanding the request histogram facility
Use the mmpmon rhist command requests to control the request histogram facility.
Understanding the Remote Procedure Call (RPC) facility
The mmpmon requests that start with rpc_s display an aggregation of the execution time taken by RPCs
for a time unit, for example the last 10 seconds. The statistics displayed are the average, minimum, and
maximum of RPC execution time over the last 60 seconds, 60 minutes, 24 hours, and 30 days.
Related tasks
Specifying input to the mmpmon command
The input requests to the mmpmon command allow the system administrator to collect I/O statistics per
mounted file system (fs_io_s) or for the entire node (io_s).
Display I/O statistics per mounted file system
The fs_io_s input request to the mmpmon command allows the system administrator to collect I/O
statistics per mounted file system.
Display I/O statistics for the entire node
The io_s input request to the mmpmon command allows the system administrator to collect I/O statistics
for the entire node.
Displaying mmpmon version
The ver request returns a string containing version information.
Related reference
Example mmpmon scenarios and how to analyze and interpret their results
This topic is an illustration of how mmpmon is used to analyze I/O data and draw conclusions based on it.
Other information about mmpmon output
reset
mmpmon -p -i commandFile
_reset_ _n_ 199.18.1.8 _nn_ node1 _rc_ 0 _t_ 1066660148 _tu_ 407431
For information on interpreting mmpmon output results, see “Other information about mmpmon output” on
page 103.
512;1m;4m
0 to 512 bytes
513 to 1048576 bytes
1048577 to 4194304 bytes
4194305 and greater bytes
In this example, a read of size 3 MB would fall in the third size range, and a write of size 20 MB would
fall in the fourth size range.
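The bucketing for the operand 512;1m;4m can be sketched as follows; this is an illustration, not
product code:
# Sketch of the size bucketing described above for the operand "512;1m;4m".
# The numbers are upper limits in bytes; everything above the last limit
# falls into the final "and greater" range.
import bisect

boundaries = [512, 1048576, 4194304]  # 512;1m;4m

def size_range(nbytes):
    return bisect.bisect_left(boundaries, nbytes)  # 0-based range index

print(size_range(3 * 1048576))   # 3 MB read   -> 2 (third size range)
print(size_range(20 * 1048576))  # 20 MB write -> 3 (fourth size range)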
0 to 255 bytes
256 to 511 bytes
512 to 1023 bytes
1024 to 2047 bytes
2048 to 4095 bytes
4096 to 8191 bytes
8192 to 16383 bytes
16384 to 32767 bytes
32768 to 65535 bytes
65536 to 131071 bytes
131072 to 262143 bytes
262144 to 524287 bytes
524288 to 1048575 bytes
1048576 to 2097151 bytes
2097152 to 4194303 bytes
4194304 and greater bytes
The last size range collects all request sizes greater than or equal to 4 MB. The request size ranges can be
changed by using the rhist nr request.
For more information, see “Processing of rhist nr” on page 79.
1.3;4.59;10
In this example, a read that completes in 0.85 milliseconds falls into the first latency range. A write that
completes in 4.56 milliseconds falls into the second latency range, due to the truncation.
A latency range operand of = (equal sign) indicates that the current latency range is not to be changed.
A latency range operand of * (asterisk) indicates that the current latency range is to be changed to the
default latency range. If the latency range operand is missing, * (asterisk) is assumed. A maximum of 15
numbers may be specified, which produces 16 total latency ranges.
The latency times are in milliseconds. The default latency ranges are:
The last latency range collects all latencies greater than or equal to 1000.1 milliseconds. The latency
ranges can be changed by using the rhist nr request.
For more information, see “Processing of rhist nr” on page 79.
Changing the request histogram facility request size and latency ranges
The rhist nr (new range) request allows the user to change the size and latency ranges used in the
request histogram facility.
The use of rhist nr implies an rhist reset. Counters for read and write operations are recorded
separately. If there are no mounted file systems at the time rhist nr is issued, the request still runs.
The size range operand appears first, followed by a blank, and then the latency range operand.
Table 20 on page 79 describes the keywords for the rhist nr response, in the order that they appear
in the output. These keywords are used only when mmpmon is invoked with the -p flag.
Table 20. Keywords and values for the mmpmon rhist nr response
Keyword Description
_n_ IP address of the node responding. This is the address by which GPFS knows the node.
_nn_ The hostname that corresponds to the IP address (the _n_ value).
_req_ The action requested. In this case, the value is nr.
_rc_ Indicates the status of the operation.
_t_ Indicates the current time of day in seconds (absolute seconds since Epoch (1970)).
_tu_ Microseconds part of the current time of day.
An _rc_ value of 16 indicates that the histogram operations lock is busy. Retry the request.
Processing of rhist nr
The rhist nr request changes the request histogram facility request size and latency ranges.
Processing of rhist nr is as follows:
1. The size range and latency range operands are parsed and checked for validity. If they are not valid, an
error is returned and processing terminates.
2. The histogram facility is disabled.
3. The new ranges are created, by defining the following histogram counters:
a. Two sets, one for read and one for write.
b. Within each set, one category for each size range.
c. Within each size range category, one counter for each latency range.
For example, if the user specifies 11 numbers for the size range operand and 2 numbers for the
latency range operand, this produces 12 size ranges, each having 3 latency ranges, because there is
one additional range for the top endpoint. The total number of counters is 72: 36 read counters and
36 write counters.
4. The new ranges are made current.
5. The old ranges are discarded. Any accumulated histogram data is lost.
The histogram facility must be explicitly enabled again using rhist on to begin collecting histogram data
using the new ranges.
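The counter arithmetic from the example in step 3 can be checked quickly; n operand numbers always
produce n + 1 ranges because of the open-ended top range:
# Counter arithmetic for the rhist nr example above.
size_numbers, latency_numbers = 11, 2
size_ranges = size_numbers + 1        # 12 size ranges
latency_ranges = latency_numbers + 1  # 3 latency ranges each
total = 2 * size_ranges * latency_ranges
print(total)  # 72 counters: 36 for reads and 36 for writes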
The mmpmon command does not have the ability to collect data only for read operations, or only for write
operations. The mmpmon command also does not have the ability to specify size or latency ranges that
have different values for read and write operations.
mmpmon -p -i commandFile
_rhist_ _n_ 199.18.2.5 _nn_ node1 _req_ nr 512;1m;4m 1.3;4.5;10 _rc_ 0 _t_ 1078929833 _tu_
765083
In this case, mmpmon has been instructed to keep a total of 32 counters. There are 16 for read and 16 for
write. For the reads, there are four size ranges, each of which has four latency ranges. The same is true for
the writes. They are as follows:
In this example, a read of size 15 MB that completes in 17.8 milliseconds would fall in the last latency
range listed here. When this read completes, the counter for the last latency range is increased by one.
An _rc_ value of 16 indicates that the histogram operations lock is busy. Retry the request.
An example of an unsuccessful response is:
_rhist_ _n_ 199.18.2.5 _nn_ node1 _req_ nr 512;1m;4m 1;4;8;2 _rc_ 22 _t_ 1078929596 _tu_ 161683
mmpmon node 199.18.1.8 name node1 rhist nr 512;1m;4m 1;4;8;2 status 22 range error
In this case, the last value in the latency range, 2, is out of numerical order.
Table 21. Keywords and values for the mmpmon rhist off response
Keyword Description
_n_ IP address of the node responding. This is the address by which GPFS knows the node.
_nn_ The hostname that corresponds to the IP address (the _n_ value).
_req_ The action requested. In this case, the value is off.
_rc_ Indicates the status of the operation.
_t_ Indicates the current time of day in seconds (absolute seconds since Epoch (1970)).
_tu_ Microseconds part of the current time of day.
An _rc_ value of 16 indicates that the histogram operations lock is busy. Retry the request.
rhist off
mmpmon -p -i commandFile
_rhist_ _n_ 199.18.1.8 _nn_ node1 _req_ off _rc_ 0 _t_ 1066938820 _tu_ 5755
An _rc_ value of 16 indicates that the histogram operations lock is busy. Retry the request.
For information on interpreting mmpmon output results, see “Other information about mmpmon output” on
page 103.
Table 22. Keywords and values for the mmpmon rhist on response
Keyword Description
_n_ IP address of the node responding. This is the address by which GPFS knows the node.
_nn_ The hostname that corresponds to the IP address (the _n_ value).
_req_ The action requested. In this case, the value is on.
_rc_ Indicates the status of the operation.
_t_ Indicates the current time of day in seconds (absolute seconds since Epoch (1970)).
_tu_ Microseconds part of the current time of day.
An _rc_ value of 16 indicates that the histogram operations lock is busy. Retry the request.
rhist on
mmpmon -p -i commandFile
_rhist_ _n_ 199.18.1.8 _nn_ node1 _req_ on _rc_ 0 _t_ 1066936484 _tu_ 179346
An _rc_ value of 16 indicates that the histogram operations lock is busy. Retry the request.
For information on interpreting mmpmon output results, see “Other information about mmpmon output” on
page 103.
Table 23. Keywords and values for the mmpmon rhist p response
Keyword Description
_n_ IP address of the node responding. This is the address by which GPFS knows the node.
_nn_ The hostname that corresponds to the IP address (the _n_ value).
_req_ The action requested. In this case, the value is p.
_rc_ Indicates the status of the operation.
_t_ Indicates the current time of day in seconds (absolute seconds since Epoch (1970)).
_tu_ Microseconds part of the current time of day.
_k_ The kind, r or w, (read or write) depending on what the statistics are for.
_R_ Request size range, minimum, and maximum number of bytes.
_L_ Latency range, minimum and maximum, in milliseconds.
The request size ranges are in bytes. The zero-value used for the higher limit of the last size range means
'and higher'. The request size ranges can be changed by using the rhist nr request.
The latency times are in milliseconds. The zero-value used for the higher limit of the last latency range
means 'and higher'. The latency ranges can be changed by using the rhist nr request.
The rhist p request allows an application to query for the entire latency pattern. The application
can then configure itself accordingly. Since latency statistics are reported only for ranges with non-zero
counts, the statistics responses might be sparse. By querying for the pattern, an application can be
certain to learn the complete histogram set. The user may have changed the pattern by using the rhist
nr request. For this reason, an application should query for the pattern and analyze it before requesting
statistics.
If the facility has never been enabled, then the _rc_ field is non-zero. An _rc_ value of 16 indicates that
the histogram operations lock is busy. Retry the request.
If the facility has been previously enabled, then the rhist p request still displays the pattern even when
rhist off is currently in effect.
If there are no mounted file systems at the time rhist p is issued, then the pattern is still displayed.
rhist p
mmpmon -p -i commandFile
The response contains all the latency ranges inside each of the request ranges. The data are separate for
read and write:
_rhist_ _n_ 199.18.1.8 _nn_ node1 _req_ p _rc_ 0 _t_ 1066939007 _tu_ 386241 _k_ r
... data for reads ...
_rhist_ _n_ 199.18.1.8 _nn_ node1 _req_ p _rc_ 0 _t_ 1066939007 _tu_ 386241 _k_ w
... data for writes ...
_end_
_rhist_ _n_ 199.18.1.8 _nn_ node1 _req_ p _rc_ 0 _t_ 1066939007 _tu_ 386241 _k_ r
_R_ 0 255
_L_ 0.0 1.0
_L_ 1.1 10.0
_L_ 10.1 30.0
_L_ 30.1 100.0
_L_ 100.1 200.0
_L_ 200.1 400.0
_L_ 400.1 800.0
_L_ 800.1 1000.0
_L_ 1000.1 0
_R_ 256 511
_L_ 0.0 1.0
_L_ 1.1 10.0
_L_ 10.1 30.0
_L_ 30.1 100.0
_L_ 100.1 200.0
_L_ 200.1 400.0
_L_ 400.1 800.0
_L_ 800.1 1000.0
_L_ 1000.1 0
_R_ 512 1023
_L_ 0.0 1.0
_L_ 1.1 10.0
_L_ 10.1 30.0
_L_ 30.1 100.0
_L_ 100.1 200.0
_L_ 200.1 400.0
_L_ 400.1 800.0
_L_ 800.1 1000.0
_L_ 1000.1 0
...
_R_ 4194304 0
_L_ 0.0 1.0
_L_ 1.1 10.0
_L_ 10.1 30.0
_L_ 30.1 100.0
_L_ 100.1 200.0
_L_ 200.1 400.0
_L_ 400.1 800.0
_L_ 800.1 1000.0
_L_ 1000.1 0
If the facility has never been enabled, then the _rc_ field is non-zero.
_rhist_ _n_ 199.18.1.8 _nn_ node1 _req_ p _rc_ 1 _t_ 1066939007 _tu_ 386241
For information on interpreting the mmpmon command output results, see “Other information about
mmpmon output” on page 103.
Table 24. Keywords and values for the mmpmon rhist reset response
Keyword Description
_n_ IP address of the node responding. This is the address by which GPFS knows the node.
_nn_ The hostname that corresponds to the IP address (the _n_ value).
_req_ The action requested. In this case, the value is reset.
_rc_ Indicates the status of the operation.
_t_ Indicates the current time of day in seconds (absolute seconds since Epoch (1970)).
_tu_ Microseconds part of the current time of day.
If the facility has been previously enabled, then the reset request still resets the statistics even when
rhist off is currently in effect. If there are no mounted file systems at the time rhist reset is
issued, then the statistics are still reset.
An _rc_ value of 16 indicates that the histogram operations lock is busy. Retry the request.
rhist reset
_rhist_ _n_ 199.18.1.8 _nn_ node1 _req_ reset _rc_ 0 _t_ 1066939007 _tu_ 386241
If the facility has never been enabled, then the _rc_ value is non-zero:
_rhist_ _n_ 199.18.1.8 _nn_ node1 _req_ reset _rc_ 1 _t_ 1066939143 _tu_ 148443
For information on interpreting the mmpmon command output results, see “Other information about
mmpmon output” on page 103.
Table 25. Keywords and values for the mmpmon rhist s response
Keyword Description
_n_ IP address of the node responding. This is the address by which GPFS knows the node.
_nn_ The hostname that corresponds to the IP address (the _n_ value).
_req_ The action requested. In this case, the value is s.
_rc_ Indicates the status of the operation.
_t_ Indicates the current time of day in seconds (absolute seconds since Epoch (1970)).
_tu_ Microseconds part of the current time of day.
_k_ The kind, r or w, (read or write) depending on what the statistics are for.
_R_ Request size range, minimum and maximum number of bytes.
_NR_ Number of requests that fell in this size range.
_L_ Latency range, minimum and maximum, in milliseconds.
_NL_ Number of requests that fell in this latency range. The sum of all _NL_ values for a
request size range equals the _NR_ value for that size range.
If the facility has been previously enabled, the rhist s request still displays the statistics even if rhist
off is currently in effect. This allows turning the histogram statistics on and off between known points
and reading them later. If there are no mounted file systems at the time rhist s is issued, then the
statistics are still displayed.
An _rc_ value of 16 indicates that the histogram operations lock is busy. Retry the request.
rhist s
mmpmon -p -i commandFile
_rhist_ _n_ 199.18.2.5 _nn_ node1 _req_ s _rc_ 0 _t_ 1066939007 _tu_ 386241 _k_ r
_R_ 65536 131071 _NR_ 32640
_L_ 0.0 1.0 _NL_ 25684
_L_ 1.1 10.0 _NL_ 4826
_L_ 10.1 30.0 _NL_ 1666
_L_ 30.1 100.0 _NL_ 464
_R_ 262144 524287 _NR_ 8160
_L_ 0.0 1.0 _NL_ 5218
_L_ 1.1 10.0 _NL_ 871
_L_ 10.1 30.0 _NL_ 1863
_L_ 30.1 100.0 _NL_ 208
_R_ 1048576 2097151 _NR_ 2040
_L_ 1.1 10.0 _NL_ 558
_L_ 10.1 30.0 _NL_ 809
_L_ 30.1 100.0 _NL_ 673
_rhist_ _n_ 199.18.2.5 _nn_ node1 _req_ s _rc_ 0 _t_ 1066939007 _tu_ 386241 _k_ w
_R_ 131072 262143 _NR_ 12240
_L_ 0.0 1.0 _NL_ 10022
_L_ 1.1 10.0 _NL_ 1227
_L_ 10.1 30.0 _NL_ 783
_L_ 30.1 100.0 _NL_ 208
_R_ 262144 524287 _NR_ 6120
_L_ 0.0 1.0 _NL_ 4419
_L_ 1.1 10.0 _NL_ 791
_L_ 10.1 30.0 _NL_ 733
_L_ 30.1 100.0 _NL_ 177
_R_ 524288 1048575 _NR_ 3060
_L_ 0.0 1.0 _NL_ 1589
_L_ 1.1 10.0 _NL_ 581
_L_ 10.1 30.0 _NL_ 664
_L_ 30.1 100.0 _NL_ 226
_R_ 2097152 4194303 _NR_ 762
_L_ 1.1 2.0 _NL_ 203
_L_ 10.1 30.0 _NL_ 393
_L_ 30.1 100.0 _NL_ 166
_end_
This small example shows that the reports for read and write may not present the same number of ranges
or even the same ranges. Only those ranges with non-zero counters are represented in the response. This
is true for both the request size ranges and the latency ranges within each request size range.
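Tools that consume rhist s output can use the _NR_/_NL_ relationship from Table 25 as a consistency
check, as in this illustrative Python sketch:
# Sketch: within each _R_ size range of "rhist s" -p output, the _NL_
# latency counts must sum to that range's _NR_ value.
def check_rhist(lines):
    nr, nl_sum = None, 0
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "_R_":
            if nr is not None and nl_sum != nr:
                raise ValueError("latency counts do not add up to _NR_")
            nr, nl_sum = int(fields[fields.index("_NR_") + 1]), 0
        elif fields[0] == "_L_":
            nl_sum += int(fields[fields.index("_NL_") + 1])
    if nr is not None and nl_sum != nr:
        raise ValueError("latency counts do not add up to _NR_")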
If the -p flag is not specified, then the output is similar to:
If the facility has never been enabled, then the _rc_ value is non-zero:
_rhist_ _n_ 199.18.1.8 _nn_ node1 _req_ reset _rc_ 1 _t_ 1066939143 _tu_ 148443
An _rc_ value of 16 indicates that the histogram operations lock is busy. Retry the request.
For information on interpreting the mmpmon command output results, see “Other information about
mmpmon output” on page 103.
The information displayed with rpc_s is similar to what is displayed with the mmdiag --rpc command.
Related concepts
Overview of mmpmon
The mmpmon command allows the system administrator to collect I/O statistics from the point of view of
GPFS servicing application I/O requests.
Understanding the node list facility
The node list facility can be used to invoke the mmpmon command on multiple nodes and gather data from
other nodes in the cluster. The following table describes the nlist requests for the mmpmon command.
Understanding the request histogram facility
Use the mmpmon rhist command requests to control the request histogram facility.
Related tasks
Specifying input to the mmpmon command
Table 27. Keywords and values for the mmpmon rpc_s response
Keyword Description
_req_ Indicates the action requested. The action can be size, node, or message. If no
action is requested, the default is the rpc_s action.
_n_ Indicates the IP address of the node responding. This is the address by which GPFS
knows the node.
_nn_ Indicates the hostname that corresponds to the IP address (the _n_ value).
_rn_ Indicates the IP address of the remote node responding. This is the address by which
GPFS knows the node. The statistics displayed are the averages from _nn_ to this
_rnn_.
_rnn_ Indicates the hostname that corresponds to the remote node IP address (the _rn_
value). The statistics displayed are the averages from _nn_ to this _rnn_.
_rc_ Indicates the status of the operation.
_t_ Indicates the current time of day in seconds (absolute seconds since Epoch (1970)).
_tu_ Indicates the microseconds part of the current time of day.
_rpcObj_ Indicates the beginning of the statistics for _obj_.
_obj_ Indicates the RPC object being displayed.
_nsecs_ Indicates the number of one-second intervals maintained.
_nmins_ Indicates the number of one-minute intervals maintained.
_nhours_ Indicates the number of one-hour intervals maintained.
rpc_s
mmpmon -p -i commandFile
Object: AG_STAT_CHANNEL_WAIT
nsecs: 60
nmins: 60
nhours: 24
ndays: 30
TimeUnit: sec
AverageValue: 0.000
MinValue: 0.000
MaxValue: 0.000
Countvalue: 0
TimeUnit: sec
AverageValue: 0.000
MinValue: 0.000
MaxValue: 0.000
Countvalue: 0
TimeUnit: sec
AverageValue: 0.000
MinValue: 0.000
MaxValue: 0.000
Countvalue: 0
TimeUnit: sec
AverageValue: 0.000
MinValue: 0.000
MaxValue: 0.000
Countvalue: 0
TimeUnit: sec
AverageValue: 0.000
MinValue: 0.000
MaxValue: 0.000
Countvalue: 0
For information on interpreting mmpmon output results, see “Other information about mmpmon output” on
page 103.
Displaying the Remote Procedure Call (RPC) execution time according to the
size of messages
The rpc_s size request returns the cached RPC-related size statistics.
Table 28 on page 91 describes the keywords for the rpc_s size response, in the order that they
appear in the output.
Table 28. Keywords and values for the mmpmon rpc_s size response
Keyword Description
_req_ Indicates the action requested. In this case, the value is rpc_s size.
_n_ Indicates the IP address of the node responding. This is the address by which GPFS
knows the node.
_nn_ Indicates the hostname that corresponds to the IP address (the _n_ value).
_rc_ Indicates the status of the operation.
_t_ Indicates the current time of day in seconds (absolute seconds since Epoch (1970)).
_tu_ Indicates the microseconds part of the current time of day.
_rpcSize_ Indicates the beginning of the statistics for this _size_ group.
_size_ Indicates the size of the messages for which statistics are collected.
_nsecs_ Indicates the number of one-second intervals maintained.
_nmins_ Indicates the number of one-minute intervals maintained.
_nhours_ Indicates the number of one-hour intervals maintained.
rpc_s size
mmpmon -p -i commandFile
_mmpmon::rpc_s_ _req_ size _n_ 192.168.56.167 _nn_ node2 _rc_ 0 _t_ 1388417852 _tu_ 572950
_rpcSize_ _size_ 64 _nsecs_ 60 _nmins_ 60 _nhours_ 24 _ndays_ 30
_stats_ _tmu_ sec _av_ 0.000, _min_ 0.000, _max_ 0.000, _cnt_ 0
_stats_ _tmu_ sec _av_ 0.000, _min_ 0.000, _max_ 0.000, _cnt_ 0
_stats_ _tmu_ sec _av_ 0.000, _min_ 0.000, _max_ 0.000, _cnt_ 0
_stats_ _tmu_ sec _av_ 0.000, _min_ 0.000, _max_ 0.000, _cnt_ 0
_stats_ _tmu_ sec _av_ 0.000, _min_ 0.000, _max_ 0.000, _cnt_ 0
_stats_ _tmu_ sec _av_ 0.000, _min_ 0.000, _max_ 0.000, _cnt_ 0
…...................
…...................
…...................
_rpcSize_ _size_ 256 _nsecs_ 60 _nmins_ 60 _nhours_ 24 _ndays_ 30
_stats_ _tmu_ sec _av_ 0.000, _min_ 0.000, _max_ 0.000, _cnt_ 0
_stats_ _tmu_ sec _av_ 0.000, _min_ 0.000, _max_ 0.000, _cnt_ 0
_stats_ _tmu_ sec _av_ 0.000, _min_ 0.000, _max_ 0.000, _cnt_ 0
_stats_ _tmu_ sec _av_ 0.000, _min_ 0.000, _max_ 0.000, _cnt_ 0
_stats_ _tmu_ sec _av_ 0.000, _min_ 0.000, _max_ 0.000, _cnt_ 0
…..................
…..................
_stats_ _tmu_ min _av_ 0.692, _min_ 0.692, _max_ 0.692, _cnt_ 1
_stats_ _tmu_ min _av_ 0.000, _min_ 0.000, _max_ 0.000, _cnt_ 0
_stats_ _tmu_ min _av_ 0.000, _min_ 0.000, _max_ 0.000, _cnt_ 0
_stats_ _tmu_ min _av_ 0.000, _min_ 0.000, _max_ 0.000, _cnt_ 0
_response_ end
If the -p flag is not specified, the output is similar to the following example:
Bucket size: 64
nsecs: 60
nmins: 60
nhours: 24
ndays: 30
TimeUnit: sec
AverageValue: 0.000
MinValue: 0.000
MaxValue: 0.000
Countvalue: 0
TimeUnit: sec
AverageValue: 0.000
TimeUnit: sec
AverageValue: 0.000
MinValue: 0.000
MaxValue: 0.000
Countvalue: 0
TimeUnit: sec
AverageValue: 0.131
MinValue: 0.131
MaxValue: 0.131
Countvalue: 1
TimeUnit: sec
AverageValue: 0.000
MinValue: 0.000
MaxValue: 0.000
Countvalue: 0
For information on interpreting mmpmon output results, see “Other information about mmpmon output” on
page 103.
Table 29. Keywords and values for the mmpmon ver response
Keyword Description
_n_ IP address of the node responding. This is the address by which GPFS knows the node.
_nn_ The hostname that corresponds to the IP address (the _n_ value).
_v_ The version of mmpmon.
_lv_ The level of mmpmon.
_vt_ The fix level variant of mmpmon.
Related concepts
Overview of mmpmon
The mmpmon command allows the system administrator to collect I/O statistics from the point of view of
GPFS servicing application I/O requests.
Understanding the node list facility
The node list facility can be used to invoke the mmpmon command on multiple nodes and gather data from
other nodes in the cluster. The following table describes the nlist requests for the mmpmon command.
Understanding the request histogram facility
Use the mmpmon rhist command requests to control the request histogram facility.
Understanding the Remote Procedure Call (RPC) facility
The mmpmon requests that start with rpc_s display an aggregation of the execution time taken by RPCs
for a time unit, for example the last 10 seconds. The statistics displayed are the average, minimum, and
maximum of RPC execution time over the last 60 seconds, 60 minutes, 24 hours, and 30 days.
Related tasks
Specifying input to the mmpmon command
The input requests to the mmpmon command allow the system administrator to collect I/O statistics per
mounted file system (fs_io_s) or for the entire node (io_s).
Display I/O statistics per mounted file system
ver
mmpmon -p -i commandFile
For information on interpreting mmpmon output results, see “Other information about mmpmon output” on
page 103.
Example mmpmon scenarios and how to analyze and interpret their results
This topic is an illustration of how mmpmon is used to analyze I/O data and draw conclusions based on it.
The fs_io_s and io_s requests are used to determine a number of GPFS I/O parameters and their
implication for overall performance. The rhist requests are used to produce histogram data about I/O
sizes and latency times for I/O requests. The request source and prefix directive once allow the user of
mmpmon to more finely tune its operation.
Related concepts
Overview of mmpmon
The mmpmon command allows the system administrator to collect I/O statistics from the point of view of
GPFS servicing application I/O requests.
Understanding the node list facility
The node list facility can be used to invoke the mmpmon command on multiple nodes and gather data from
other nodes in the cluster. The following table describes the nlist requests for the mmpmon command.
Understanding the request histogram facility
Use the mmpmon rhist command requests to control the request histogram facility.
Understanding the Remote Procedure Call (RPC) facility
fs_io_s and io_s output - how to aggregate and analyze the results
The fs_io_s and io_s requests can be used to determine a number of GPFS I/O parameters and their
implication for overall performance.
The output from the fs_io_s and io_s requests can be used to determine:
1. The I/O service rate of a node, from the application point of view. The io_s request presents this as a
sum for the entire node, while fs_io_s presents the data per file system. A rate can be approximated
by taking the _br_ (bytes read) or _bw_ (bytes written) values from two successive invocations of
fs_io_s (or io_s) and dividing by the difference of the sums of the individual _t_ and _tu_ values
(seconds and microseconds).
This must be done for a number of samples, with a reasonably small time between samples, in order
to get a rate which is reasonably accurate. Since we are sampling the information at a given interval,
inaccuracy can exist if the I/O load is not smooth over the sampling time.
For example, here is a set of samples taken approximately one second apart, when it was known that
continuous I/O activity was occurring:
_fs_io_s_ _n_ 199.18.1.3 _nn_ node1 _rc_ 0 _t_ 1095862476 _tu_ 634939 _cl_ cluster1.xxx.com
_fs_ gpfs1m _d_ 3 _br_ 0 _bw_ 3737124864 _oc_ 4 _cc_ 3 _rdc_ 0 _wc_ 3570 _dir_ 0 _iu_ 5
_fs_io_s_ _n_ 199.18.1.3 _nn_ node1 _rc_ 0 _t_ 1095862477 _tu_ 645988 _cl_ cluster1.xxx.com
_fs_ gpfs1m _d_ 3 _br_ 0 _bw_ 3869245440 _oc_ 4 _cc_ 3 _rdc_ 0 _wc_ 3696 _dir_ 0 _iu_ 5
_fs_io_s_ _n_ 199.18.1.3 _nn_ node1 _rc_ 0 _t_ 1095862478 _tu_ 647477 _cl_ cluster1.xxx.com
_fs_ gpfs1m _d_ 3 _br_ 0 _bw_ 4120903680 _oc_ 4 _cc_ 3 _rdc_ 0 _wc_ 3936 _dir_ 0 _iu_ 5
_fs_io_s_ _n_ 199.18.1.3 _nn_ node1 _rc_ 0 _t_ 1095862479 _tu_ 649363 _cl_ cluster1.xxx.com
_fs_ gpfs1m _d_ 3 _br_ 0 _bw_ 4309647360 _oc_ 4 _cc_ 3 _rdc_ 0 _wc_ 4116 _dir_ 0 _iu_ 5
_fs_io_s_ _n_ 199.18.1.3 _nn_ node1 _rc_ 0 _t_ 1095862480 _tu_ 650795 _cl_ cluster1.xxx.com
_fs_ gpfs1m _d_ 3 _br_ 0 _bw_ 4542431232 _oc_ 4 _cc_ 3 _rdc_ 0 _wc_ 4338 _dir_ 0 _iu_ 5
_fs_io_s_ _n_ 199.18.1.3 _nn_ node1 _rc_ 0 _t_ 1095862482 _tu_ 654025 _cl_ cluster1.xxx.com
_fs_ gpfs1m _d_ 3 _br_ 0 _bw_ 4963958784 _oc_ 4 _cc_ 3 _rdc_ 0 _wc_ 4740 _dir_ 0 _iu_ 5
_fs_io_s_ _n_ 199.18.1.3 _nn_ node1 _rc_ 0 _t_ 1095862483 _tu_ 655782 _cl_ cluster1.xxx.com
_fs_ gpfs1m _d_ 3 _br_ 0 _bw_ 5177868288 _oc_ 4 _cc_ 3 _rdc_ 0 _wc_ 4944 _dir_ 0 _iu_ 5
_fs_io_s_ _n_ 199.18.1.3 _nn_ node1 _rc_ 0 _t_ 1095862484 _tu_ 657523 _cl_ cluster1.xxx.com
_fs_ gpfs1m _d_ 3 _br_ 0 _bw_ 5391777792 _oc_ 4 _cc_ 3 _rdc_ 0 _wc_ 5148 _dir_ 0 _iu_ 5
_fs_io_s_ _n_ 199.18.1.3 _nn_ node1 _rc_ 0 _t_ 1095862485 _tu_ 665909 _cl_ cluster1.xxx.com
_fs_ gpfs1m _d_ 3 _br_ 0 _bw_ 5599395840 _oc_ 4 _cc_ 3 _rdc_ 0 _wc_ 5346 _dir_ 0 _iu_ 5
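For example, between the first two samples the _bw_ value increases by 3869245440 - 3737124864 = 132120576 bytes, while the elapsed time is (1095862477 - 1095862476) + (645988 - 634939)/1000000 = 1.011049 seconds, which gives a write rate of about 130.7 MB/sec. The following awk script performs this calculation for every adjacent pair of samples: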
BEGIN {
count=0;
prior_t=0;
prior_tu=0;
prior_br=0;
prior_bw=0;
}
{
count++;
t = $9;    # _t_ value (seconds)
tu = $11;  # _tu_ value (microseconds)
br = $19;  # _br_ value (bytes read)
bw = $21;  # _bw_ value (bytes written)
if(count > 1)
{
delta_t = t-prior_t;
delta_tu = tu-prior_tu;
delta_br = br-prior_br;
delta_bw = bw-prior_bw;
dt = delta_t + (delta_tu / 1000000.0);
if(dt > 0) {
rrate = (delta_br / dt) / 1000000.0;  # MB/sec read
wrate = (delta_bw / dt) / 1000000.0;  # MB/sec write
printf("%5.1f MB/sec read %5.1f MB/sec write\n", rrate, wrate);
}
}
prior_t=t;
prior_tu=tu;
prior_br=br;
prior_bw=bw;
}
The script prints the calculated read and write service rates, one line for each adjacent pair of samples.
Since these are discrete samples, there can be variations in the individual results. For example, there
may be other activity on the node or interconnection fabric. I/O size, file system block size, and
buffering also affect results. There can be many reasons why adjacent values differ. This must be taken
into account when building analysis tools that read mmpmon output and interpreting results.
For example, when a file is read for the first time, the data must come from disk. If most or all of the file remains in the GPFS cache, a second read may give quite different (typically much higher) rates.
Considerations such as these need to be taken into account when looking at application I/O service
rates calculated from sampling mmpmon data.
2. Usage patterns, by sampling at set times of the day (perhaps every half hour) and noticing when the
largest changes in I/O volume occur. This does not necessarily give a rate (since there are too few
samples) but it can be used to detect peak usage periods.
3. If some nodes service significantly more I/O volume than others over a given time span.
4. When a parallel application is split across several nodes, and is the only significant activity in the
nodes, how well the I/O activity of the application is distributed.
5. The total I/O demand that applications are placing on the cluster. This is done by obtaining results
from fs_io_s and io_s in aggregate for all nodes in a cluster.
6. Whether the rate data appears to be erratic. Low rates that appear before and after a group of higher rates can be due to the I/O requests starting late (in the leading sampling period) and ending early (in the trailing sampling period). This gives an apparently low rate for those sampling periods.
Zero rates in the middle of an otherwise active period could be caused by reasons such as no I/O requests reaching GPFS during that time period (the application issued none, or requests were satisfied by buffered data at a layer on top of GPFS), the node becoming busy with other work (causing the application to be undispatched), or other reasons.
For information on interpreting the mmpmon command output results, see “Other information about
mmpmon output” on page 103.
Request histogram (rhist) output - how to aggregate and analyze the results
The rhist requests are used to produce histogram data about I/O sizes and latency times for I/O
requests.
The output from the rhist requests can be used to determine:
1. The number of I/O requests in a given size range. The sizes may vary based on operating system, explicit application buffering, and other considerations. This information can be used to help determine how well application I/O sizes match the file system block size and buffering configuration. For example, the following mmpmon input file issues the ver, reset, fs_io_s, and rhist s requests; part of the resulting output is shown after it:
ver
reset
fs_io_s
rhist s
opens: 0
closes: 0
reads: 0
writes: 0
readdir: 0
inode updates: 0
mmpmon node 199.18.1.8 name node1 rhist s OK read timestamp 1129770175/951117
mmpmon node 199.18.1.8 name node1 rhist s OK write timestamp 1129770175/951125
mmpmon node 199.18.1.8 name node1 fs_io_s OK
cluster: node1.localdomain
filesystem: gpfs1
disks: 1
timestamp: 1129770180/952462
bytes read: 0
bytes written: 0
opens: 0
closes: 0
reads: 0
writes: 0
readdir: 0
inode updates: 0
For information on interpreting mmpmon output results, see “Other information about mmpmon output” on
page 103.
Collector
In older versions of IBM Storage Scale, the performance monitoring tool could be configured with only a single collector, which supported up to 150 sensor nodes. The performance monitoring tool can now be configured with multiple collectors to increase scalability and fault tolerance; this configuration is referred to as multi-collector federation.
In a multi-collector federated configuration, the collectors need to be aware of each other. Otherwise,
a collector returns only the data that is stored in its own measurement database. When the collectors
are aware of their peer collectors, they can collaborate with each other to collate measurement data for
a specific measurement query. All collectors that are part of the federation are specified in the peers
configuration option in the collector’s configuration file as shown in the following example:
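For example, using placeholder host names (this is the same peers format that is shown in the federation setup later in this chapter):
peers = {
host = "collector1.mydomain.com"
port = "9085"
}, {
host = "collector2.mydomain.com"
port = "9085"
}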
The port number is the one specified by the federationport configuration option, typically set to 9085.
You can also list the current host so that the same configuration file can be used for all the collectors.
Note: A Linux operating system user, scalepm, is added to the host. The pmcollector process runs in the context of this user. However, the scalepm ID does not have the privilege to log in to the system.
When the peers are specified, any query for measurement data is directed to any of the collectors that
are listed in the peers section. The collector collects and assembles a response based on all relevant data
from all collectors. Hence, clients need to contact only a single collector instead of all of them to get all
the measurements available in the system.
To distribute the measurement data reported by sensors over multiple collectors, multiple collectors
might be specified when the sensors are configured.
If multiple collectors are specified, the sensors pick one to report their measurement data to. The sensors
use stable hashes to pick the collector such that the sensor-collector relationship does not change too
much if new collectors are added or if a collector is removed.
Additionally, sensors and collectors can be configured for high availability. In this setting, sensors report
their measurement data to more than one collector such that the failure of a single collector would
not lead to any data loss. For instance, if the collector redundancy is increased to two, every sensor
reports to two collectors. As a side-effect of increasing the redundancy to two, the bandwidth that
is used for reporting measurement data is duplicated. The collector redundancy must be configured
before the sensor configuration is stored in IBM Storage Scale by changing the colRedundancy option
in /opt/IBM/zimon/ZIMonSensors.cfg.
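For example, a minimal sketch of that setting, assuming the same option = value syntax that is used elsewhere in the sensor configuration file:
colRedundancy = 2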
Sensor
A sensor is a component that collects performance data from a node. Typically, multiple sensors run on each node from which metrics must be collected. By default, the sensors are started on every node.
Proxy
A proxy is run for each of the protocols to collect the metrics for that protocol.
By default, the NFS and SMB proxies are started automatically with those protocols. They do not need
to be started or stopped. However, to retrieve metrics for SMB, NFS, or Object, these protocols must be
active on the specific node.
For information, see the “Enabling protocol metrics” on page 152 topic.
For information on enabling Transparent cloud tiering metrics, see Integrating Transparent Cloud Tiering
metrics with performance monitoring tool in IBM Storage Scale: Administration Guide.
Important: When the performance monitoring tool is used, ensure that the clocks of all of the nodes in
the cluster are synchronized. The Network Time Protocol (NTP) must be configured on all nodes.
Note: Performance monitoring information that is driven by the IBM Storage Scale internal monitoring tool and monitoring that users drive by using the mmpmon command might affect each other.
Related concepts
Network performance monitoring
Network performance can be monitored either by using Remote Procedure Call (RPC) statistics or it can
be monitored by using the IBM Storage Scale graphical user interface (GUI).
Monitoring I/O performance with the mmpmon command
Use the mmpmon command to monitor the I/O performance of IBM Storage Scale on the node on which it
is run and on other specified nodes.
Viewing and analyzing the performance data
The performance monitoring tool displays the performance metrics that are associated with GPFS and
the associated protocols. It helps you get a graphical representation of the status and trends of the key
performance indicators, and analyze IBM Storage Scale performance problems.
If mmperfmon config show does not show any configuration and no nodes are designated as perfmon nodes, the configuration can be managed manually.
Automated configuration
In the performance monitoring tool, sensors can be configured on nodes that are part of an IBM Storage
Scale cluster through an IBM Storage Scale based configuration mechanism. However, this requires the
execution of the mmchconfig release=LATEST command.
The automated configuration method allows the sensor configuration to be stored as part
of the IBM Storage Scale configuration. Automated configuration is available for the sensor
configuration files, /opt/IBM/zimon/ZIMonSensors.cfg, and partly for the collector configuration
files, /opt/IBM/zimon/ZIMonCollector.cfg. Only the peers section for federation is available for
the collector configuration files. In this setup, the /opt/IBM/zimon/ZIMonSensors.cfg configuration
file on each IBM Storage Scale node is maintained by IBM Storage Scale. As a result, the file must not be
edited manually because whenever IBM Storage Scale needs to update a configuration parameter, the file
is regenerated and any manual modifications are overwritten. Before using the automated configuration, an initial sensor configuration must be stored in IBM Storage Scale, for example by using the mmperfmon config generate command. The stored sensor configuration lists the sensor groups to run, as in the following example:
sensors =
{
name = "CPU"
period = 1
},
{
name = "Load"
period = 1
},
{
name = "Memory"
period = 1
},
{
name = "Network"
period = 1
filter = "eth*"
},
{
name = "Netstat"
period = 1
},
The period in the example specifies the interval, in seconds, at which a sensor group gathers data. A value of 0 disables the sensor group, and a value of 1 runs it every second. You can specify a higher value to decrease the frequency at which the data is collected.
Whenever the configuration file is changed, you must stop and restart the pmsensors daemon by using the following commands:
1. Issue the systemctl stop pmsensors command to stop (deactivate) the sensors.
2. Issue the systemctl start pmsensors command to restart (activate) the sensors.
Some sensors, such as the cluster export services sensors, run on a specific set of nodes. Other sensors, such as the GPFSDiskCap sensor, must run on a single node in the cluster because the reported data is the same regardless of the node on which the sensor runs. The restrict function is intended for these types of sensors. For example, to restrict an NFSIO sensor to a node class and change the reporting period to once every 10 hours, you can specify NFSIO.period=36000 NFSIO.restrict=nodeclass1 as attribute-value pairs in the update command.
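A sketch of the corresponding command, using the mmperfmon config update form referenced in this topic (the node class name nodeclass1 is taken from the example above):
mmperfmon config update NFSIO.period=36000 NFSIO.restrict=nodeclass1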
Some sensors, such as VFS, are not enabled by default even though they have associated predefined queries for the mmperfmon query command. This is because the collector itself might develop performance issues if it is required to collect more than 1,000,000 metrics per second. To enable VFS statistics, use the mmfsadm vfsstats enable command on the node. To enable a sensor, set the period value to an integer greater than 0 and restart the sensors on that node by using the systemctl restart pmsensors command.
{ filter = "netdev_name=veth.*"
name = "Network"
period = 1},
{filter = "mountPoint=/var/lib/docker.*"
name = "DiskFree"
period = 600}
You can update the filter using the mmperfmon config update command.
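For example, a sketch of such an update, assuming the same Sensor.attribute=value form shown above (exact shell quoting may vary):
mmperfmon config update DiskFree.filter="mountPoint=/var/lib/docker.*"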
sensors = {
name = "NFSIO"
period = 0
restrict = "cesNodes"
type = "Generic"
},
{
name = "SMBStats"
period = 1
restrict = "cesNodes"
type = "Generic"
}
Ensure that the sensors are added and listed as part of the performance monitoring configuration. Run the mmperfmon config add command to add the sensor to the configuration, as in the following example:
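For example, assuming the new sensor definitions are stored in a file (the path below is a placeholder; the command form matches the GPFSBufMgr example later in this chapter):
mmperfmon config add --sensors /opt/IBM/zimon/mySensors.cfg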
If any of the sensors that are mentioned in the file already exist, they are listed in the output of the command and ignored, and the existing sensor configuration is kept. After the sensor is added to the configuration file, its configuration settings can be updated by using the mmperfmon config update command.
A sensor can be removed from the configuration by using the mmperfmon config delete command.
Note: There are two newer sensors, GPFSPool and GPFSFileset, for the pmsensor service. If an older version of the IBM Storage Scale performance monitoring system is upgraded, these sensors are not enabled automatically, because enabling them might cause the collectors to consume more main memory than was set aside for monitoring. Changing the memory footprint of the collector database might cause issues for users if the collectors are tightly configured.
For information on how to manually configure the performance monitoring system (file-managed
configuration), see the Manual configuration section in the IBM Storage Scale: Administration Guide.
Related reference
“List of performance metrics” on page 117
Recommendations
• System entities that do not exist anymore leave their key signature and their historical data in the
collector database.
For example, if you delete a node, file system, file set, disk, and other resources from the cluster,
the key and data of the removed entity remains within the collector database. The same is valid if
you rename an entity (for example, rename a node). If your system entities are often short-lived and used only temporarily, such entities might cause a high impact on pmcollector memory consumption.
• You must check your system for such expired keys from time to time; use the mmperfmon command to list the expired keys and then delete them.
For more information about command usage, see mmperfmon command in the IBM Storage Scale:
Command and Programming Reference Guide.
Note: The historical data of the removed entities is not removed from the database automatically
with immediate effect because it is possible that the customer might need to query the
corresponding historical data later.
Configuring the initial sensor data poll delay for long periods
This section provides information on how to configure the initial sensor data poll delay for long periods.
Some performance monitoring sensors might invoke heavy-load tasks like executing the tsdf command.
The scheduled period of such sensors can be as high as once per hour or once per day, and the heavy-
load tasks can force the Performance Monitoring Tool (PMT) to change its configuration frequently. When
the PMT configuration changes frequently, the system triggers the sensors to restart at every instance.
The data domains are defined in the collector configuration file, /opt/IBM/zimon/ZIMonCollector.cfg, as shown in the following example:
domains = {
# this is the raw domain, aggregation factor for the raw domain is always 0
aggregation = 0
duration = "12h"
filesize = "1g" # maximum file size
files = 5 # number of files.
},
{
# this is the second domain that aggregates to 60 seconds
aggregation = 60
duration = "4w"
filesize = "500m" # maximum file size
files = 4 # number of files.
},
The configuration file lists several data domains. At least one domain must be present, and the first
domain represents the raw data collection as the data is collected by sensors. The aggregation parameter
for this first domain must be set to 0.
Each domain specifies the following parameters:
• The duration parameter indicates the time period until the collected metrics are pushed into the next
(coarser-grained) domain. If this option is left out, no limit on the duration is imposed. The units are
seconds, hours, days, weeks, months, and years { s, h, d, w, m, y }.
• The filesize and files parameters indicate how much space is allocated on disk for a specific
domain. When metrics are stored in memory, a persistence mechanism is in place that also stores
the metrics on disk in files of size filesize. Once the number of files is reached and a new file is
to be allocated, the oldest file is removed from the disk. The persistent storage must be at least as
large as the amount of main memory to be allocated for a domain. The size of the persistent storage is
significant because when the collector is restarted, the in-memory database is re-created from these
files.
If both the ram and the duration parameters are specified, both constraints are active at the same time. As soon as one of the constraints is reached, the collected metrics are pushed into the next
(coarser-grained) domain.
The aggregation value, which is used for the second and following domains, indicates the resampling to
be performed. When data is spilled into this domain, it is resampled to be no better than indicated by the
aggregation factor. The value for the second domain is in seconds, the value for domain n (n >2) is the
value of domain n-1 multiplied by the aggregation value of domain n.
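For example, if the second domain specifies aggregation = 60 and the third domain specifies aggregation = 10, data spilled into the third domain is resampled to no better than 60 x 10 = 600 seconds.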
The collector collects the metrics from the sensors. Depending on the number of nodes and metrics that
are collected, the collector requires a different amount of main memory to store the collected metrics in
the memory. For example, in a five-node cluster that reports only the load values (load1, load5, load15),
the collector maintains 15 metrics (three metrics times five nodes).
The collectors can be stopped (deactivated) by issuing the systemctl stop pmcollector command.
The collectors can be started (activated) by issuing the systemctl start pmcollector command.
peers = {
host = "collector1.mydomain.com"
port = "9085"
}, {
host = "collector2.mydomain.com"
port = "9085"
}
The port number is the one specified by the federationport configuration option, typically set to 9085.
It is acceptable to list the current host so that the same configuration file can be used for all the collector
machines.
After the peers are specified, a query for measurement data can be directed to any of the collectors listed
in the peers section. Also, the collectors collect and assemble a response that is based on all relevant
data from all collectors. Hence, clients only need to contact a single collector to get all the measurements
available in the system.
To distribute the measurement data reported by sensors over multiple collectors, multiple collectors
might be specified when automatically configuring the sensors, as shown in the following sample:
prompt# mmperfmon config generate \
--collectors collector1.domain.com,collector2.domain.com,…
If multiple collectors are specified, then the federation between these collectors is configured
automatically. The peers section in those collectors' configuration files, /opt/IBM/zimon/
ZIMonCollector.cfg, is also updated. The sensors pick one of the many collectors to report their
measurement data to. The sensors use stable hashes to pick the collector such that the sensor-collector
relationship does not change too much when new collectors are added or when a collector is removed.
Additionally, sensors and collectors can be configured for high availability. To maintain high availability,
each metric is sent to two collectors in case one collector becomes unavailable. In this setting, sensors
report their measurement data to more than one collector so that the failure of a single collector would
not lead to any data loss. For instance, if the collector redundancy is increased to two, then every sensor
reports to two collectors. As a side-effect of increasing the redundancy to two, the bandwidth that is used
for reporting measurement data is duplicated. The collector redundancy must be configured before the
sensor configuration is stored in GPFS by changing the colRedundancy option in /opt/IBM/zimon/
defaults/ZIMonSensors.cfg as explained in the “Configuring the sensor” on page 107 section.
Note: The federation interconnections require the IP address of peer collectors to be reverse-resolvable
to the long daemon name, which is used for the mmperfmon config --collectors option.
3. Disable the pmcollector service on the old collector node A.
4. Change the peers section in the /opt/IBM/zimon/ZIMonCollector.cfg file on all the collector nodes so that the new collector node B is included and the old collector node A is removed.
Note: You must not edit the other collectors in the peers section.
5. Change the collector setting for the sensors.
7. Move the complete data folder with its sub-folders from the old collector node A to the new collector node B.
8. Start the pmcollector service on the new collector node B by using the systemctl start pmcollector command.
Linux metrics
All network and general metrics are native without any computed metrics. The following section lists all
the Linux metrics:
CPU
The following section lists information about CPU in the system. For example, myMachine|CPU|
cpu_user.
• cpu_contexts: Number of context switches across all CPU cores.
• cpu_guest: Percentage of total CPU spent running a guest OS. Included in cpu_user.
• cpu_guest_nice: Percentage of total CPU spent running as nice guest OS. Included in cpu_nice.
DiskFree
The following section lists information about the free disk. Each mounted directory has a separate
section. For example, myMachine|DiskFree|myMount|df_free.
• df_free: Amount of free disk space on the file system.
• df_total: Amount of total disk space on the file system.
• df_used: Amount of used disk space on the file system.
Diskstat
The following section lists information about the disk status for each of the disks. For example,
myMachine|Diskstat|myDisk|disk_active_ios.
• disk_active_ios: Number of I/O operations currently in progress.
• disk_aveq: Weighted number of milliseconds spent doing I/Os.
• disk_io_time: Number of milliseconds the system spent doing I/O operation.
• disk_read_ios: Total number of read operations completed successfully.
• disk_read_merged: Number of (small) read operations that are merged into a larger read.
• disk_read_sect: Number of sectors read.
• disk_read_time: Amount of time in milliseconds spent reading.
• disk_write_ios: Number of write operations completed successfully.
• disk_write_merged: Number of (small) write operations that are merged into a larger write.
• disk_write_sect: Number of sectors written.
• disk_write_time: Amount of time in milliseconds spent writing.
Load
The following section lists information about the load statistics for a particular node. For example,
myMachine|Load|jobs.
• jobs: The total number of jobs that currently exist in the system.
• load1: The average load (number of jobs in the run queue) over the last minute.
• load15: The average load (number of jobs in the run queue) over the last 15 minutes.
• load5: The average load (number of jobs in the run queue) over the last 5 minutes.
Memory
The following section lists information about the memory statistics for a particular node. For example,
myMachine|Memory|mem_active.
• mem_active: Active memory that was recently accessed.
Netstat
The following section lists information about the network status for a particular node. For example,
myMachine|Netstat|ns_remote_bytes_r.
• ns_closewait: Number of connections in the TCP_CLOSE_WAIT state.
• ns_established: Number of connections in the TCP_ESTABLISHED state.
• ns_listen: Number of connections in the TCP_LISTEN state.
• ns_local_bytes_r: Number of bytes received from a local node to a local node.
• ns_local_bytes_s: Number of bytes sent from a local node to a local node.
• ns_localconn: Number of local connections from a local node to a local node.
• ns_remote_bytes_r: Number of bytes received by a local node from a remote node.
• ns_remote_bytes_s: Number of bytes sent from a local node to a remote node.
• ns_remoteconn: Number of remote connections from a local node to a remote node.
• ns_timewait: Number of connections in the TCP_TIME_WAIT state.
Network
The following section lists information about the network statistics per interface for a particular node. For
example, myMachine|Network|myInterface|netdev_bytes_r.
• netdev_bytes_r: Number of bytes received.
• netdev_bytes_s: Number of bytes sent.
• netdev_carrier: Number of carrier loss events.
• netdev_collisions: Number of collisions.
• netdev_compressed_r: Number of compressed frames received.
• netdev_compressed_s: Number of compressed packets sent.
• netdev_drops_r: Number of packets dropped while receiving.
TopProc sensor
The TopProc sensor collects aggregated resource consumption for the 10 most CPU consuming
processes. The sensor is different from other performance monitoring sensors. The statistics can be
checked by using the mmperfmon report top command instead of with the mmperfmon query
command. For more information about the mmperfmon command, see the mmperfmon command in the
IBM Storage Scale: Command and Programming Reference Guide.
For example,
Time node-21
2023-10-30-18:24:00 mmsysmon.py 28 13
pmcollector 13 8
tuned 12 3
mmfsd 12 107
java 9 49
pmsensors 7 1
vmtoolsd 5 1
kworker/1:1-ev 3 0
systemd 1 1
ksoftirqd/0 1 0
GPFS metrics
The following section lists all the GPFS metrics:
• “GPFSDisk” on page 121
• “GPFSDiskCap” on page 121
• “GPFSFileset” on page 122
• “GPFSFileSystem” on page 122
• “GPFSFileSystemAPI” on page 123
• “GPFSLROC” on page 124
• “GPFSNode” on page 125
• “GPFSNodeAPI” on page 126
• “GPFSNSDDisk” on page 126
• “GPFSNSDFS” on page 127
• “GPFSNSDPool” on page 127
GPFSDisk
For each NSD in the system, for example myMachine|GPFSDisk|myCluster|myFilesystem|myNSD|
gpfs_ds_bytes_read
• gpfs_ds_bytes_read: Number of bytes read.
• gpfs_ds_bytes_written: Number of bytes written.
• gpfs_ds_max_disk_wait_rd: The longest time spent waiting for a disk read operation.
• gpfs_ds_max_disk_wait_wr: The longest time spent waiting for a disk write operation.
• gpfs_ds_max_queue_wait_rd: The longest time between being enqueued for a disk read operation
and the completion of that operation.
• gpfs_ds_max_queue_wait_wr: The longest time between being enqueued for a disk write operation
and the completion of that operation.
• gpfs_ds_min_disk_wait_rd: The shortest time spent waiting for a disk read operation.
• gpfs_ds_min_disk_wait_wr: The shortest time spent waiting for a disk write operation.
• gpfs_ds_min_queue_wait_rd: The shortest time between being enqueued for a disk read operation
and the completion of that operation.
• gpfs_ds_min_queue_wait_wr: The shortest time between being enqueued for a disk write operation
and the completion of that operation.
• gpfs_ds_read_ops: Number of read operations.
• gpfs_ds_tot_disk_wait_rd: The total time in seconds spent waiting for disk read operations.
• gpfs_ds_tot_disk_wait_wr: The total time in seconds spent waiting for disk write operations.
• gpfs_ds_tot_queue_wait_rd: The total time that is spent between being enqueued for a read
operation and the completion of that operation.
• gpfs_ds_tot_queue_wait_wr: The total time that is spent between being enqueued for a write
operation and the completion of that operation.
• gpfs_ds_write_ops: Number of write operations.
GPFSDiskCap
Specifies the available disk space capacity on GPFS file systems per pool and per disk.
The key structure is:
<gpfs_cluster_name>|GPFSDiskCap|<gpfs_fs_name>|<gpfs_diskpool_name>|
<gpfs_disk_name>|<metric_name>
Following are the metrics (metric_name):
• gpfs_disk_disksize: Total size of disk.
GPFSInodeCap
Specifies the available inode capacity on GPFS file systems.
The key structure is:
<gpfs_cluster_name>|GPFSInodeCap|<gpfs_fs_name>|<metric_name>
Following are the metrics (metric_name):
• gpfs_fs_inode_used: Number of used inodes.
• gpfs_fs_inode_free: Number of free inodes.
• gpfs_fs_inode_alloc: Number of allocated inodes.
• gpfs_fs_inode_max: Maximum number of inodes.
Note: The GPFSInodeCap is not an independent sensor in the perfmon config, but a sub-sensor of the
GPFSDiskCap sensor.
GPFSPoolCap
Specifies the available disk space capacity on GPFS file systems per pool and per disk usage type.
The key structure is:
<gpfs_cluster_name>|GPFSPoolCap|<gpfs_fs_name>|<gpfs_diskpool_name>|
<gpfs_disk_usage_name>|<metric_name>
The gpfs_disk_usage_name can be either of the following values:
• dataAndMetadata
• dataOnly
• descOnly
• metadataOnly
Following are the metrics (metric_name):
• gpfs_pool_disksize: Total size of all disks for this usage type.
• gpfs_pool_free_fullkb: Total available disk space in full blocks for this usage type.
• gpfs_pool_free_fragkb: Total available space in fragments for this usage type.
Note: The GPFSPoolCap is not an independent sensor in the perfmon config, but a sub-sensor of the
GPFSDiskCap sensor.
GPFSFileset
For each independent fileset in the file system: Cluster name - GPFSFileset - filesystem name - fileset
name.
For example, myCluster|GPFSFileset|myFilesystem|myFileset|gpfs_fset_maxInodes.
• gpfs_fset_maxInodes: Maximum number of inodes for this independent fileset.
• gpfs_fset_freeInodes: Number of free inodes available for this independent fileset.
• gpfs_fset_allocInodes: Number of inodes allocated for this independent fileset.
GPFSFileSystem
For each file system, for example myMachine|GPFSFilesystem|myCluster|myFilesystem|
gpfs_fs_bytes_read
Note:
The behavior of the minimum and maximum wait time for read and write I/O to disk and the queue wait time (for example, metrics such as *max_disk_wait_wr and *max_queue_wait_wr) has changed. These metrics are now reset each time a sample is taken.
In previous releases, these metrics were the minimum or maximum values noted since the start of the mmfsd daemon and were reset only after the mmfsd daemon was restarted. Because the metrics are now reset each time a sample is taken, the maximum and minimum values reflect only the read and write instances within a single sensor period.
GPFSFileSystemAPI
These metrics give the following information for each file system (application view). For example,
myMachine|GPFSFilesystemAPI|myCluster|myFilesystem|gpfs_fis_bytes_read.
• gpfs_fis_bytes_read: Number of bytes read.
• gpfs_fis_bytes_written: Number of bytes written.
• gpfs_fis_close_calls: Number of close calls.
• gpfs_fis_disks: Number of disks in the file system.
• gpfs_fis_inodes_written: Number of inode updates to disk.
• gpfs_fis_open_calls: Number of open calls.
• gpfs_fis_read_calls: Number of read calls.
• gpfs_fis_readdir_calls: Number of readdir calls.
• gpfs_fis_write_calls: Number of write calls.
GPFSNode
These metrics give the following information for a particular node. For example, myNode|GPFSNode|
gpfs_ns_bytes_read.
• gpfs_ns_bytes_read: Number of bytes read.
• gpfs_ns_bytes_written: Number of bytes written.
• gpfs_ns_clusters: Number of clusters that are participating.
• gpfs_ns_disks: Number of disks in all mounted file systems.
GPFSNodeAPI
These metrics give the following information for a particular node from its application point of view. For
example, myMachine|GPFSNodeAPI|gpfs_is_bytes_read.
• gpfs_is_bytes_read: Number of bytes read.
• gpfs_is_bytes_written: Number of bytes written.
• gpfs_is_close_calls: Number of close calls.
• gpfs_is_inodes_written: Number of inode updates to disk.
• gpfs_is_open_calls: Number of open calls.
• gpfs_is_readDir_calls: Number of readdir calls.
• gpfs_is_read_calls: Number of read calls.
• gpfs_is_write_calls: Number of write calls.
GPFSNSDDisk
These metrics give the following information about each NSD disk on the NSD server. For example,
myMachine|GPFSNSDDisk|myNSDDisk|gpfs_nsdds_bytes_read.
• gpfs_nsdds_bytes_read: Number of bytes read.
GPFSNSDFS
These metrics give the following information for each file system served by a specific NSD server. For
example, myMachine|GPFSNSDFS|myFilesystem|gpfs_nsdfs_bytes_read.
• gpfs_nsdfs_bytes_read: Number of NSD bytes read, aggregated to the file system.
• gpfs_nsdfs_bytes_written: Number of NSD bytes written, aggregated to the file system.
• gpfs_nsdfs_read_ops: Number of NSD read operations, aggregated to the file system.
• gpfs_nsdfs_write_ops: Number of NSD write operations, aggregated to the file system.
GPFSNSDPool
These metrics give the following information for each file system and pool that is served
by a specific NSD server. For example, myMachine|GPFSNSDPool|myFilesystem|myPool|
gpfs_nsdpool_bytes_read.
• gpfs_nsdpool_bytes_read: Number of NSD bytes read, aggregated to the pool.
• gpfs_nsdpool_bytes_written: Number of NSD bytes written, aggregated to the pool.
• gpfs_nsdpool_read_ops: Number of NSD read operations, aggregated to the pool.
• gpfs_nsdpool_write_ops: Number of NSD write operations, aggregated to the pool.
GPFSPool
For each pool in each file system: Cluster name - GPFSPool - filesystem name -pool name.
For example, myCluster|GPFSPool|myFilesystem|myPool|gpfs_pool_free_dataKB.
• gpfs_pool_free_dataKB: Free capacity for data (in KB) in the pool.
• gpfs_pool_total_dataKB: Total capacity for data (in KB) in the pool.
• gpfs_pool_free_metaKB: Free capacity for metadata (in KB) in the pool.
• gpfs_pool_total_metaKB: Total capacity for metadata (in KB) in the pool.
GPFSPoolIO
These metrics give the details about each cluster, file system, and pool in the system, from the point of
view of a specific node. For example, myMachine|GPFSPoolIO|myCluster|myFilesystem|myPool|
gpfs_pool_bytes_rd
• gpfs_pool_bytes_rd: Bytes read from the pool.
• gpfs_pool_bytes_wr: Bytes written to the pool.
GPFSQoS
These metrics give the following information for each QoS class in the system: Cluster name - GPFSQoS - filesystem name - storage pool name - QoS class - fileset name. For example, myCluster|GPFSQoS|myFilesystem|data|misc|myFileset|gpfs_qos_iops.
GPFSVFSX
The GPFSVFSX sensor provides virtual file system operations statistics, including the number of
operations and their average, minimum and maximum latency metrics.
Some performance monitoring sensors, such as VFS, are not enabled by default, even though they have
predefined queries that are associated with the mmperfmon query command. This happens because the
collector might have performance issues when it is required to collect more than a million metrics per
second.
To enable the VFS statistics, use the mmfsadm vfsstats enable command on the node. Similarly, to enable the GPFSVFSX sensor, set its period value to an integer greater than zero. Later, the VFS statistics can be disabled by using the mmfsadm vfsstats disable command, and the GPFSVFSX sensor can be disabled by setting its period back to zero.
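A sketch of the corresponding commands, assuming the same Sensor.attribute=value update form used for other sensors in this chapter (the period value 10 is an arbitrary choice):
mmfsadm vfsstats enable
mmperfmon config update GPFSVFSX.period=10
mmfsadm vfsstats disable
mmperfmon config update GPFSVFSX.period=0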
External calls to the mmpmon vfssx request interfere with the GPFSVFSX sensor. Because the mmpmon command resets the vfssx statistics after every call, both the GPFSVFSX sensor and the external caller might retrieve inaccurate data.
Note:
The GPFSVFSX read and write operation time metrics do not provide meaningful minimum and
maximum values. The gpfs_vfsx_read_tmin and gpfs_vfsx_read_tmax have the same values
as the summarized time, which is gpfs_vfsx_read_t. Similarly, the gpfs_vfsx_write_tmin,
gpfs_vfsx_write_tmax, and gpfs_vfsx_write_t have the same values.
This is due to the aggregation of various read sub-operations, such as gpfs_f_read, gpfs_f_readv,
and gpfs_f_aio_read (that is completed synchronously) into gpfs_vfsx_read. And also, due
to the aggregation of various write sub-operations, such as gpfs_f_write, gpfs_f_writev, and
gpfs_f_aio_write (that is completed synchronously) into gpfs_vfsx_write.
GPFSVIO64
These metrics provide details of the virtual I/O server (VIOS) operations, where VIOS is supported.
Note: GPFSVIO64 is a replacement for GPFSVIO sensor and uses 64-bit counters.
• gpfs_vio64_fixitOps: Number of VIO fix-strip operations with a read medium error.
• gpfs_vio64_flushUpWrOps: Number of VIO flush update operations.
• gpfs_vio64_flushPFTWOps: Number of VIO flush promoted full-track write operations.
• gpfs_vio64_forceConsOps: Number of VIO force consistency operations.
• gpfs_vio64_FTWOps: Number of VIO full-track write operations.
• gpfs_vio64_logTipReadOps: Number of VIO log-tip read operations.
• gpfs_vio64_logHomeReadOps: Number of VIO log-home read operations.
• gpfs_vio64_logWriteOps: Number of VIO log write operations.
• gpfs_vio64_medWriteOps: Number of VIO medium write operations.
• gpfs_vio64_metaWriteOps: Number of recovery group metadata write operations.
• gpfs_vio64_migrateTrimOps: Number of migrate trim operations.
• gpfs_vio64_migratedOps: Number of VIO strip migration operations.
• gpfs_vio64_promFTWOps: Number of VIO promoted full-track write operations.
• gpfs_vio64_ptrackTrimOps: Number of ptrack trim operations.
• gpfs_vio64_readCacheHit: Number of read cache hits.
• gpfs_vio64_readCacheMiss: Number of read cache misses.
• gpfs_vio64_readOps: Number of VIO read operations.
• gpfs_vio64_RGDWriteOps: Number of recovery group descriptor write operations.
• gpfs_vio64_scrubOps: Number of VIO scrub operations.
• gpfs_vio64_shortWriteOps: Number of VIO short write operations.
• gpfs_vio64_vtrackTrimOps: Number of vtrack trim operations.
Note: To report the new sensor data, the pmcollector must have the same or a higher code version than
the pmsensors module. Otherwise, it ignores the data of the new sensors.
GPFSFilesetQuota
The following metrics provide details of a fileset quota:
• gpfs_rq_blk_current: Number of kilobytes currently in use.
• gpfs_rq_blk_soft_limit: Assigned soft quota limit.
• gpfs_rq_blk_hard_limit: Assigned hard quota limit.
• gpfs_rq_blk_in_doubt: Number of kilobytes in-doubt, availability not yet resolved.
• gpfs_rq_file_current: Number of files (inodes) currently in use.
• gpfs_rq_file_soft_limit: Assigned soft quota limit.
• gpfs_rq_file_hard_limit: Assigned hard quota limit.
• gpfs_rq_file_in_doubt: Number of files (inodes) in-doubt, availability not yet resolved.
GPFSBufMgr
The following metric provides the current size of a page pool:
• gpfs_bufm_tot_poolSizeK: Total size of the page pool.
Note: To activate the sensor on upgraded systems, run the mmperfmon config add --sensors /opt/IBM/zimon/defaults/ZIMonSensors_GPFSBufMgr.cfg command.
GPFSRPCS
Each metric that the GPFSRPCS sensor provides for a node is combined over all the peers to which the
node is connected. An average is a weighted average over all the peer connections. A minimum is the
minimum of the minimum for all the peer connections, and a maximum is the maximum of the maximum
for all the peer connections.
The GPFSRPCS metrics are:
• gpfs_rpcs_chn_av: The average amount of time the RPC must wait for access to a communication
channel to the target node.
Computed Metrics
These metrics can be used only through the mmperfmon query command. The following metrics are
computed for GPFS:
• gpfs_create_avg_lat (latency): gpfs_vfs_create_t / gpfs_vfs_create
• gpfs_read_avg_lat (latency): gpfs_vfs_read_t / gpfs_vfs_read
• gpfs_remove_avg_lat (latency): gpfs_vfs_remove_t / gpfs_vfs_remove
• gpfs_write_avg_lat (latency): gpfs_vfs_write_t / gpfs_vfs_write
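For example, a computed metric can be retrieved with a query of the same form that is shown for object metrics later in this chapter (the time range is illustrative):
mmperfmon query gpfs_write_avg_lat 2016-09-28-09:56:39 2016-09-28-09:56:43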
Important: Performance monitoring that is driven by IBM Storage Scale's internal monitoring tool and monitoring that is driven by users through the mmpmon command might affect each other.
AFM metrics
You can use AFM metrics only when GPFS is configured on your system. The following section lists all the
AFM metrics.
GPFSAFM
• gpfs_afm_avg_time: Average time in seconds that a pending operation waited in the gateway queue
before it is sent to remote system.
• gpfs_afm_bytes_pending: Total number of bytes pending, which are not yet written to the remote
system.
NFS metrics
The following section lists all the NFS metrics.
NFSIO
• nfs_read_req: Number of bytes that are requested for reading.
• nfs_write_req: Number of bytes that are requested for writing.
• nfs_read: Number of bytes that are transferred for reading.
• nfs_write: Number of bytes that are transferred for writing.
• nfs_read_ops: Number of total read operations.
• nfs_write_ops: Number of total write operations.
• nfs_read_err: Number of erroneous read operations.
• nfs_write_err: Number of erroneous write operations.
• nfs_read_lat: Time that is used by read operations (in nanoseconds).
• nfs_write_lat: Time that is used by write operations (in nanoseconds).
• nfs_read_queue: Time that is spent in the RPC waiting queue.
• nfs_write_queue: Time that is spent in the RPC waiting queue.
Computed metrics
The following metrics are computed for NFS and can be used only with the mmperfmon query
command.
• nfs_total_ops: nfs_read_ops + nfs_write_ops
• nfsIOlatencyRead: (nfs_read_lat + nfs_read_queue) / nfs_read_ops
• nfsIOlatencyWrite: (nfs_write_lat + nfs_write_queue) / nfs_write_ops
• nfsReadOpThroughput: nfs_read/nfs_read_ops
• nfsWriteOpThroughput: nfs_write/nfs_write_ops
Object metrics
The following section lists all the object metrics:
Important:
• CES Swift Object protocol feature is not supported from IBM Storage Scale 5.1.9 onwards.
• IBM Storage Scale 5.1.8 is the last release that has CES Swift Object protocol.
• IBM Storage Scale 5.1.9 will tolerate the update of a CES node from IBM Storage Scale 5.1.8.
– Tolerate means:
- The CES node will be updated to 5.1.9.
- Swift Object support will not be updated as part of the 5.1.9 update.
- You may continue to use the version of Swift Object protocol that was provided in IBM Storage
Scale 5.1.8 on the CES 5.1.9 node.
- IBM will provide usage and known defect support for the version of Swift Object that was provided
in IBM Storage Scale 5.1.8 until you migrate to a supported object solution that IBM Storage Scale
provides.
• Contact IBM for further details and migration planning.
SwiftContainer
• container_auditor_time: Timing the data for each container audit.
• container_DEL_err_time: Timing the data for DELETE request errors like bad request, not
mounted, missing timestamp, or conflict.
• container_DEL_time: Timing the data for each DELETE request, which does not result in an error.
• container_GET_err_time: Timing the data for GET request errors like bad request, not mounted,
parameters not utf8, or bad accept header.
• container_GET_time: Timing data for each GET request, which does not result in an error.
• container_HEAD_err_time: Timing the data for HEAD request errors like bad request or not
mounted.
• container_HEAD_time: Timing the data for each HEAD request, which does not result in an error.
• container_POST_err_time: Timing the data for POST request errors like bad request, bad x-
container-sync-to, or not mounted.
• container_POST_time: Timing the data for each POST request, which does not result in an error.
• container_PUT_err_time: Timing the data for PUT request errors like bad request, missing
timestamp, not mounted, or conflict.
• container_PUT_time: Timing the data for each PUT request, which does not result in an error.
• container_REPLICATE_err_time: Timing the data for REPLICATE request errors like bad request
or not mounted.
• container_REPLICATE_time: Timing the data for each REPLICATE request, which does not result in
an error.
SwiftObject
• object_auditor_time: Timing the data for each object audit (does not include any rate-
limiting sleep time for max_files_per_second, but does include rate-limiting sleep time for
max_bytes_per_second).
• object_DEL_err_time: Timing the data for DELETE request errors like bad request, missing
timestamp, not mounted, or precondition that failed. Includes requests, which did not find or match
the object.
• object_DEL_time: Timing the data for each DELETE request, which does not result in an error.
• object_expirer_time: Timing the data for each object expiration attempt that includes ones, which
result in an error.
• object_GET_err_time: Timing the data for GET request errors like bad request, not mounted, header
timestamps before the epoch, or precondition failed. File errors, which result in a quarantine, are not
counted here.
• object_GET_time: Timing the data for each GET request, which did not result in an error. Includes
requests, which did not find the object, such as disk errors that result in file quarantine.
• object_HEAD_err_time: Timing the data for HEAD request errors like bad request or not mounted.
• object_HEAD_time: Timing the data for each HEAD request, which did not result in an error. Includes
requests, which did not find the object, such as disk errors that result in file quarantine.
• object_POST_err_time: Timing the data for POST request errors like bad request, missing
timestamp, delete-at in past, or not mounted.
• object_POST_time: Timing the data for each POST request, which did not result in an error.
• object_PUT_err_time: Timing the data for PUT request errors like bad request, not mounted,
missing timestamp, object creation constraint violation, or delete-at in past.
• object_PUT_time: Timing the data for each PUT request, which did not result in an error.
• object_REPLICATE_err_time: Timing the data for REPLICATE request errors like bad request or not
mounted.
• object_REPLICATE_time: Timing the data for each REPLICATE request, which did not result in an
error.
• object_replicator_partition_delete_time: Timing the data for partitions that are replicated
to another node because they do not belong to this node. This metric is not tracked per device.
• object_replicator_partition_update_time: Timing the data for partitions replicated that also
belong on this node. This metric is not tracked per device.
• object_updater_time: Timing the data for object sweeps to flush async_pending container updates.
It does not include object sweeps that did not find an existing async_pending storage directory.
SwiftProxy
• proxy_account_GET_bytes: The sum of bytes that are transferred in (from clients) and out (to
clients) for requests 200, which is a standard response for successful HTTP requests.
• proxy_account_GET_time: Timing the data for GET request, start to finish, 200, which is a standard
response for successful HTTP requests.
SMB metrics
The following section lists all the SMB metrics.
SMBGlobalStats
• connect count: Number of connections since start of the parent smbd process.
SMB2 metrics
The SMB2 metrics are available for all SMB2 requests, such as create, read, write, and find.
• op_count: Number of times the corresponding SMB request is called.
• op_idle
– For notify: Time that is taken between a notification request and the sending of a corresponding notification.
– For oplock breaks: Time spent waiting until an oplock is broken.
– For all others, the value is always zero.
• op_inbytes: Number of bytes that are received for the corresponding request that includes protocol
headers.
• op_outbytes: Number of bytes that are sent for the corresponding request that includes protocol
headers.
• op_time: The total amount of time that is spent for all corresponding SMB2 requests.
CTDB metrics
The following section lists all the CTDB metrics:
• CTDB version: Version of the CTDB protocol used by the node.
• Current time of statistics: Time when the statistics are generated. This is useful when collecting
statistics output periodically for post-processing.
• Statistics collected since: Time when CTDB was started or the last time statistics was reset. The
output shows the duration and the timestamp.
• num_clients: Number of processes currently connected to CTDB's UNIX socket. This includes recovery
daemon, CTDB tool and SMB processes (smbd, winbindd).
Cloud services
• mcs_total_bytes: Total number of bytes that are uploaded to or downloaded from the cloud storage
tier.
• mcs_total_failed_operations: The total number of failed PUT or GET operations.
• mcs_total_failed_requests: Total number of failed migration, recall, or remove requests.
ESS metrics
The following section lists the GPFSFCM metrics for ESS and IBM Storage Scale:
• “GPFSFCM” on page 148
GPFSFCM
• gpfs_fcm_pdisk_capacity: The capacity that is calculated by GNR but not obtained directly from
the disk. Usually a little less than the size of the disk. In bytes.
Note:
The GPFSFCM sensor is disabled in the default configuration when the sensor is installed for the first time. To enable the sensor on systems with FCM3 devices, issue the following commands:
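A sketch of the commands, following the general enablement pattern described earlier in this chapter (set a nonzero period and restart the sensors; the period value 600 is an arbitrary choice):
mmperfmon config update GPFSFCM.period=600
systemctl restart pmsensors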
GPFSFabricHospital
The sensor provides the following three GPFSFabricHospital metrics:
• gpfs_fabhospital_totalIOCount: Number of I/O operations done on this path. This number
includes failing and successful I/O.
• gpfs_fabhospital_errorIOCount: Number of I/O errors encountered by the Linux block layer on
this path.
• gpfs_fabhospital_deviceErrorIOCount: Number of I/O device errors found by the disk hospital
during diagnosing I/O errors.
Currently, the calculation of the sum, average, count, minimum, and maximum is only applicable for the
following object metrics:
• account_HEAD_time
• account_GET_time
• account_PUT_time
• account_POST_time
• account_DEL_time
• container_HEAD_time
• container_GET_time
• container_PUT_time
• container_POST_time
• container_DEL_time
• object_HEAD_time
• object_GET_time
• object_PUT_time
• object_POST_time
• object_DEL_time
• proxy_account_latency
Use the following command to run an objObj query for object metrics. This command calculates and prints the sum, average, count, minimum, and maximum of the metric data of the objObj object for the metrics mentioned.
mmperfmon query objObj 2016-09-28-09:56:39 2016-09-28-09:56:43
1: cluster1.ibm.com|SwiftObject|object_auditor_time
2: cluster1.ibm.com|SwiftObject|object_expirer_time
3: cluster1.ibm.com|SwiftObject|object_replication_partition_delete_time
4: cluster1.ibm.com|SwiftObject|object_replication_partition_update_time
5: cluster1.ibm.com|SwiftObject|object_DEL_time
6: cluster1.ibm.com|SwiftObject|object_DEL_err_time
7: cluster1.ibm.com|SwiftObject|object_GET_time
8: cluster1.ibm.com|SwiftObject|object_GET_err_time
9: cluster1.ibm.com|SwiftObject|object_HEAD_time
{ name = "CTDBDBStats"
period = 1
type = "Generic"
},
{ name = "CTDBStats"
period = 1
type = "Generic"
},
{
# NFS Ganesha statistics
name = "NFSIO"
period = 1
type = "Generic"
},
{ name = "SMBGlobalStats"
period = 1
{ name = "SMBStats"
period = 1
type = "Generic"
},
At the time of installation, the object metrics proxy is configured to start by default on each Object
protocol node.
The object metrics proxy server, pmswiftd, is controlled by the corresponding service script, also called pmswiftd, located at /etc/rc.d/init.d/pmswiftd.service. You can start and stop the pmswiftd service script by using the systemctl start pmswiftd and systemctl stop pmswiftd commands respectively. You can also view the status of the pmswiftd service script by using the systemctl status pmswiftd command.
After a system restart, the object metrics proxy server restarts automatically. In case of a failover, the server starts automatically. If for some reason this does not occur, the server must be started manually by using the systemctl start pmswiftd command.
The following table shows you resource types and responsible sensors that are included in the
detectability validation procedure.
Table 31. Resource types and the sensors responsible for them
Resource type Responsible sensors
Filesets data GPFSFileset, GPFSFilesetQuota
Filesystem inodes data GPFSInodeCap
Pools data GPFSPool, GPFSPoolCap
Filesystem mounts data DiskFree, GPFSFilesystem, GPFSFilesystemAPI
Disks and NSD data GPFSDiskCap, GPFSNSDDisk
Nodes data CPU, GPFSNode, GPFSNodeAPI, GPFSRPCS, GPFSVFS, Load, Memory, Netstat, SwiftAccount, SwiftContainer, SwiftObject, SwiftProxy
Note: The identifiers from Network, Protocols, TCT, and CTDB sensor data are not included in the
detectability validation and cleanup procedure.
Table 33. Performance monitoring options available in IBM Storage Scale GUI
Option Function
Monitoring > Statistics Displays performance of system resources and file
and object storage in various performance charts.
You can select the necessary charts and monitor
the performance based on the filter criteria.
The pre-defined performance widgets and metrics
help in investigating every node or any particular
node that is collecting the metrics.
The Statistics page is used for selecting the attributes based on which the performance of the system
needs to be monitored and comparing the performance based on the selected metrics. You can also mark
charts as favorite charts and these charts become available for selection when you add widgets in the
dashboard. You can display only two charts at a time in the Statistics page.
Favorite charts that are defined in the Statistics page and the predefined charts are available for selection
in the Dashboard.
You can configure the system to monitor the performance of the following functional areas in the system:
• Network
• System resources
• NSD server
• IBM Storage Scale client
You can use the Services > Performance Monitoring page to configure sensors. You can also use the
mmperfmon command to configure the performance data collection through the CLI. The GUI displays a
subset of the available metrics that are available in the performance monitoring tool.
If the selected node is in the DEGRADED state, then the CLUSTER_PERF_SENSOR is automatically
reconfigured to another node that is in the HEALTHY state. The performance monitoring service is
restarted on the previous and currently selected nodes. For more information, see Automatic assignment
of single node sensors in IBM Storage Scale: Problem Determination Guide.
Note: If the GPFSDiskCap sensor is frequently restarted, it can negatively impact the system performance. The GPFSDiskCap sensor can impact system performance in a similar way as the mmdf command. Therefore, it is advisable to use a dedicated healthy node rather than @CLUSTER_PERF_SENSOR in the restrict field of a single node sensor until the node stabilizes in the HEALTHY state. If you manually configure the restrict field of the capacity sensors, you must ensure that all the file systems on the specified node are mounted in order to record file system-related data, like capacity.
Use the Services > Performance Monitoring page to select the appropriate data collection periods for
these sensors.
For the GPFSDiskCap sensor, the recommended period is 86400, which means once per day. Because the
GPFSDiskCap sensor runs the mmdf command to get the capacity data, it is not recommended to set its
period to a value less than 10800 (every 3 hours). To show fileset capacity information, it is necessary to
enable quota for all file systems where fileset capacity must be monitored. For more information, see the
-q option in the mmchfs command and the mmcheckquota command.
To update the sensor configuration to trigger an hourly collection of fileset capacity information, run the
mmperfmon command as shown in the following example:
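A minimal sketch (GPFSFilesetQuota is the fileset capacity sensor from Table 31; a period of 3600 seconds
yields hourly collection):

mmperfmon config update GPFSFilesetQuota.period=3600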
AFM: GPFSAFM, GPFSAFMFS, GPFSAFMFSET (all nodes)
Transparent Cloud Tiering: MCStoreGPFSStats, MCStoreIcstoreStats, MCStoreLWEStats (cloud gateway nodes)
The resource type Waiters is used to monitor long-running file system threads. Waiters are
characterized by the purpose of the corresponding file system threads. For example, an RPC call waiter
that is waiting for network I/O threads, or a waiter that is waiting for a local disk I/O file system operation.
Each waiter has an associated wait time, which indicates how long the waiter has already been waiting.
With some exceptions, long waiters typically indicate that something in the system is not healthy.
The Waiters performance chart shows the aggregation of the total count of waiters of all nodes in the
cluster above a certain threshold. Different thresholds from 100 milliseconds to 60 seconds can be
selected in the list below the aggregation level. By default, the value shown in the graph is the sum
of the number of waiters that exceed the threshold on all nodes of the cluster at that point in time. The
filter functionality can be used to display waiters data only for some selected nodes or file systems.
Furthermore, there are separate metrics for different waiter types such as Local Disk I/O, Network I/O,
ThCond, ThMutex, Delay, and Syscall.
You can also monitor the capacity details that are aggregated at the following levels:
• NSD
• Node
• File system
• Pool
• Fileset
• Cluster
The following table lists the sensors that are used for capturing the capacity details.
Layout options
The highly customizable dashboard layout options helps to add or remove widgets and change its display
options. Select Layout Options option from the menu that is available in the upper right corner of the
Dashboard GUI page to change the layout options. While selecting the layout options, you can either
select the basic layouts that are available for selection or create a new layout by selecting an empty
layout as the starting point.
You can also save the dashboard so that it can be used by other users. Select the Create Dashboard and
Delete Dashboard options from the menu that is available in the upper right corner of the Dashboard
page to create and delete dashboards respectively. If several GUIs are running by using CCR, saved
dashboards are available on all nodes.
When you open the IBM Storage Scale GUI after the installation or upgrade, you can see the default
dashboards that are shipped with the product. You can further modify or delete the default dashboards to
suit your requirements.
Widget options
Several dashboard widgets can be added in the selected dashboard layout. Select Edit Widgets option
from the menu that is available in the upper right corner of the Dashboard GUI page to edit or remove
widgets in the dashboard. You can also modify the size of the widget in the edit mode. Use the Add
Widget option that is available in the edit mode to add widgets in the dashboard.
The widgets of type Performance list the charts that are marked as favorite charts in the Statistics
page of the GUI. Favorite charts, along with the predefined charts, are available for selection when you
add widgets in the dashboard.
Legend:
1: mr-31.localnet.com|Network|eth0|netdev_bytes_r
2: mr-31.localnet.com|Network|eth1|netdev_bytes_r
3: mr-31.localnet.com|Network|lo|netdev_bytes_r
The sensor gets the performance data for the collector, and the collector passes it to the performance
monitoring tool to display it in the CLI and the GUI. If sensors and collectors are not enabled in the system,
the system does not display the performance data, and when you try to query data from a system
resource, it returns an error message. For example, if performance monitoring tools are not configured
properly for the resource type Transparent Cloud Tiering, the system displays the following output while
querying the performance data:
For more information on how to troubleshoot the performance data issues, see Chapter 30, “Performance
issues,” on page 479.
For more information, see the mmperfmon command in IBM Storage Scale: Command and Programming Reference Guide.
GPFS
GPFS metric queries give an overall view of the GPFS without considering the protocols.
• gpfsCRUDopsLatency: Retrieves information about the GPFS create, retrieve, update, and delete
operations latency.
• gpfsFSWaits: Retrieves information on the maximum waits for read and write operations for all file
systems.
• gpfsNSDWaits: Retrieves information on the maximum waits for read and write operations for all
disks.
• gpfsNumberOperations: Retrieves the number of operations to the GPFS file system.
• gpfsVFSOpCounts: Retrieves VFS operation counts.
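For example, one of these predefined queries can be run directly with the mmperfmon CLI; a minimal
sketch (the query name is taken from the list above):

mmperfmon query gpfsNumberOperations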
Cross protocol
These queries retrieve information after metrics are compared between different protocols on a particular
node.
• protocolIOLatency: Compares latency per protocol (SMB, NFS, and Object).
• protocolIORate: Retrieves the percentage of total I/O rate per protocol (SMB, NFS, and Object).
• protocolThroughput: Retrieves the percentage of total throughput per protocol (SMB, NFS, and
Object).
NFS
These queries retrieve metrics associated with the NFS protocol.
• nfsIOLatency: Retrieves the NFS I/O latency in nanoseconds.
• nfsIORate: Retrieves the NFS I/O operations per second (NFS IOPS).
• nfsThroughput: Retrieves the NFS throughput in bytes per second.
• nfsErrors: Retrieves the NFS error count for read and write operations.
• nfsQueue: Retrieves the NFS read and write queue latency in nanoseconds.
• nfsThroughputPerOp: Retrieves the NFS read and write throughput per operation in bytes.
Object
• objAcc: Details of the Object account performance.
Retrieved metrics:
– account_auditor_time
– account_reaper_time
SMB
These queries retrieve metrics associated with SMB.
• smb2IOLatency: Retrieves the SMB2 I/O latencies per bucket size (default 1 sec).
• smb2IORate: Retrieves the SMB2 I/O rate in number of operations per bucket size (default 1 sec).
• smb2Throughput: Retrieves the SMB2 Throughput in bytes per bucket size (default 1 sec).
• smb2Writes: Retrieves count, # of idle calls, bytes in and out, and operation time for SMB2 writes.
• smbConnections: Retrieves the number of SMB connections.
CTDB
These queries retrieve metrics associated with CTDB.
• ctdbCallLatency: Retrieves information on the CTDB call latency.
• ctdbHopCountDetails: Retrieves information on the CTDB hop count buckets 0 - 5 for one database.
For more information about the various options available with the mmhealth command, see mmhealth
command in IBM Storage Scale: Command and Programming Reference Guide.
For more information, see GPUDirect Storage troubleshooting topic in IBM Storage Scale: Problem
Determination Guide.
See the following table for the explanation of the cache state:
Use the following mmhealth command to display the health status of all the monitored AFM components
in the cluster:
# mmhealth cluster show AFM
Node name: p7fbn10.gpfs.net
There are no active error events for the component AFM on this node (p7fbn10.gpfs.net).
p7fbn10 Wed Mar 15 04:34:41 1]~# mmhealth node show AFM -Y
mmhealth:State:HEADER:version:reserved:reserved:node:component:entityname:entitytype:status:laststatuschange:
mmhealth:Event:HEADER:version:reserved:reserved:node:component:entityname:entitytype:event:arguments:
activesince:identifier:ishidden:
mmhealth:State:0:1:::p7fbn10.gpfs.net:NODE:p7fbn10.gpfs.net:NODE:DEGRADED:2017-03-11 18%3A48%3A20.600167 EDT:
mmhealth:State:0:1:::p7fbn10.gpfs.net:AFM:p7fbn10.gpfs.net:NODE:HEALTHY:2017-03-11 19%3A56%3A48.834633 EDT:
mmhealth:State:0:1:::p7fbn10.gpfs.net:AFM:fs1/p7fbn10ADR-5:FILESET:HEALTHY:2017-03-11 19%3A56%3A48.834753 EDT:
mmhealth:State:0:1:::p7fbn10.gpfs.net:AFM:fs1/p7fbn10ADR-4:FILESET:HEALTHY:2017-03-11 19%3A56%3A19.086918 EDT:
# mmdelcallback callback3
Monitoring performance
You can use mmperfmon and mmpmon commands to monitor AFM and AFM DR.
This command shows statistics from the time the gateway started functioning. Every gateway recycle
resets the statistics.
The following example is from an AFM Gateway node. The example shows how many operations of
each type were executed on the gateway node.
3. Enable the monitoring tool on the gateway nodes to set the collection periods to 10 or higher:
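A minimal sketch (the AFM sensor names are those listed earlier in this chapter; the period value follows
the guidance above):

mmperfmon config update GPFSAFM.period=10 GPFSAFMFS.period=10 GPFSAFMFSET.period=10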
Note: You can use the GUI or the Grafana bridge to query collected data.
Monitoring prefetch
You can display the status of an AFM prefetch request by running the mmafmctl prefetch command
without the list-file option.
For example, for file system gpfs1 and fileset iw_1, run the following command:
# mmafmctl gpfs1 prefetch -j iw_1
Fileset Name Async Read (Pending) Async Read (Failed) Async Read (Already Cached) Async Read(Total)
Async Read (Data in Bytes)
------------ -------------------------------------- ------------------ ------------------------
iw_1 11 0 0 11 0
This output shows that 11 inodes must be prefetched (Async Read (Pending)). When the
job has completed, the status command displays:
# mmafmctl gpfs1 prefetch -j iw_1
Fileset Name Async Read (Pending) Async Read (Failed) Async Read (Already Cached) Async Read(Total)
Async Read (Data in Bytes)
------------ -------------------------------------- ------------------ ------------------------
iw_1 0 0 10 11
• Use the following mmdiag --afm command to display only the specified fileset's relationship:
# mmdiag --afm fileset=cache_fs0:fileset_2
The system displays output similar to -
• Use the following mmdiag --afm command to display detailed gateway-specific attributes:
# mmdiag --afm gw
The system displays output similar to -
• Use the mmdiag --afm command to display all active filesets known to the gateway node:
# mmdiag --afm fileset=all
The system displays output similar to -
This list does not include files that are not cached. If partially-cached files do not exist, an output file is
not created.
4. The custom eviction policy:
The steps to use policies for AFM file eviction are: generate a list of files, and then run the eviction. The
following policy lists all the files that are managed by AFM and were not accessed in the last seven days.
RULE 'prefetch-list'
LIST 'toevict'
WHERE CURRENT_TIMESTAMP - ACCESS_TIME > INTERVAL '7' DAYS
AND REGEX(misc_attributes,'[P]') /* only list AFM managed files */
To limit the scope of the policy or to use it on different filesets, run mmapplypolicy by using a
directory path instead of a file system name:

/usr/lpp/mmfs/bin/mmapplypolicy $path -f $localworkdir -s $localworkdir -P $sharedworkdir/${policy} -I defer

Use mmafmctl to evict the files:

mmafmctl datafs evict --list-file $localworkdir/list.evict
5. A policy of uncached files:
a. The following example is of a LIST policy that generates a list of uncached files in the cache
directory:
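The example itself is not reproduced here; a minimal sketch, assuming (as in the eviction example above)
that the 'P' MISC_ATTRIBUTES flag marks AFM-managed files and that the 'u' flag marks cached files:

RULE 'uncached-list'
LIST 'uncached'
WHERE REGEX(misc_attributes,'[P]') /* only list AFM managed files */
AND NOT REGEX(misc_attributes,'[u]') /* 'u' assumed to mark cached files; exclude them */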
b. The following example is of a LIST policy that generates a list of files with size and attributes that
belong to the cache fileset (cacheFset1 is the name of the cache fileset in the example):
RULE 'all' LIST 'allfiles' FOR FILESET ('cacheFset1')
SHOW('/' || VARCHAR(kb_allocated) || '/' || VARCHAR(file_size) || '/' ||
VARCHAR(BLOCKSIZE) || '/' || VARCHAR(MISC_ATTRIBUTES))
Monitoring AFM and AFM DR configuration and performance in the remote cluster
The IBM Storage Scale GUI can monitor only a single cluster. If you want to monitor the AFM and AFM DR
configuration, health, and performance across clusters, the GUI node of the local cluster must establish
a connection with the GUI node of the remote cluster. By establishing a connection between GUI nodes,
both the clusters can monitor each other. To enable remote monitoring capability among clusters, the GUI
nodes that communicate with each other must be at the same software level.
To establish a connection with the remote cluster, perform the following steps:
1. Perform the following steps on the local cluster to raise the access request:
a. Select the Request Access option that is available under the Outgoing Requests tab to raise the
request for access.
b. In the Request Remote Cluster Access dialog, enter an alias for the remote cluster name and
specify the GUI nodes to which the local GUI node must establish the connection.
See the following table for the explanation of the cache state:
Table 41. AFM to cloud object storage states and their description

Inactive
Condition: The fileset is created.
Description: An AFM to cloud object storage fileset is created, or operations were not initiated on the cluster after the last daemon restart.
Healthy or Unhealthy: Healthy
Administrator's action: None

FlushOnly
Condition: Operations are queued.
Description: Operations have not started to flush.
Healthy or Unhealthy: Healthy
Administrator's action: This is a temporary state and should move to Active when a write is initiated.

Active
Condition: The fileset cache is active.
Description: The fileset is ready for an operation.
Healthy or Unhealthy: Healthy
Administrator's action: None

Dirty
Condition: The fileset is active.
Description: The pending changes in the fileset are not yet played on the cloud object storage.
Healthy or Unhealthy: Healthy
Administrator's action: None
No active error events for the component AFM to cloud object storage on the Node5GW node.
2. To display the health status of all the monitored AFM components in the cluster, use the mmhealth
command.
Monitoring performance
You can use mmperfmon and mmpmon commands to monitor AFM to cloud object storage.
This command shows statistics from the time the gateway started functioning. Every gateway recycle
resets the statistics.
The following example is from a gateway node. The example shows how many operations of each type
were run on the gateway node.
Where:
BytesWritten
The amount of data that is synchronized to the home.
3. Enable the monitoring tool on the gateway nodes to set the collection periods to 10 or higher.
Here, queued and failed objects are shown along with cached objects that are not yet fetched from the
cloud object storage; the total data is the approximate amount of downloaded data in bytes.
2. Upload the objects.
The number of objects that are uploaded is queued to the gateway and shown under the Queued
field.
Installing Net-SNMP
The SNMP subagent runs on the collector node of the GPFS cluster. The collector node is designated by
the system administrator.
For more information, see “Collector node administration” on page 215.
The Net-SNMP master agent (also called the SNMP daemon, or snmpd) must be installed on the
collector node to communicate with the GPFS subagent and with your SNMP management application.
Net-SNMP is included in most Linux distributions and should be supported by your Linux vendor. Source
and binaries for several platforms are available from the download section of the Net-SNMP website
(www.net-snmp.org/download.html).
Note: Currently, the collector node must run on the Linux operating system. For an up-to-date list of
supported operating systems, specific distributions, and other dependencies, refer to the IBM Storage
Scale FAQ in IBM Documentation.
The GPFS subagent expects to find the following shared object libraries:
Note: TCP Wrappers and OpenSSL are prerequisites and should have been installed when you installed
Net-SNMP.
TCP Wrappers is deprecated from RHEL 7 onwards and is not available from RHEL 8 onwards. You
can use firewalld as a firewall-level replacement for TCP Wrappers.
For example, RHEL79 system (ESS legacy nodes running ESS 6.1.5.1):
The installed libraries are found in /lib64, /usr/lib64, or /usr/local/lib64. They might be installed under
names like libnetsnmp.so.5.1.2. The GPFS subagent expects to find them without the appended version
information in the name. Library installation should create these symbolic links for you, so you rarely need
to create them yourself. Ensure that symbolic links exist from the plain name to the versioned name.
For example,
# cd /usr/lib64
# ln -s libnetsnmpmibs.so.5.1.2 libnetsnmpmibs.so
Repeat this process for all the libraries listed in this topic.
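A small shell sketch that creates any missing plain-name links in one pass (it assumes the versioned
libraries follow the *.so.5.1.2 pattern shown above; adjust the version for your installation):

cd /usr/lib64
for f in libnetsnmp*.so.5.1.2; do
  ln -s "$f" "${f%.so.*}.so"   # for example, libnetsnmpmibs.so.5.1.2 -> libnetsnmpmibs.so
done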
Note: For possible Linux platform and Net-SNMP version compatibility restrictions, see the IBM Storage
Scale FAQ in IBM Documentation.
Related concepts
Configuring Net-SNMP
The GPFS subagent process connects to the Net-SNMP master agent, snmpd.
Configuring management applications
To configure any SNMP-based management applications, such as Tivoli NetView or Tivoli Netcool,
or others, you must make the GPFS MIB file available on the processor on which the management
application runs.
Installing MIB files on the collector node and management node
The GPFS management information base (MIB) file is found on the collector node in the /usr/lpp/mmfs/
data directory with the name GPFS-MIB.txt.
Collector node administration
Collector node administration includes: assigning, unassigning, and changing collector nodes. You can
also see if a collector node is defined.
Starting and stopping the SNMP subagent
The SNMP subagent is started and stopped automatically.
The management and monitoring subagent
The GPFS SNMP management and monitoring subagent runs under an SNMP master agent such as
Net-SNMP. It handles a portion of the SNMP OID space.
Configuring Net-SNMP
The GPFS subagent process connects to the Net-SNMP master agent, snmpd.
The following entries are required in the snmpd configuration file on the collector node (usually, /etc/
snmp/snmpd.conf):
master agentx
AgentXSocket tcp:localhost:705
trap2sink managementhost
where:
managementhost
Is the host name or IP address of the host to which you want SNMP traps sent.
If your GPFS cluster has a large number of nodes or a large number of file systems for which information
must be collected, you must increase the timeout and retry parameters for communication between the
SNMP master agent and the GPFS subagent to allow time for the volume of information to be transmitted.
The snmpd configuration file entries for this are:
agentXTimeout 60
agentXRetries 10
mibdirs +/usr/lpp/mmfs/data
2. Add the following entry to the snmp.conf file (usually found in the /etc/snmp directory):
mibs +GPFS-MIB
3. You might need to restart the SNMP management application. Other steps might be necessary to make
the GPFS MIB available to your management application.
Important: If the GPFS MIB is not available to the management application, add the following
entries in the snmpd.conf file:
####
# Third, create a view for us to let the group have rights to:
# Make at least snmpwalk -v 1 localhost -c public system fast again.
# name incl/excl subtree mask(optional)
view systemview included .1.3.6.1.2.1.1
view systemview included .1.3.6.1.2.1.25.1.1
view ibm included .1.3.6.1.4.1.2
####
# Finally, grant the group read-only access to the systemview view.
# group context sec.model sec.level prefix read write notif
#access notConfigGroup "" any noauth exact systemview none none
access notConfigGroup "" any noauth exact ibm none none
Related concepts
Installing Net-SNMP
The SNMP subagent runs on the collector node of the GPFS cluster. The collector node is designated by
the system administrator.
Configuring Net-SNMP
The GPFS subagent process connects to the Net-SNMP master agent, snmpd.
Configuring management applications
To configure any SNMP-based management applications, such as Tivoli NetView or Tivoli Netcool,
or others, you must make the GPFS MIB file available on the processor on which the management
application runs.
Installing MIB files on the collector node and management node
The GPFS management information base (MIB) file is found on the collector node in the /usr/lpp/mmfs/
data directory with the name GPFS-MIB.txt.
Starting and stopping the SNMP subagent
The SNMP subagent is started and stopped automatically.
The management and monitoring subagent
MIB objects
Important:
• CES Swift Object protocol feature is not supported from IBM Storage Scale 5.1.9 onwards.
• IBM Storage Scale 5.1.8 is the last release that has CES Swift Object protocol.
• IBM Storage Scale 5.1.9 will tolerate the update of a CES node from IBM Storage Scale 5.1.8.
– Tolerate means:
- The CES node will be updated to 5.1.9.
- Swift Object support will not be updated as part of the 5.1.9 update.
- You may continue to use the version of Swift Object protocol that was provided in IBM Storage
Scale 5.1.8 on the CES 5.1.9 node.
Net-SNMP traps
Traps provide asynchronous notification to the SNMP application when a particular event has been
triggered in GPFS. The following table lists the defined trap types:
mmcallhome run SendFile --file file [--desc DESC | --pmr {xxxxx.yyy.zzz | TSxxxxxxxxx}]
Discuss this procedure with IBM Support before using it.
You can also use the following command to find the exact location of the uploaded packages:
• To view the details of the status of the call home tasks, issue the following command:
Group Task Start Time Updated Time Status RC or Step Package File Name Original Filename
-----------------------------------------------------------------------------------------------------------------------------------------
autoGroup_1 daily 20181203105943.289 20181203110008 success RC=0 /tmp/mmfs/callhome/rsENUploaded/
13445038716695.5_0_3_0.123456...
autoGroup_1.gat_daily.g_daily.
scale.20181203105943289.cl0.DC
autoGroup_1 weekly 20181209031101.186 20181209031122 success RC=0 /tmp/mmfs/callhome/rsENUploaded/
13445038716695.5_0_3_0.123456...
autoGroup_1.gat_weekly.g_weekly.
scale.20181209031101186.cl0.DC
autoGroup_1 sendfile 20181203105920.936 20181203105928 success RC=0 /tmp/mmfs/callhome/rsENUploaded/ /root/stanza.txt
13445038716695.5_0_3_0.123456...
autoGroup_1.NoText.s_file.scale.
20181203105920936.cl0.DC
autoGroup_1 sendfile 20181203110130.732 20181203110138 success RC=0 /tmp/mmfs/callhome/rsENUploaded/ /root/anaconda-ks.cfgg
13445038716695.5_0_3_0.123456...
autoGroup_1.NoText.s_file.scale.
20181203110130732.cl0.DC
• To list the registered tasks for gather-send, issue the following command:
Files > File Systems > View Details > Remote Nodes
Provides the details of the remote cluster nodes where the local file system is mounted.
Files > Filesets The Remote Fileset column in the filesets grid
shows whether the fileset belongs to a remote file
system.
The fileset table also displays the same level of
details for both remote and local filesets. For
example, capacity, parent file system, inodes, AFM
role, snapshots, and so on.
Files > Active File Management When remote monitoring is enabled, you can view
the following AFM details:
• On home and secondary, you can see the AFM
relationships configuration, health status, and
performance values of the Cache and Disaster
Recovery grids.
• On the Overview tab of the detailed view, the
available home and secondary inodes are
displayed.
• On the Overview tab of the detailed view, details
such as NFS throughput, IOPS, and latency are
displayed, if the protocol is NFS.
Files > Quotas When remote monitoring is enabled, you can view
quota limits, capacity and inode information for
users, groups and filesets of a file system mounted
from a remote cluster. The user and group name
resolution of the remote cluster is used in this
view. It is not possible to change quota limits on a
file system that is mounted from a remote cluster.
Cluster > Remote Connections Provides the following options:
• Send a connection request to a remote cluster.
• Grant or reject the connection requests received
from remote clusters.
• View the details of the remote clusters that are
connected to the local cluster.
tail -f /<path>/<to>/<audit>/<fileset>/auditLogFile.latest*
for i in {1..10};do touch /<path>/<to>/<audited_device>/file$i;done
There are no active error events for the component FILEAUDITLOG on this node (ibmnode1.ibm.com).
Use the following command to view more details about the producers on a per node basis:
auditp_ok device0 INFO 2018-04-09 15:28:27 Event producer for file system device0 is ok.
auditp_ok device1 INFO 2018-04-09 15:31:26 Event producer for file system device1 is ok.
Use the following command to view the status for the entire cluster:
Use the following command to view more details about each file system that has file audit logging
enabled:
import json
import sys

# Read the audit log file line by line and display the relevant fields of
# each event: event type, event time, file owner, and file path.
# The audit log file is passed as the first command-line argument
# (the file handling is an assumption; the original example omitted it).
fn = open(sys.argv[1])

i = 0
for line in fn:
    obj = json.loads(line)
    if i == 0:
        # Print the header once, before the first event
        print("\n{:10} {:26} {:6} {}".format("Event", "Event-time", "Owner", "Path"))
        print('-' * 93)
    print("{:10} {:26} {:6} {}".format(obj["event"], obj["eventTime"],
                                       obj["ownerUserId"], obj["path"]))
    i = i + 1
fn.close()
Note: There are many open source JSON parsers. There are currently no restrictions on using other
parsing programs.
3. List the security context of the audit log fileset to verify whether it is set correctly by issuing the
following command:
ls -laZ /ibm/fs0
This command displays the current state of the watch folder components on the defined node.
To see the status across the cluster, use the mmhealth cluster show command:
You can then see a more verbose status of the health with the mmhealth node show watchfolder
-v command:
# mmhealth node show watchfolder -v
To get a more verbose status for a specific watch ID, use the mmwatch <device> status --watch-
id <watchID> -v command. This command shows you the status of that single watch ID and lists up to
10 of the most recent entries from the local node system health database.
There are no active error events for the LOCALCACHE component on this node (nodexyz.example.net).
Use the following command to display the status of LROC by using --verbose:
Use the following command to display the health status of all monitored LROC components in the cluster:
# mmdiag --lroc
Inode objects stored 312454 (1220 MB) recalled 157366 (614 MB) = 50.36 %
Inode objects queried 0 (0 MB) = 0.00 % invalidated 157460 (615 MB)
Inode objects failed to store 6 failed to recall 0 failed to query 0 failed to inval 0
Data objects stored 57412 (188807 MB) recalled 10150641 (40597918 MB) = 17680.35 %
Data objects queried 1 (0 MB) = 100.00 % invalidated 57612 (228070 MB)
Data objects failed to store 30 failed to recall 407 failed to query 0 failed to inval 0
inval no recall 54307
# mmdiag --iohist
For more information, see mmdiag command topic in IBM Storage Scale: Command and Programming
Reference Guide.
For more information, see mmcachectl command topic in IBM Storage Scale: Command and Programming
Reference Guide.
0x002
Specifies that temporary files are to be preserved for later analysis.
0x004
Specifies that all dsmc command output is to be mirrored to STDOUT.
The -d option in the mmbackup command line is equivalent to DEBUGmmbackup=1.
3. To troubleshoot problems with backup subtask execution, enable debugging in the tsbuhelper
program.
Use the DEBUGtsbuhelper environment variable to enable debugging features in the mmbackup
helper program tsbuhelper.
Note: Available services, telephone numbers, and web links are subject to change without notice.
Before you call
Make sure that you have taken steps to try to solve the problem yourself before you call. Some
suggestions for resolving the problem before calling IBM Support include:
• Check all hardware for issues beforehand.
• Use the troubleshooting information in your system documentation. The troubleshooting section of the
IBM Documentation contains procedures to help you diagnose problems.
To check for technical information, hints, tips, and new device drivers or to submit a request for
information, go to the IBM Storage Scale support website.
Using the documentation
Information about your IBM storage system is available in the documentation that comes with the
product. That documentation includes printed documents, online documents, readme files, and help files
in addition to the IBM Documentation. See the troubleshooting information for diagnostic instructions. To
access this information, go to IBM Storage Scale support website, and follow the instructions. The entire
product documentation is available at IBM Storage Scale documentation.
lslv -l gpfs44lv
Output is similar to this, with the physical volume name in column one.
gpfs44lv:N/A
PV COPIES IN BAND DISTRIBUTION
hdisk8 537:000:000 100% 108:107:107:107:108
In this example, k164n04 and k164n05 are quorum nodes and k164n06 is a non-quorum node.
To change the quorum status of a node, use the mmchnode command. GPFS does not have to be stopped
to change one quorum node to nonquorum. If you are changing more than one node at the same
time, GPFS must be down on all the affected nodes. GPFS does not have to be stopped when changing
nonquorum nodes to quorum nodes, nor does it need to be stopped on nodes that are not affected.
For example, to make k164n05 a non-quorum node, and k164n06 a quorum node, issue these
commands:
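A sketch of the corresponding invocations (mmchnode supports the --nonquorum and --quorum
designations):

mmchnode --nonquorum -N k164n05
mmchnode --quorum -N k164n06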
To set a node's quorum designation at the time that it is added to the cluster, see mmcrcluster or
mmaddnode command in IBM Storage Scale: Command and Programming Reference Guide.
mmchconfig dataStructureDump=/name_of_some_other_big_file_system
Note: This state information (possibly large amounts of data in the form of GPFS dumps and traces) can
be dumped automatically as part of the first failure data capture mechanisms of GPFS, and can accumulate
in the directory that is defined by the dataStructureDump configuration parameter (/tmp/mmfs by
default). It is recommended that a cron job (such as /etc/cron.daily/tmpwatch) be used to
remove dataStructureDump directory data that is older than two weeks, and that such data be collected
(for example, via gpfs.snap) within two weeks of encountering any problem that requires investigation.
Note: You must not remove the contents of the callhome subdirectory in dataStructureDump. For
example, /tmp/mmfs/callhome. Call Home automatically ensures that it does not take up too much
space in the dataStructureDump directory. If you remove the call home files, the copies of the
latest uploads are prematurely removed, which reduces the usability of the mmcallhome command. For
example, mmcallhome status diff.
If indexing GPFS file systems is desired, only one node should run the updatedb command and build the
database in a GPFS file system. If the database is built within a GPFS file system, then it is visible on all
nodes after one node finishes building it.
Why does the offline mmfsck command fail with "Error creating internal
storage"?
The mmfsck command runs on the file system manager and stores internal data there during a file system
scan. The command fails if GPFS cannot provide a temporary file of the required size.
The mmfsck command requires some temporary space on the file system manager for storing internal
data during a file system scan. The internal data is placed in the directory specified by the mmfsck
-t command line parameter (/tmp by default). The amount of temporary space that is needed is
proportional to the number of inodes (used and unused) in the file system that is being scanned. If
GPFS is unable to create a temporary file of the required size, then the mmfsck command fails with the
following error message:
YYYY-MM-DD_hh:mm:ss.sss±hhmm
where
YYYY-MM-DD
Is the year, month, and day.
_
Is a separator character.
2016-05-09_15:12:20.603-0500
2016-08-15_07:04:33.078+0200
Logs
This topic describes various logs that are generated in IBM Storage Scale.
GPFS logs
The GPFS log is a repository of error conditions that are detected on each node, as well as operational
events such as file system mounts. The GPFS log is the first place to look when you start debugging the
abnormal events. As GPFS is a cluster file system, events that occur on one node might affect system
behavior on other nodes and all GPFS logs can have relevant data.
The GPFS log can be found in the /var/adm/ras directory on each node. The GPFS log file is named
mmfs.log.date.nodeName, where date is the time stamp when the instance of GPFS started on the
node and nodeName is the name of the node. The latest GPFS log file can be found by using the symbolic
file name /var/adm/ras/mmfs.log.latest.
The GPFS log from the prior startup of GPFS can be found by using the symbolic file
name /var/adm/ras/mmfs.log.previous. All other files have a time stamp and node name
appended to the file name.
At GPFS startup, log files that are not accessed during the last 10 days are deleted. If you want to save old
log files, then copy them elsewhere.
Many GPFS log messages can be sent to syslog on Linux. The systemLogLevel attribute of the
mmchconfig command determines the GPFS log messages to be sent to the syslog. For more
information, see the mmchconfig command in the IBM Storage Scale: Command and Programming
Reference Guide.
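For example, a minimal sketch (the level notice is illustrative; pick the level appropriate for your site):

mmchconfig systemLogLevel=notice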
This example shows normal operational messages that appear in the GPFS log file on a Linux node:
The mmcommon logRotate command can be used to rotate the GPFS log without shutting down
and restarting the daemon. After the mmcommon logRotate command is issued, /var/adm/ras/
mmfs.log.previous contains the messages that occurred since the previous startup of GPFS or the
last run of the mmcommon logRotate command. The /var/adm/ras/mmfs.log.latest file starts
over at the point in time that the mmcommon logRotate command was run.
Depending on the size and complexity of your system configuration, the amount of time to start GPFS
varies. If you cannot access a file system that is mounted, then examine the log file for error messages.
If the --gather-logs option is not available on your system, you can create your own script to
achieve the same task; use /usr/lpp/mmfs/samples/gatherlogs.samples.sh as an example.
Configuring syslog
On Linux operating systems, syslog typically is enabled by default. On AIX, syslog must be set up and
configured. See the corresponding operating system documentation for details.
Message format
For security, sensitive information such as a password is replaced with asterisks (*) in the audit message.
Audit messages are sent to syslog with an identity of mmfs, a facility code of user, and a severity level of
informational. For more information about the meaning of these terms, see the syslog documentation.
The format of the message depends on the source of the GPFS command:
• Messages about GPFS commands that are entered at the command line have the following format:

CLI user_name user_name [AUDIT_TYPE1, AUDIT_TYPE2] 'command' RC=return_code

where:
CLI
The source of the command. Indicates that the command was entered from the command line.
user_name user_name
The name of the user who entered the command, such as root. The same name appears twice.
AUDIT_TYPE1
The point in the command when the message was sent to syslog. Always EXIT.
AUDIT_TYPE2
The action taken by the command. Always CHANGE.
command
The text of the command.
return_code
The return code of the GPFS command.
• Messages about GPFS commands that are issued by GUI commands have a similar format:

GUI-CLI user_name GUI_user_name [AUDIT_TYPE1, AUDIT_TYPE2] 'command' RC=return_code

where:
GUI-CLI
The source of the command. Indicates that the command was called by a GUI command.
user_name
The name of the user, such as root.
GUI_user_name
The name of the user who logged on to the GUI.
The remaining fields are the same as in the CLI message.
The following lines are examples from a syslog:
Apr 24 13:56:26 c12c3apv12 mmfs[63655]: CLI root root [EXIT, CHANGE] 'mmchconfig
autoload=yes' RC=0
Apr 24 13:58:42 c12c3apv12 mmfs[65315]: CLI root root [EXIT, CHANGE] 'mmchconfig
deadlockBreakupDelay=300' RC=0
Apr 24 14:04:47 c12c3apv12 mmfs[67384]: CLI root root [EXIT, CHANGE] 'mmchconfig
FIPS1402mode=no' RC=0
The following lines are examples from a syslog where GUI is the originator:
Apr 24 13:56:26 c12c3apv12 mmfs[63655]: GUI-CLI root admin [EXIT, CHANGE] 'mmchconfig
autoload=yes' RC=0
SMB logs
The SMB services write the most important messages to syslog.
The SMB service in IBM Storage Scale writes its log messages into the syslog of the CES nodes. It
therefore needs a working syslog daemon and configuration. An SMB snap expects the syslog on CES
nodes to be found in a file in the distribution's default paths. If syslog is redirected to another location,
provide those logs when working with IBM Support.
With the standard syslog configuration, you can search for the terms such as ctdbd or smbd in
the /var/log/messages file to see the relevant logs. For example:
grep ctdbd /var/log/messages
The system displays output similar to the following example:
May 31 09:11:23 prt002st001 ctdbd: Updated hot key database=locking.tdb key=0x2795c3b1 id=0 hop_count=1
May 31 09:27:33 prt002st001 ctdbd: Updated hot key database=smbXsrv_open_global.tdb key=0x0d0d4abe id=0 hop_count=1
May 31 09:37:17 prt002st001 ctdbd: Updated hot key database=brlock.tdb key=0xc37fe57c id=0 hop_count=1
NFS logs
The clustered export services (CES) NFS server writes log messages in the /var/log/ganesha.log file
at runtime.
The operating system's log rotation facility is used to manage NFS logs. The NFS logs are configured and
enabled during the installation of the NFS server packages.
The following example shows a sample log file:
# tail -f /var/log/ganesha.log
2018-04-09 11:28:18 : epoch 000100a2 : rh424a : gpfs.ganesha.nfsd-20924[main]
nfs_Init_admin_thread :NFS CB
Log levels can be displayed by using the mmnfs config list | grep LOG_LEVEL command. For
example:
LOG_LEVEL: EVENT
By default, the log level is EVENT. Additionally, the following NFS log levels can also be used; starting from
lowest to highest verbosity:
• FATAL
• MAJ
• CRIT
• WARN
• EVENT
• INFO
• DEBUG
• MID_DEBUG
• FULL_DEBUG
Note: The FULL_DEBUG level increases the size of the log file. Use it in production only if
instructed by IBM Support.
Increasing the verbosity of the NFS server log impacts the overall NFS I/O performance.
To change the logging to the verbose log level INFO, use the following command:
mmnfs config change LOG_LEVEL=INFO
The system displays output similar to the following example:
NFS Configuration successfully changed. NFS server restarted on all NFS nodes on which NFS
server is running.
This change is cluster-wide and restarts all NFS instances to activate this setting. The log file now displays
more informational messages, for example:
2015-06-03 12:49:31 : epoch 556edba9 : cluster1.ibm.com : ganesha.nfsd-21582[main] nfs_rpc_dispatch_threads
:THREAD :INFO :5 rpc dispatcher threads were started successfully
2015-06-03 12:49:31 : epoch 556edba9 : cluster1.ibm.com : ganesha.nfsd-21582[disp] rpc_dispatcher_thread
# cat /etc/logrotate.d/ganesha
/var/log/ganesha.log {
size 10M
rotate 52
copytruncate
dateformat -%Y%m%d%H%M%S
compress
missingok
}
2. Add the logrotate service to the crontab by issuing the following command:
# crontab -u <user> -e
For example, to add the logrotate service for a root user and set it to run every 5 minutes:
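A sketch of such a crontab entry (the path to the logrotate binary is an assumption; verify it for your
distribution):

*/5 * * * * /usr/sbin/logrotate /etc/logrotate.d/ganesha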
After these steps, logrotate is configured to rotate the nfs-ganesha log file based on the specified
size, and the logrotate service is scheduled to run at a desired frequency.
To display the currently configured CES log level, use the following command:
mmces log level
The system displays output similar to the following example:
The log file is /var/adm/ras/mmfs.log.latest. By default, the log level is 0 and other possible values
are 1, 2, and 3. To increase the log level, use the following command:
mmces log level 1
Object logs
There are a number of locations where messages are logged with the object protocol.
Important:
• CES Swift Object protocol feature is not supported from IBM Storage Scale 5.1.9 onwards.
• IBM Storage Scale 5.1.8 is the last release that has CES Swift Object protocol.
• IBM Storage Scale 5.1.9 will tolerate the update of a CES node from IBM Storage Scale 5.1.8.
– Tolerate means:
- The CES node will be updated to 5.1.9.
- Swift Object support will not be updated as part of the 5.1.9 update.
- You may continue to use the version of Swift Object protocol that was provided in IBM Storage
Scale 5.1.8 on the CES 5.1.9 node.
- IBM will provide usage and known defect support for the version of Swift Object that was provided
in IBM Storage Scale 5.1.8 until you migrate to a supported object solution that IBM Storage Scale
provides.
• Please contact IBM for further details and migration planning.
The core object services (proxy, account, container, and object server) have their own logging levels set
in their respective configuration files. By default, unified file and object access logging is set to show
messages at or beyond the ERROR level, but it can be changed to the INFO or DEBUG level if more detailed
logging information is needed.
By default, the messages logged by these services are saved in the /var/log/swift directory.
You can also configure these services to use separate syslog facilities by using the log_facility
parameter in one or all of the object service configuration files and by updating the rsyslog
configuration. These parameters are described in the Swift Deployment Guide (docs.openstack.org/
developer/swift/deployment_guide.html) that is available in the OpenStack documentation.
An example of how to set up this configuration can be found in the SAIO - Swift All In
One documentation (docs.openstack.org/developer/swift/development_saio.html#optional-setting-up-
rsyslog-for-individual-logging) that is available in the OpenStack documentation.
Note: To configure rsyslog for unique log facilities in the protocol nodes, the administrator needs to make
sure that the manual steps mentioned in the preceding link are carried out on each of those protocol
nodes.
The Keystone authentication service writes its logging messages to /var/log/keystone/
keystone.log file. By default, Keystone logging is set to show messages at or beyond the WARNING
level.
For information on how to view or change log levels on any of the object-related services, see the “CES
tracing and debug data collection” on page 287 section.
The following commands can be used to determine the health of object services:
• To see whether there are any nodes in an active (failed) state, run the following command:
mmces state cluster OBJ
The system displays output similar to the following output:
• To retrieve the OBJ-related event entries, query the monitor client and grep for the name of the
component that you want to filter on. The component is object, proxy, account, container, Keystone, or
postgres. To see proxy-server related events, run the following command:
• To check the monitor log, grep for the component you want to filter on, either object, proxy, account,
container, keystone or postgres. For example, to see object-server related log messages:
grep object /var/adm/ras/mmsysmonitor.log | head -n 10
The system displays output similar to the following output:
2015-06-03T13:59:28.805-08:00 util5.sonasad.almaden.ibm.com D:522632:Thread-9:object:OBJ running command
'systemctl status openstack-swift-proxy'
2015-06-03T13:59:28.916-08:00 util5.sonasad.almaden.ibm.com D:522632:Thread-9:object:OBJ command result
ret:3 sout:openstack-swift-proxy.service - OpenStack Object Storage (swift) - Proxy Server
2015-06-03T13:59:28.916-08:00 util5.sonasad.almaden.ibm.com I:522632:Thread-9:object:OBJ openstack-swift-proxy is not
started, ret3
2015-06-03T13:59:28.916-08:00 util5.sonasad.almaden.ibm.com D:522632:Thread-9:object:OBJProcessMonitor openstack-swift-proxy
failed:
2015-06-03T13:59:28.916-08:00 util5.sonasad.almaden.ibm.com D:522632:Thread-9:object:OBJProcessMonitor memcached started
2015-06-03T13:59:28.917-08:00 util5.sonasad.almaden.ibm.com D:522632:Thread-9:object:OBJ running command
'systemctl status memcached'
2015-06-03T13:59:29.018-08:00 util5.sonasad.almaden.ibm.com D:522632:Thread-9:object:OBJ command result
ret:0 sout:memcached.service - Memcached
2015-06-03T13:59:29.018-08:00 util5.sonasad.almaden.ibm.com I:522632:Thread-9:object:OBJ memcached is started and active
running
2015-06-03T13:59:29.018-08:00 util5.sonasad.almaden.ibm.com D:522632:Thread-9:object:OBJProcessMonitor memcached succeeded
2015-06-03T13:59:29.018-08:00 util5.sonasad.almaden.ibm.com I:522632:Thread-9:object:OBJ service started checks
after monitor loop, event count:6
The following tables list the IBM Storage Scale for object storage log files.
Winbind logs
The winbind services write the most important messages to syslog.
When using Active Directory, the most important messages are written to syslog, similar to the logs in
SMB protocol. For example:
grep winbindd /var/log/messages
The system displays output similar to the following example:
Jun 3 12:04:34 prt001st001 winbindd[14656]: [2015/06/03 12:04:34.271459, 0] ../lib/util/become_daemon.c:124(daemon_ready)
Jun 3 12:04:34 prt001st001 winbindd[14656]: STATUS=daemon 'winbindd' finished starting up and ready to serve connections
To capture debug traces for Active Directory authentication, use mmprotocoltrace command for the
winbind component. To start the tracing of winbind component, issue this command:
mmprotocoltrace start winbind
After performing all the steps relevant for the trace, issue this command to stop tracing the winbind
component and collect tracing data from all participating nodes:
mmprotocoltrace stop winbind
Note: There must be only one active trace. If you start multiple traces, you may need to remove the
previous data by using the mmprotocoltrace clear winbind command.
Related concepts
“Determining the health of integrated SMB server” on page 452
log4j.logger.org.apache.hadoop.yarn=DEBUG
log4j.logger.org.apache.hadoop.hdfs=DEBUG
log4j.logger.org.apache.hadoop.gpfs=DEBUG
log4j.logger.org.apache.hadoop.security=DEBUG
Note: Some of the authentication modules like keystone services log information also in /var/log/
messages.
You can remove the IP addresses and groups, and add them again after the upgrade. You can resolve this
issue by running the following steps:
1. Remove all the CES IP addresses and then add them again.
Then, the cidrPool entry is created.
2. Run the mmlsconfig command to verify whether the IP address entry exists or not.
3. Run the following command to remove the group:
errpt -a
You can also grep the appropriate filename where syslog messages are redirected to. For example, in
Ubuntu, after the Natty release, this file is at /var/log/syslog.
On Windows, use the Event Viewer and look for events with a source label of GPFS in the Application
event category.
On Linux, syslog might include GPFS log messages and the error logs described in this section. The
systemLogLevel attribute of the mmchconfig command controls which GPFS log messages are sent
to syslog. It is recommended that some kind of monitoring for GPFS log messages be implemented,
particularly MMFS_FSSTRUCT errors. For more information, see the mmchconfig command in the IBM
Storage Scale: Command and Programming Reference Guide.
The error log contains information about several classes of events or errors. These classes are:
• “MMFS_ABNORMAL_SHUTDOWN” on page 275
• “MMFS_DISKFAIL” on page 275
MMFS_ABNORMAL_SHUTDOWN
The MMFS_ABNORMAL_SHUTDOWN error log entry means that GPFS has determined that it must shut down
all operations on this node because of a problem. Insufficient memory on the node to handle critical
recovery situations can cause this error. In general, there are other error log entries from GPFS or some
other component associated with this error log entry.
MMFS_DISKFAIL
This topic describes the MMFS_DISKFAIL error log available in IBM Storage Scale.
The MMFS_DISKFAIL error log entry indicates that GPFS has detected the failure of a disk and forced the
disk to the stopped state. This is ordinarily not a GPFS error but a failure in the disk subsystem or the path
to the disk subsystem.
MMFS_ENVIRON
This topic describes the MMFS_ENVIRON error log available in IBM Storage Scale.
MMFS_ENVIRON error log entry records are associated with other records of the MMFS_GENERIC or
MMFS_SYSTEM_UNMOUNT types. They indicate that the root cause of the error is external to GPFS and
usually in the network that supports GPFS. Check the network and its physical connections. The data
portion of this record supplies the return code provided by the communications code.
MMFS_FSSTRUCT
This topic describes the MMFS_FSSTRUCT error log available in IBM Storage Scale.
The MMFS_FSSTRUCT error log entry indicates that GPFS has detected a problem with the on-disk
structure of the file system. The severity of these errors depends on the exact nature of the inconsistent
data structure. If it is limited to a single file, then EIO errors are reported to the application and operation
continues. If the inconsistency affects vital metadata structures, then operation ceases on this file
system. These errors are often associated with an MMFS_SYSTEM_UNMOUNT error log entry and probably
occur on all nodes. If the error occurs on all nodes, some critical piece of the file system is inconsistent.
This can occur as a result of a GPFS error or an error in the disk system.
Note: When the mmhealth command displays an fsstruct error, the command prompts you to run a file
system check. When the problem is resolved, issue the following command to clear the fsstruct error
from the mmhealth command. You must specify the file system name twice:
If the file system is severely damaged, then the best course of action is to follow the procedures in
“Additional information to collect for file system corruption or MMFS_FSSTRUCT errors” on page 556, and
then contact the IBM Support Center.
MMFS_GENERIC
This topic describes MMFS_GENERIC error logs available in IBM Storage Scale.
The MMFS_GENERIC error log entry means that GPFS self diagnostics have detected an internal error,
or that additional information is being provided with an MMFS_SYSTEM_UNMOUNT report. If the record is
associated with an MMFS_SYSTEM_UNMOUNT report, the event code fields in the records are the same.
The error code and return code fields might describe the error. See “Messages” on page 728 for a listing
of codes generated by GPFS.
MMFS_LONGDISKIO
This topic describes the MMFS_LONGDISKIO error log available in IBM Storage Scale.
The MMFS_LONGDISKIO error log entry indicates that GPFS is experiencing very long response time for
disk requests. This is a warning message and can indicate that your disk system is overloaded or that a
failing disk is requiring many I/O retries. Follow your operating system's instructions for monitoring the
performance of your I/O subsystem on this node and on any disk server nodes that might be involved. The
data portion of this error record specifies the disk involved. There might be related error log entries from
the disk subsystems that point to the actual cause of the problem. If the disk is attached to an AIX node,
refer to AIX in IBM Documentation and search for performance management. To enable or disable, use the
mmchfs -w command. For more details, contact the IBM Support Center.
The mmpmon command can be used to analyze I/O performance on a per-node basis. For more
information, see “Monitoring I/O performance with the mmpmon command” on page 59 and “Failures
using the mmpmon command” on page 498.
MMFS_QUOTA
This topic describes the MMFS_QUOTA error log available in IBM Storage Scale.
The MMFS_QUOTA error log entry is used when GPFS detects a problem in the handling of quota
information. This entry is created when the quota manager has a problem reading or writing the quota file.
If the quota manager cannot read all entries in the quota file when mounting a file system with quotas
enabled, the quota manager shuts down but file system manager initialization continues. Mounts do not
succeed and return an appropriate error message (see “File system forced unmount” on page 382).
Quota accounting depends on a consistent mapping between user names and their numeric identifiers.
This means that a single user accessing a quota enabled file system from different nodes should map
to the same numeric user identifier from each node. Within a local cluster this is usually achieved by
ensuring that /etc/passwd and /etc/group are identical across the cluster.
When accessing quota enabled file systems from other clusters, you need to either ensure individual
accessing users have equivalent entries in /etc/passwd and /etc/group, or use the user identity mapping
facility as outlined in the IBM white paper UID Mapping for GPFS in a Multi-cluster Environment (https://
www.ibm.com/docs/en/storage-scale?topic=STXKQY/uid_gpfs.pdf).
It might be necessary to run an offline quota check (mmcheckquota command) to repair or recreate the
quota file. If the quota file is corrupted, then the mmcheckquota command does not restore it. The file
must be restored from the backup copy. If there is no backup copy, an empty file can be set as the new
quota file. This is equivalent to recreating the quota file. To set an empty file or use the backup file, issue
the mmcheckquota command with the appropriate operand:
• -u UserQuotaFilename for the user quota file
• -g GroupQuotaFilename for the group quota file
• -j FilesetQuotaFilename for the fileset quota file
After replacing the appropriate quota file, reissue the mmcheckquota command to check the file system
inode and space usage.
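A sketch of the sequence (the backup quota file name userQuotaBackup and the device fs1 are
illustrative):

mmcheckquota -u userQuotaBackup fs1
mmcheckquota fs1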
MMFS_SYSTEM_UNMOUNT
This topic describes the MMFS_SYSTEM_UNMOUNT error log available in IBM Storage Scale.
The MMFS_SYSTEM_UNMOUNT error log entry means that GPFS has discovered a condition that might
result in data corruption if operation with this file system continues from this node. GPFS has marked
the file system as disconnected and applications accessing files within the file system receives ESTALE
errors. This can be the result of:
• The loss of a path to all disks containing a critical data structure.
If you are using SAN attachment of your storage, consult the problem determination guides provided by
your SAN switch vendor and your storage subsystem vendor.
• An internal processing error within the file system.
See “File system forced unmount” on page 382. Follow the problem determination and repair actions
specified.
MMFS_SYSTEM_WARNING
This topic describes the MMFS_SYSTEM_WARNING error log available in IBM Storage Scale.
The MMFS_SYSTEM_WARNING error log entry means that GPFS has detected a system-level value
approaching its maximum limit. This might occur as a result of the number of inodes (files) reaching
its limit. If so, issue the mmchfs command to increase the number of inodes for the file system so that at
least 5% are free.
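For example, a minimal sketch with the documented --inode-limit option (the device fs1 and the new
limit are illustrative):

mmchfs fs1 --inode-limit 3000000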
LABEL: MMFS_SYSTEM_UNMOUNT
IDENTIFIER: C954F85D
Description
STORAGE SUBSYSTEM FAILURE
Probable Causes
STORAGE SUBSYSTEM
COMMUNICATIONS SUBSYSTEM
Failure Causes
STORAGE SUBSYSTEM
COMMUNICATIONS SUBSYSTEM
Recommended Actions
CONTACT APPROPRIATE SERVICE REPRESENTATIVE
Detail Data
EVENT CODE
15558007
STATUS CODE
212
VOLUME
gpfsd
With the BASIC option, Transparent cloud tiering service debug information, such as logs, traces, and
Java™ cores, along with minimal system and IBM Storage Scale cluster information, is collected. No
customer-sensitive information is collected.
With the FULL option, extra details, such as the Java heap dump, are collected along with the information
captured with the BASIC option.
Successful invocation of this command generates a new .tar file at a specified location, and the file can be
shared with the IBM support team to debug a field issue.
cephMon = "/opt/IBM/zimon/CephMonProxy"
cephRados = "/opt/IBM/zimon/CephRadosProxy"
colCandidates = "nsd003st001", "nsd004st001"
colRedundancy = 2
collectors = {
host =""
port = "4739"
}
config = "/opt/IBM/zimon/ZIMonSensors.cfg"
ctdbstat = ""
daemonize = T
hostname = ""
ipfixinterface = "0.0.0.0"
logfile = "/var/log/zimon/ZIMonSensors.log"
loglevel = "info"
fs.suid_dumpable = 2
sysctl -p
cat /proc/sys/fs/suid_dumpable
Trace facility
The IBM Storage Scale system includes many different trace points to facilitate rapid problem
determination of failures.
IBM Storage Scale tracing is based on the kernel trace facility on AIX, embedded GPFS trace subsystem
on Linux, and the Windows ETL subsystem on Windows. The level of detail that is gathered by the trace
facility is controlled by setting the trace levels using the mmtracectl command.
The mmtracectl command sets up and enables tracing using default settings for various common
problem situations. Using this command improves the probability of gathering accurate and reliable
problem determination information. For more information about the mmtracectl command, see the IBM
Storage Scale: Command and Programming Reference Guide.
mmchconfig dataStructureDump=path_for_storage_of_dumps
mmtracectl --start
5. The output of the GPFS trace facility is stored in /tmp/mmfs, unless the location was changed using
the mmchconfig command in Step “1” on page 283. Save this output.
6. If the problem results in a shutdown and restart of the GPFS daemon, set the traceRecycle variable
as necessary to start tracing automatically on daemon startup and stop the trace automatically on
daemon shutdown.
If the problem requires more detailed tracing, the IBM Support Center might ask you to modify the GPFS
trace levels. Use the mmtracectl command to establish the required trace classes and levels of tracing.
The syntax to modify trace classes and levels is as follows:
mmtracectl --set --trace={io | all | def | "Class Level [Class Level ...]"}
For example, to tailor the trace level for I/O, issue the following command:
mmtracectl --set --trace=io
Once the trace levels are established, start the tracing by issuing:
mmtracectl --start
After the trace data has been gathered, stop the tracing by issuing:
mmtracectl --stop
To clear the trace settings and make sure tracing is turned off, issue:
mmtracectl --off
Other possible values that can be specified for the trace Class include:
afm
active file management
alloc
disk space allocation
allocmgr
allocation manager
basic
'basic' classes
brl
byte range locks
ccr
cluster configuration repository
cksum
checksum services
cleanup
cleanup routines
cmd
ts commands
defrag
defragmentation
dentry
dentry operations
dentryexit
daemon routine entry/exit
--tracedev-write-mode=blocking specifies that if the trace buffer is full, wait until the trace data
is written to the local disk and the buffer becomes available again to overwrite the old data. This is the
default. --tracedev-write-mode=overwrite specifies that if the trace buffer is full, overwrite the
old data.
Note: Before switching between --tracedev-write-mode=overwrite and --tracedev-write-
mode=blocking, or vice versa, run the mmtracectl --stop command first. Next, run the
mmtracectl --set --tracedev-write-mode command to switch to the desired mode. Finally,
restart tracing with the mmtracectl --start command.
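For example, to switch from the default blocking mode to overwrite mode:
mmtracectl --stop
mmtracectl --set --tracedev-write-mode=overwrite
mmtracectl --start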
For more information about the mmtracectl command, see the IBM Storage Scale: Command and
Programming Reference Guide.
Data collection
To diagnose the cause of an issue, it might be necessary to gather some extra information from the
cluster. This information can then be used to determine the root cause of an issue.
Collection of debugging information, such as configuration files and logs, can be gathered by using
the gpfs.snap command. This command gathers data about GPFS, operating system information, and
information for each of the protocols. The following services can be traced by the gpfs.snap command:
GPFS + OS
GPFS configuration and logs plus operating system information such as network configuration or
connected drives.
NFS
CES NFS configuration and logs.
SMB
SMB and CTDB configuration and logs.
OBJECT
Openstack Swift and Keystone configuration and logs.
AUTHENTICATION
Authentication configuration and logs.
PERFORMANCE
Dump of the performance monitor database.
Information for each of the enabled protocols is gathered automatically when the gpfs.snap command
is run. If any protocol is enabled, then information for CES and authentication is gathered.
To gather performance data, add the --performance option. The --performance option causes
gpfs.snap to try to collect performance information.
Note: Gather the performance data only if necessary, as this process can take up to 30 minutes to run.
If data is only required for one protocol or area, the automatic collection can be bypassed. Provide one
or more of the following options to the --protocol argument: smb,nfs,object,ces,auth,none
If the --protocol command is provided, automatic data collection is disabled. If --protocol
smb,nfs is provided to gpfs.snap, only NFS and SMB information is gathered and no CES or
Authentication data is collected. To disable all protocol data collection, use the argument --protocol
none.
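For example, to gather only SMB and NFS information:
gpfs.snap --protocol smb,nfs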
Types of tracing
Tracing is logging at a high level of detail. The mmprotocoltrace command that is used for starting and stopping
tracing supports SMB, Winbind, Network, and Object tracing.
NFS tracing can be done with a combination of commands.
NFS
NFS tracing is achieved by increasing the log level, repeating the issue, capturing the log file, and then
restoring the log level.
To increase the log level, use the command mmnfs config change LOG_LEVEL=FULL_DEBUG.
The mmnfs config change command restarts the server on all nodes. You can increase the log
level by using ganesha_mgr. This command takes effect without restart, only on the node on which
the command is run.
You can set the log level to the following values: NULL, FATAL, MAJ, CRIT, WARN, EVENT, INFO,
DEBUG, MID_DEBUG, and FULL_DEBUG.
FULL_DEBUG is the most useful for debugging purposes. However, this log level produces a large amount
of data, straining disk usage and affecting performance.
After the issue is re-created by running the gpfs.snap command either with no arguments or with
the --protocol nfs argument, the NFS logs are captured. The logs can then be used to diagnose
any issues.
To return the log level to normal, use the same command but with a lower logging level. The default
value is EVENT.
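For example, a debugging session might raise the log level and later restore the default; both commands restart CES NFS on all nodes:
mmnfs config change LOG_LEVEL=FULL_DEBUG
mmnfs config change LOG_LEVEL=EVENT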
HDFS
CES also supports HDFS protocols. For more information, see CES HDFS troubleshooting under IBM
Storage Scale support for Hadoop in Big data and analytics support documentation.
The response to this command displays the current settings from the trace configuration file. For more
information about this file, see “Trace configuration file” on page 291.
2. Clear the trace records from the previous trace of the same type:
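For example, for an SMB trace:
mmprotocoltrace clear smb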
This command responds with an error message if the previous state of a trace node is something other
than DONE or FAILED. If this error occurs, follow the instructions in the “Resetting the trace system”
on page 292 section.
3. Start the new trace:
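For example, to trace SMB activity from the client IP addresses shown in the sample output that follows (see the mmprotocoltrace command reference for all options):
mmprotocoltrace start smb -c 10.0.100.42,10.0.100.43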
Setting up traces
Trace '5d3f0138-9655-4970-b757-52355ce146ef' created successfully for 'smb'
Waiting for all participating nodes...
Trace ID: 5d3f0138-9655-4970-b757-52355ce146ef
State: ACTIVE
Protocol: smb
Start Time: 09:22:46 24/04/20
End Time: 09:32:46 24/04/20
Trace Location: /tmp/mmfs/smb.20200424_092246.trc
Origin Node: ces5050-41.localnet.com
Client IPs: 10.0.100.42, 10.0.100.43
Syscall: False
Syscall Only: False
Nodes:
Node Name: ces5050-41.localnet.com
State: ACTIVE
Node Name: ces5050-42.localnet.com
State: ACTIVE
Node Name: ces5050-43.localnet.com
State: ACTIVE
If the status of a node is FAILED, the node did not start successfully. Look at the logs for the node
to determine the problem. After you fix the problem, reset the trace system by following the steps in
“Resetting the trace system” on page 292.
4. If all the nodes started successfully, perform the actions that you want to trace. For example, if you are
tracing a client IP address, enter commands that create traffic on that client.
5. Stop the trace:
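For an SMB trace, the stop command takes this form:
mmprotocoltrace stop smb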
Stopping traces
Trace '01239483-be84-wev9-a2d390i9ow02' stopped for smb
Waiting for traces to complete
Waiting for node 'node1'
Waiting for node 'node2'
Finishing trace '01239483-be84-wev9-a2d390i9ow02'
Trace tar file has been written to '/tmp/mmfs/smb.20150513_162322.trc/
smb.trace.20150513_162542.tar.gz'
If you do not stop the trace, it continues until the trace duration expires. For more information, see
“Trace timeout” on page 290.
6. Look in the trace log files for the results of the trace. For more information, see “Trace log files” on
page 290.
Trace timeout
If you do not stop a trace manually, the trace runs until its trace duration expires. The default trace
duration is 10 minutes, but you can set a different value in the mmprotocoltrace command.
Each node that participates in a trace starts a timeout process that is set to the trace duration. When a
timeout occurs, the process checks the trace status. If the trace is active, the process stops the trace,
writes the file location to the log file, and exits. If the trace is not active, the timeout process exits.
If a trace stops because of a timeout, you can use the mmprotocoltrace status command to find the
location of the trace log file.
You cannot proceed with the new trace, unless you have removed the old trace results.
If a trace cannot be stopped and cleared as described, you must perform the following recovery
procedure:
1. Run the mmprotocoltrace clear command in the force mode to clear the trace.
Note: After a forced clear, the trace system might still be in an invalid state.
2. Run the mmprotocoltrace reset command.
The command also performs special actions for each type of trace:
• For an SMB trace, the reset removes any IP-specific configuration files and sets the log level and log
location to the default values.
• For a Network trace, the reset stops all dumpcap processes.
• For an Object trace, the reset sets the log level to the default value. It then sets the log location to the
default location in the rsyslog configuration file, and restarts the rsyslog service.
The following command resets the SMB trace:
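mmprotocoltrace reset smb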
A client might have more than one IP address, as in the following example where the command ip is run
on client ch-44:
In such a case, specify all the possible IP addresses in the mmprotocoltrace command because you
cannot be sure which IP address the client uses. The following example specifies all the IP addresses that
the previous example listed for client ch-44, and by default all CES nodes trace incoming connections
from any of these IP addresses:
Sharing the diagnostic data with the IBM Support using call home
The call home feature shares support information and your contact information with IBM on a scheduled
basis. IBM Support monitors the details that are shared through call home and takes the necessary action
in case of any issues or potential issues. Enabling call home reduces the response time for IBM Support to
address the issues.
You can also manually upload the diagnostic data that is collected through the Support > Diagnostic Data
page in the GUI to share the diagnostic data to resolve a Problem Management Record (PMR). To upload
data manually, perform the following steps:
1. Go to Support > Diagnostic Data.
2. Collect diagnostic data based on the requirement. You can also use the previously collected data for
the upload.
3. Select the relevant data set from the Previously Collected Diagnostic Data section and then right-
click and select Upload to PMR.
4. Select the PMR to which the data must be uploaded and then click Upload.
GUI data
The following data is collected to enable performance monitoring diagnosis:
• The output of these commands:
– pg_dump -U postgres -h 127.0.0.1 -n fscc postgres
– /usr/lpp/mmfs/gui/bin/get_version
– getent passwd scalemgmt
– getent group scalemgmt
– iptables -L -n
– iptables -L -n -t nat
– systemctl kill gpfsgui --signal=3 --kill-who=main # trigger a core dump
– systemctl status gpfsgui
– journalctl _SYSTEMD_UNIT=gpfsgui.service --no-pager -l
• The content of these files:
– /etc/sudoers
– /etc/sysconfig/gpfsgui
– /opt/ibm/wlp/usr/servers/gpfsgui/*.xml
– /var/lib/pgsql/data/*.conf
– /var/lib/pgsql/data/pg_log/*
– /var/lib/mmfs/gui/*
– /var/log/cnlog/*
– /var/crash/scalemgmt/javacore*
– /var/crash/scalemgmt/heapdump*
– /var/crash/scalemgmt/Snap*
– /usr/lpp/mmfs/gui/conf/*
• The output of these commands is collected once for the cluster:
– /usr/lpp/mmfs/lib/ftdc/mmlssnap.sh
• The content of these CCR files is collected once for the cluster:
– _gui.settings
– _gui.user.repo
– _gui.dashboards
– _gui.snapshots
– key-value pair: gui_master_node
Data gathered by gpfs.snap for File audit logging and Watchfolder components
These items are always obtained by the gpfs.snap command when data is gathered for File audit
logging and Watchfolder components:
1. The output of these commands:
• rpm -qi for the gpfs.librdkafka and ganesha packages (dpkg-query is used on Ubuntu)
• mmdiag --eventproducer -Y
• mmwatch all list -Y
• tslspolicy <dev> -L --ptn
2. The contents of these files:
• /var/adm/ras/mmmsgqueue.log
• /var/adm/ras/mmaudit.log
• /var/adm/ras/mmwf.log
• /var/adm/ras/mmwatch.log
• /var/adm/ras/tswatchmonitor.log
• /var/adm/ras/mmwfclient.log
• Watchfolder configuration file (/<Device>/.msgq/.audit/.config)
• File audit logging configuration file (/<Device>/.msgq/<watchID>/.config)
Synopsis
mmdumpperfdata [--remove-tree] [StartTime EndTime | Duration]
Availability
Available on all IBM Storage Scale editions.
Description
The mmdumpperfdata command runs all named queries and computed metrics used in the mmperfmon
query command for each cluster node, writes the output into CSV files, and archives all the files in a
single .tgz file. The file name is in the iss_perfdump_YYYYMMDD_hhmmss.tgz format.
The tar archive file contains a folder for each cluster node and within that folder there is a text file with the
output of each named query and computed metric.
If the start and end time, or duration are not given, then by default the last four hours of metrics
information is collected and archived.
Parameters
--remove-tree or -r
Removes the folder structure that was created for the TAR archive file.
StartTime
Specifies the start timestamp for query in the YYYY-MM-DD[-hh:mm:ss] format.
EndTime
Specifies the end timestamp for query in the YYYY-MM-DD[-hh:mm:ss] format.
Duration
Specifies the duration in seconds.
Exit status
0
Successful completion.
nonzero
A failure has occurred.
Security
You must have root authority to run the mmdumpperfdata command.
The node on which the command is issued must be able to execute remote shell commands on any other
node in the cluster without the use of a password and without producing any extraneous messages.
For more information, see Requirements for administering a GPFS file system in IBM Storage Scale:
Administration Guide.
Examples
1. To archive the performance metric information collected for the default time period of last four hours
and also delete the folder structure that the command creates, issue this command:
mmdumpperfdata --remove-tree
2. To archive the performance metric information collected for a specific time period, issue this
command:
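For example, with placeholder timestamps in the YYYY-MM-DD-hh:mm:ss format:
mmdumpperfdata 2024-03-01-09:00:00 2024-03-01-13:00:00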
3. To archive the performance metric information collected in the last 200 seconds, issue this command:
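mmdumpperfdata 200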
See also
For more information, see mmperfmon command in the IBM Storage Scale: Command and Programming
Reference Guide.
mmfsadm command
The mmfsadm command is intended for use by trained service personnel. IBM suggests you do not run
this command except under the direction of such personnel.
Note: The contents of the mmfsadm command output might vary from release to release, which could
invalidate any user programs that depend on that output. Therefore, IBM suggests that you do not create
user programs that invoke the mmfsadm command.
The mmfsadm command extracts data from GPFS without using locking, so that it can collect the data in
the event of locking errors. In certain rare cases, this can cause GPFS or the node to fail. Several options
of this command exist and might be required for use:
cleanup
Delete shared segments left by a previously failed GPFS daemon without actually restarting the
daemon.
dump what
Dumps the state of a large number of internal state values that might be useful in determining the
sequence of events. The what parameter can be set to all, indicating that all available data should be
collected, or to another value, indicating more restricted collection of data. The output is presented to
STDOUT and should be collected by redirecting STDOUT. For more information about internal GPFS™
states, see the mmdiag command in IBM Storage Scale: Command and Programming Reference Guide.
showtrace
Shows the current level for each subclass of tracing available in GPFS. Trace level 14 provides the
highest level of tracing for the class and trace level 0 provides no tracing. Intermediate values exist for
most classes. More tracing requires more storage and results in a higher probability of overlaying the
required event.
trace class n
Sets the trace class to the value specified by n. Actual trace gathering only occurs when the
mmtracectl command has been issued.
Other options provide interactive GPFS debugging, but are not described here. Output from the mmfsadm
command is required in almost all cases where a GPFS problem is being reported. The mmfsadm
command collects data only on the node where it is issued. Depending on the nature of the problem,
the mmfsadm command output might be required from several or all nodes. The mmfsadm command
output from the file system manager is often required.
To determine where the file system manager is, issue the mmlsmgr command:
mmlsmgr
mmgetstate -L -a
Node number Node name Quorum Nodes up Total nodes GPFS state Remarks
--------------------------------------------------------------------
2 k154n06 1* 3 7 active quorum node
3 k155n05 1* 3 7 active quorum node
4 k155n06 1* 3 7 active quorum node
5 k155n07 1* 3 7 active
6 k155n08 1* 3 7 active
9 k156lnx02 1* 3 7 active
11 k155n09 1* 3 7 active
mmlscluster
The mmlscluster command is fully described in the Command reference section in the IBM Storage
Scale: Command and Programming Reference Guide.
mmlsconfig
The mmlsconfig command is fully described in the Command reference section in the IBM Storage Scale:
Command and Programming Reference Guide.
The -f flag can be used to force the GPFS cluster configuration data files to be rebuilt whether they
appear to be at the most current level or not. If no other option is specified, the command affects only the
node on which it is run. For example:
mmrefresh -a
mmexpelnode -N c100c1rp3
mmexpelnode --list
Node List
---------------------
192.168.100.35 (c100c1rp3.ppd.pok.ibm.com)
mmexpelnode -r -N c100c1rp3
mmlsmount all -L
The mmlsmount command is fully described in the Command reference section in the IBM Storage Scale:
Command and Programming Reference Guide.
/* Exclusion rule */
RULE 'exclude *.save files' EXCLUDE WHERE NAME LIKE '%.save'
/* Deletion rule */
RULE 'delete' DELETE FROM POOL 'sp1' WHERE NAME LIKE '%tmp%'
/* Migration rule */
RULE 'migration to system pool' MIGRATE FROM POOL 'sp1' TO POOL 'system' WHERE NAME LIKE
'%file%'
/* Typo in rule : removed later */
RULE 'exclude 2' EXCULDE
/* List rule */
RULE EXTERNAL LIST 'tmpfiles' EXEC '/tmp/exec.list'
RULE 'all' LIST 'tmpfiles' where name like '%tmp%'
The mmapplypolicy command is fully described in the Command reference section in the IBM Storage
Scale: Command and Programming Reference Guide.
mmapplypolicy -L 0
Use this option to display only serious errors.
In this example, there is an error in the policy file. This command:
/* Typo in rule */
RULE 'exclude 2' EXCULDE
mmapplypolicy -L 2
Use this option to display all of the information from the previous levels, plus each chosen file and the
scheduled migration or deletion action.
This command:
mmapplypolicy -L 3
Use this option to display all of the information from the previous levels, plus each candidate file and the
applicable rule.
This command:
mmapplypolicy -L 4
Use this option to display all of the information from the previous levels, plus the name of each explicitly
excluded file, and the applicable rule.
This command:
[I] Directories scan: 10 files, 1 directories, 0 other objects, 0 'skipped' files and/or errors.
/fs1/file1.save RULE 'exclude *.save files' EXCLUDE
/fs1/file2.save RULE 'exclude *.save files' EXCLUDE
/fs1/file.tmp1 RULE 'delete' DELETE FROM POOL 'sp1' WEIGHT(INF)
/fs1/file.tmp1 RULE 'all' LIST 'tmpfiles' WEIGHT(INF)
/fs1/file.tmp0 RULE 'delete' DELETE FROM POOL 'sp1' WEIGHT(INF)
/fs1/file.tmp0 RULE 'all' LIST 'tmpfiles' WEIGHT(INF)
/fs1/file1 RULE 'migration to system pool' MIGRATE FROM POOL 'sp1' TO POOL 'system'
WEIGHT(INF)
/fs1/file0 RULE 'migration to system pool' MIGRATE FROM POOL 'sp1' TO POOL 'system'
WEIGHT(INF)
mmapplypolicy -L 5
Use this option to display all of the information from the previous levels, plus the attributes of candidate
and excluded files.
These attributes include:
• MODIFICATION_TIME
• USER_ID
• GROUP_ID
• FILE_SIZE
• POOL_NAME
• ACCESS_TIME
• KB_ALLOCATED
• FILESET_NAME
This command:
[I] Directories scan: 10 files, 1 directories, 0 other objects, 0 'skipped' files and/or errors.
/fs1/file1.save [2022-03-03@21:19:57 0 0 16384 sp1 2022-03-04@02:09:38 16 root] RULE 'exclude \
*.save files' EXCLUDE
/fs1/file2.save [2022-03-03@21:19:57 0 0 16384 sp1 2022-03-03@21:19:57 16 root] RULE 'exclude \
*.save files' EXCLUDE
/fs1/file.tmp1 [2022-03-04@02:09:31 0 0 0 sp1 2022-03-04@02:09:31 0 root] RULE 'delete' DELETE
\
FROM POOL 'sp1' WEIGHT(INF)
/fs1/file.tmp1 [2022-03-04@02:09:31 0 0 0 sp1 2022-03-04@02:09:31 0 root] RULE 'all' LIST \
'tmpfiles' WEIGHT(INF)
/fs1/file.tmp0 [2022-03-04@02:09:38 0 0 16384 sp1 2022-03-04@02:09:38 16 root] RULE 'delete' \
DELETE FROM POOL 'sp1' WEIGHT(INF)
/fs1/file.tmp0 [2022-03-04@02:09:38 0 0 16384 sp1 2022-03-04@02:09:38 16 root] RULE 'all' \
LIST 'tmpfiles' WEIGHT(INF)
/fs1/file1 [2022-03-03@21:32:41 0 0 16384 sp1 2022-03-03@21:32:41 16 root] RULE 'migration
\
to system pool' MIGRATE FROM POOL 'sp1' TO POOL 'system' WEIGHT(INF)
/fs1/file0 [2022-03-03@21:21:11 0 0 16384 sp1 2022-03-03@21:32:41 16 root] RULE 'migration
\
to system pool' MIGRATE FROM POOL 'sp1' TO POOL 'system' WEIGHT(INF)
mmapplypolicy -L 6
Use this option to display all of the information from the previous levels, plus files that are not candidate
files, and their attributes.
These attributes include:
• MODIFICATION_TIME
• USER_ID
• GROUP_ID
• FILE_SIZE
• POOL_NAME
The output contains information about the data1 file, which is not a candidate file.
To find out the local device names for these disks, use the mmlsnsd command with the -m option. For
example, issuing mmlsnsd -m produces output similar to this:
To obtain extended information for NSDs, use the mmlsnsd command with the -X option. For example,
issuing mmlsnsd -X produces output similar to this:
The mmlsnsd command is fully described in the Command reference section in the IBM Storage Scale:
Command and Programming Reference Guide.
Where:
Disk
is the Windows disk number as shown in the Disk Management console and the DISKPART command-
line utility.
Avail
shows the value YES when the disk is available and in a state suitable for creating an NSD.
GPFS Partition ID
is the unique ID for the GPFS partition on the disk.
The mmwindisk command does not provide the NSD volume ID. You can use mmlsnsd -m to find the
relationship between NSDs and devices, which are disk numbers on Windows.
Before you run mmfileid, you must run a disk analysis utility and obtain the disk sector numbers that are
damaged or suspect. These sectors are input to the mmfileid command.
The command syntax is as follows:
mmfileid Device
{-d DiskDesc | -F DescFile}
[-o OutputFile] [-f NumThreads] [-t Directory]
[-N {Node[,Node...] | NodeFile | NodeClass}] [--qos QOSClass]
NodeName:DiskName[:PhysAddr1[-PhysAddr2]]
:{NsdName|DiskNum|BROKEN}[:PhysAddr1[-PhysAddr2]]
NodeName
Specifies a node in the GPFS cluster that has access to the disk to scan. You must specify this
value if the disk is identified with its physical volume name. Do not specify this value if the disk is
identified with its NSD name or its GPFS disk ID number, or if the keyword BROKEN is used.
DiskName
Specifies the physical volume name of the disk to scan as known on node NodeName.
NsdName
Specifies the GPFS NSD name of the disk to scan.
DiskNum
Specifies the GPFS disk ID number of the disk to scan as displayed by the mmlsdisk -L
command.
BROKEN
Tells the command to scan all the disks in the file system for files with broken addresses that
result in lost data.
k148n07:hdisk9:2206310-2206810
:gpfs1008nsd:
:10:27645856
:BROKEN
-F DescFile
Specifies a file that contains a list of disk descriptors, one per line.
-f NumThreads
Specifies the number of worker threads to create. The default value is 16. The minimum value
is 1. The maximum value is the maximum number allowed by the operating system function
pthread_create for a single process. A suggested value is twice the number of disks in the file
system.
-N {Node[,Node...] | NodeFile | NodeClass}
Specifies the list of nodes that participate in determining the disk addresses. This command supports
all defined node classes. The default is all or the current value of the defaultHelperNodes
configuration parameter of the mmchconfig command.
For general information on how to specify node names, see Specifying nodes as input to GPFS
commands in the IBM Storage Scale: Administration Guide.
-o OutputFile
The path name of a file to which the result from the mmfileid command is to be written. If not
specified, the result is sent to standard output.
-t Directory
Specifies the directory to use for temporary storage during mmfileid command processing. The
default directory is /tmp.
--qos QOSClass
Specifies the Quality of Service for I/O operations (QoS) class to which the instance of the command
is assigned. If you do not specify this parameter, the instance of the command is assigned by default
to the maintenance QoS class. This parameter has no effect unless the QoS service is enabled. For
more information, see the help topic on the mmchqos command in the IBM Storage Scale: Command
and Programming Reference Guide. Specify one of the following QoS classes:
maintenance
This QoS class is typically configured to have a smaller share of file system IOPS. Use this class for
I/O-intensive, potentially long-running GPFS commands, so that they contribute less to reducing
overall file system performance.
other
This QoS class is typically configured to have a larger share of file system IOPS. Use this class for
administration commands that are not I/O-intensive.
For more information, see the help topic on Setting the Quality of Service for I/O operations (QoS) in
the IBM Storage Scale: Administration Guide.
You can redirect the output to a file with the -o flag and sort the output on the inode number with the
sort command.
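For example, assuming the disk descriptors are saved in a file named /tmp/addr.in (the file names here are placeholders):
mmfileid fs1 -F /tmp/addr.in -o /tmp/fileid.out
sort -n /tmp/fileid.out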
The mmfileid command output contains one line for each inode found to be on a corrupted disk sector.
Each line of the command output has this format:
InodeNumber
Indicates the inode number of the file identified by mmfileid.
k148n07:hdisk9:2206310-2206810
k148n07:hdisk8:2211038-2211042
k148n07:hdisk8:2201800-2202800
k148n01:hdisk6:2921879-2926880
k148n09:hdisk7:1076208-1076610
The lines that begin with the word Address represent GPFS system metadata files or reserved disk areas.
If your output contains any lines like these, do not attempt to replace or repair the indicated files. If you
suspect that any of the special files are damaged, call the IBM Support Center for assistance.
The following line of output indicates that inode number 14336, disk address 1072256 contains file /
gpfsB/tesDir/testFile.out. The 0 to the left of the name indicates that the file does not belong to a
snapshot. This file is on a potentially bad disk sector area:
The following line of output indicates that inode number 14344, disk address 2922528 contains file /
gpfsB/x.img. The 1 to the left of the name indicates that the file belongs to snapshot number 1. This file
is on a potentially bad disk sector area:
Usage:
mmperfmon query Metric[,Metric...] | Key[,Key...] | NamedQuery [StartTime EndTime | Duration]
[Options]
OR
mmperfmon query compareNodes ComparisonMetric [StartTime EndTime | Duration] [Options]
where
Metric metric name
Key a key consisting of node name, sensor group, optional additional
filters,
metric name, separated by pipe symbol
e.g.: "cluster1.ibm.com|CTDBStats|locking|db_hop_count_bucket_00"
NamedQuery name of a pre-defined query
ComparisonMetric name of a metric to be compared if using CompareNodes
StartTime Start timestamp for query
Format: YYYY-MM-DD-hh:mm:ss
EndTime End timestamp for query. Omitted means: execution time
Format: YYYY-MM-DD-hh:mm:ss
Duration Number of seconds into the past from today or <EndTime>
Options:
-h, --help show this help message and exit
-N NodeName, --Node=NodeName
Defines the node that metrics should be retrieved from
-b BucketSize, --bucket-size=BucketSize
Defines a bucket size (number of seconds), default is
1
-n NumberBuckets, --number-buckets=NumberBuckets
Number of buckets ( records ) to show, default is 10
--filter=Filter Filter criteria for the query to run
--format=Format Common format for all columns
--csv Provides output in csv format.
--raw Provides output in raw format rather than a pretty
table format.
--nice Use colors and other text attributes for output.
--resolve Resolve computed metrics, show metrics used
--short Shorten column names if there are too many to fit into
one row.
--list=List Show list of specified values (overrides other
For more information on monitoring performance and analyzing performance related issues, see “Using
the performance monitoring tool” on page 105 and mmperfmon command in the IBM Storage Scale:
Command and Programming Reference Guide.
The short URL points to this help topic to make it easier to find the information later.
By default, debug data is put into the /tmp/mmfs directory, or the directory specified for the
dataStructureDump configuration parameter, on each node. Plenty of disk space, typically many GBs,
needs to be available. Debug data is not collected when the directory runs out of disk space.
Important: Before you change the value of dataStructureDump, stop the GPFS trace. Otherwise, you
lose the GPFS trace data. Restart the GPFS trace afterwards.
After a potential deadlock is detected and the relevant debug data is collected, IBM Service needs to be
contacted to report the problem and to upload the debug data. Outdated debug data needs to be removed
to make room for new debug data in case a new potential deadlock is detected.
Sat Jul 18 09:52:04.626 2015: [A] Unexpected long waiter detected: Waiting 905.9380 sec since
2015-07-18 09:36:58, on node c33f2in01,
SharedHashTabFetchHandlerThread 8397: on MsgRecordCondvar,
reason 'RPC wait' for tmMsgTellAcquire1
The /var/log/messages file on Linux and the error log on AIX also log an entry for the deadlock
detection, but the mmfs.log file has most details.
The deadlockDetected event is triggered on "Unexpected long waiter detected", and any user program
that is registered for the event is invoked. Such a program can be used for recording and notification
purposes. See /usr/lpp/mmfs/samples/deadlockdetected.sample for an example and more
information.
When the flagged waiter disappears, an entry like the following one might appear in the mmfs.log file:
Sat Jul 18 10:00:05.705 2015: [N] The unexpected long waiter on thread 8397 has disappeared in 1386 seconds.
The mmdiag --deadlock command shows the flagged waiter and possibly other waiters close behind it
that also passed the threshold for deadlock detection.
If the flagged waiter disappears on its own, without any deadlock breakup actions, then the flagged
waiter is not a real deadlock, and the detection is a false positive. A reasonable threshold needs to be
established to reduce false-positive deadlock detection. Consider the trade-off between waiting too long,
which delays detection of a real deadlock, and not waiting long enough, which causes a false-positive
detection.
A false positive deadlock detection and debug data collection are not necessarily a waste of resources. A
long waiter, even if it eventually disappears on its own, likely indicates that something is not working well,
and is worth looking into.
The configuration parameter deadlockDetectionThreshold is used to specify the initial threshold for
deadlock detection. GPFS code adjusts the threshold on each node based on what's happening on the
node and cluster. The adjusted threshold is the effective threshold used in automated deadlock detection.
mmlsconfig deadlockDetectionThreshold
mmlsconfig deadlockDetectionThresholdForShortWaiters
deadlockDetectionThreshold 300
deadlockDetectionThresholdForShortWaiters 60
What debug data is collected depends on the value of the configuration parameter debugDataControl.
The default value is light and a minimum amount of debug data, the data that is most frequently
needed to debug a GPFS issue, is collected. The value medium gets more debug data collected.
mmlsconfig deadlockDataCollectionDailyLimit
deadlockDataCollectionDailyLimit 3
mmlsconfig deadlockBreakupDelay
deadlockBreakupDelay 0
The value of 0 shows that automated deadlock breakup is disabled. To enable automated deadlock
breakup, specify a positive value for deadlockBreakupDelay. If automated deadlock breakup is to be
enabled, a delay of 300 seconds or longer is recommended.
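For example, to enable automated deadlock breakup with the recommended minimum delay, and to disable it again later:
mmchconfig deadlockBreakupDelay=300
mmchconfig deadlockBreakupDelay=0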
Automated deadlock breakup is done on a node-by-node basis. If automated deadlock breakup is
enabled, the breakup process is started when the suspected deadlock waiter is detected on a node.
The process first waits for the deadlockBreakupDelay, and then goes through various phases until the
deadlock waiters disappear. There is no central coordination on the deadlock breakup, so the time to take
deadlock breakup actions may be different on each node. Breaking up a deadlock waiter on one node can
cause some deadlock waiters on other nodes to disappear, so no breakup actions need to be taken on
those other nodes.
If a suspected deadlock waiter disappears while waiting for the deadlockBreakupDelay, the
automated deadlock breakup process stops immediately without taking any further action. To lessen
the number of breakup actions that are taken in response to detecting a false-positive deadlock, increase
the deadlockBreakupDelay value.
If the mmcommon breakDeadlock command is issued without the -N parameter, then every node in the
cluster receives a request to take action on any long waiter that is a suspected deadlock.
If the mmcommon breakDeadlock command is issued with the -N parameter, then only the nodes
that are specified receive a request to take action on any long waiter that is a suspected deadlock. For
example, assume that there are two nodes, called node3 and node6, that require a deadlock breakup. To
send the breakup request to just these nodes, issue the following command:
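mmcommon breakDeadlock -N node3,node6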
Shortly after running the mmcommon breakDeadlock command, issue the following command:
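For example, the mmdsh command can be used to list the current waiters on all nodes (the exact invocation shown here is an assumption):
mmdsh -N all /usr/lpp/mmfs/bin/mmdiag --waiters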
The output of the mmdsh command can be used to determine if any deadlock waiters still exist and if any
additional actions are needed.
[N] Received deadlock breakup request from 192.168.40.72: No deadlock to break up.
The mmcommon breakDeadlock command provides more control over breaking up deadlocks, but
multiple breakup requests might be required to achieve satisfactory results. All waiters that exceeded the
deadlockDetectionThreshold might not disappear when mmcommon breakDeadlock completes
on a node. In complicated deadlock scenarios, some long waiters can persist after the longest
waiters disappear. Waiter length can grow to exceed the deadlockDetectionThreshold at any
point, and waiters can disappear at any point as well. Examine the waiter situation after mmcommon
breakDeadlock completes to determine whether the command must be repeated to break up the
deadlock.
Another way to break up a deadlock on demand is to enable automated deadlock breakup by changing
deadlockBreakupDelay to a positive value. By enabling automated deadlock breakup, breakup actions
are initiated on existing deadlock waiters. The breakup actions repeat automatically if deadlock waiters
are detected. Change deadlockBreakupDelay back to 0 when the results are satisfactory, or when you
want to control the timing of deadlock breakup actions again. If automated deadlock breakup remains
enabled, breakup actions start on any newly detected deadlocks without any intervention.
Finding deployment related error messages more easily and using them for
failure analysis
Use this information to find and analyze error messages related to installation, deployment, and upgrade
from the respective logs when using the installation toolkit.
In case of any installation, deployment, and upgrade related error:
1. Go to the end of the corresponding log file and search upwards for the text FATAL.
2. Find the topmost occurrence of FATAL (or first FATAL error that occurred) and look above and below
this error for further indications of the failure.
ssh HostNameofFirstNode
ssh HostNameofSecondNode
b. Verify that the user can log into the node by using the FQDN of the node successfully without being
prompted for any input and that there are no warnings.
ssh FQDNofFirstNode
ssh FQDNofSecondNode
ssh IPAddressofFirstNode
ssh IPAddressofSecondNode
ssh-keygen
ssh-copy-id FQDNofFirstNode
ssh-copy-id FQDNofSecondNode
ssh-copy-id HostNameofFirstNode
ssh-copy-id HostNameofSecondNode
ssh-copy-id IPAddressofFirstNode
ssh-copy-id IPAddressofSecondNode
Repository setup
• Verify that the repository is set up depending on your operating system. For example, verify that the
yum repository is set up by using the following command on all cluster nodes.
yum repolist
This command should run clean with no errors if the yum repository is set up correctly.
Firewall configuration
It is recommended that firewalls are in place to secure all nodes. For more information, see Securing the
IBM Storage Scale system using firewall in IBM Storage Scale: Administration Guide.
• If you need to open specific ports, use the following steps on Red Hat Enterprise Linux nodes.
# ifconfig -a
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>mtu 1500
inet 192.168.251.161 netmask 255.255.254.0 broadcast 192.168.251.255
inet6 2002:90b:e006:84:250:56ff:fea5:1d86 prefixlen 64 scopeid 0x0<global>
inet6 fe80::250:56ff:fea5:1d86 prefixlen 64 scopeid 0x20<link>
ether 00:50:56:a5:1d:86 txqueuelen 1000 (Ethernet)
RX packets 1978638 bytes 157199595 (149.9 MiB)
RX errors 0 dropped 2291 overruns 0 frame 0
TX packets 30884 bytes 3918216 (3.7 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
# ip addr
2: eth0:<BROADCAST,MULTICAST,UP,LOWER_UP>mtu 1500 qdisc mq state UP qlen 1000
link/ether 00:50:56:a5:1d:86 brd ff:ff:ff:ff:ff:ff
inet 192.168.251.161/23 brd 192.168.251.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 2002:90b:e006:84:250:56ff:fea5:1d86/64 scope global dynamic
valid_lft 2591875sec preferred_lft 604675sec
inet6 fe80::250:56ff:fea5:1d86/64 scope link
valid_lft forever preferred_lft forever
# ifconfig -a
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.251.161 netmask 255.255.254.0 broadcast 192.168.251.255
inet6 2002:90b:e006:84:250:56ff:fea5:1d86 prefixlen 64 scopeid 0x0<global>
inet6 fe80::250:56ff:fea5:1d86 prefixlen 64 scopeid 0x20<link>
ether 00:50:56:a5:1d:86 txqueuelen 1000 (Ethernet)
RX packets 2909840 bytes 1022774886 (975.3 MiB)
RX errors 0 dropped 2349 overruns 0 frame 0
TX packets 712595 bytes 12619844288 (11.7 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
eth0:0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>mtu 1500
inet 192.168.251.165 netmask 255.255.254.0 broadcast 192.168.251.255
ether 00:50:56:a5:1d:86 txqueuelen 1000 (Ethernet)
# ip addr
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP>mtu 1500 qdisc mq state UP qlen 1000
link/ether 00:50:56:a5:1d:86 brd ff:ff:ff:ff:ff:ff
inet 192.168.251.161/23 brd 9.11.85.255 scope global eth0
valid_lft forever preferred_lft forever
inet 192.168.251.165/23 brd 9.11.85.255 scope global secondary eth0:0
valid_lft forever preferred_lft forever
inet 192.168.251.166/23 brd 9.11.85.255 scope global secondary eth0:1
valid_lft forever preferred_lft forever
inet6 2002:90b:e006:84:250:56ff:fea5:1d86/64 scope global dynamic
valid_lft 2591838sec preferred_lft 604638sec
inet6 fe80::250:56ff:fea5:1d86/64 scope link
valid_lft forever preferred_lft forever
# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
# These are addresses for the base adapter used to alias CES-IPs to.
# Do not use these as CES-IPs.
# You could use these for a gpfs cluster if you choose
# Or you could leave these unused as placeholders
203.0.113.7 ss-deploy-cluster3-1_ces.example.com ss-deploy-cluster3-1_ces
203.0.113.10 ss-deploy-cluster3-2_ces.example.com ss-deploy-cluster3-2_ces
203.0.113.12 ss-deploy-cluster3-3_ces.example.com ss-deploy-cluster3-3_ces
203.0.113.14 ss-deploy-cluster3-4_ces.example.com ss-deploy-cluster3-4_ces
In this example, the first two sets of addresses have unique host names and the third set of addresses
that are associated with CES IPs are not unique. Alternatively, you could give each CES IP a unique host
name but this is an arbitrary decision because only the node itself can see its own /etc/hosts file.
Therefore, these host names are not visible to external clients/nodes unless they too contain a mirror
copy of the /etc/hosts file. The reason for containing the CES IPs within the /etc/hosts file is solely
to satisfy the IBM Storage Scale CES network verification checks. Without this, in cases with no DNS
server, CES IPs cannot be added to a cluster.
If any of the listed nodes are of an unsupported OS type, then they need to be removed by using the
following command:
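A sketch, assuming the installation toolkit's node delete subcommand (the host name is a placeholder):
./spectrumscale node delete node1.example.com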
If the node to be removed is an NSD node, then you might have to manually create NSDs and file systems
before using the installation toolkit.
The installation toolkit does not need to be made aware of preexisting file systems and NSDs that
are present on unsupported node types. Ensure that the file systems are mounted before running the
installation toolkit and that they point at the mount points or directory structures.
For information about how the installation toolkit can be used in a cluster that has nodes with mixed
operating systems, see Mixed operating system support with the installation toolkit in IBM Storage
Scale: Concepts, Planning, and Installation Guide.
# mmlsfs all -z
File system attributes for /dev/fs1:
====================================
flag value description
------------------- ------------------------ -----------------------------------
-z yes Is DMAPI enabled?
2. Shut down all functions that are using DMAPI and unmount DMAPI by using the following steps:
a. Shut down all functions that are using DMAPI. This includes HSM policies and IBM Spectrum
Archive.
b. Unmount the DMAPI file system from all nodes by using the following command:
# mmunmount fs1 -a
Note: If the DMAPI file system is also the CES shared root file system, then you must first shut
down GPFS on all protocol nodes before unmounting the file system.
i) To check whether the DMAPI file system is also the CES shared root file system, use the following
command:
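mmlsconfig cesSharedRoot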
ii) Compare the output of this command with that of Step 1 to determine whether the CES shared
root file system has DMAPI enabled.
iii) Shut down GPFS on all protocol nodes by using the following command:
# mmshutdown -N cesNodes
# mmchfs fs1 -z no
3. If GPFS was shut down on the protocol nodes in one of the preceding steps, start GPFS on the protocol
nodes by using the following command:
# mmstartup -N cesNodes
4. Remount the file system on all nodes by using the following command:
# mmmount fs1 -a
5. Proceed with using the installation toolkit as now it can be used on all file systems.
6. After the task that is performed by using the installation toolkit is completed, enable DMAPI by using
the following steps:
a. Unmount the DMAPI file system from all nodes.
The error occurs because the Ansible package might get removed after the upgrade to Ubuntu 22.04.
Resolve this issue as follows.
1. Manually install Ansible 2.9.15.
The error might occur because the path of the required Python version is not set correctly.
Resolve this issue as follows.
1. Set the path of the required Python version correctly.
2. Fix the yum environment issue such that there are no warnings or errors in the output of yum
commands.
3. Retry the installation by using the installation toolkit.
The error message on Red Hat Enterprise Linux or Ubuntu is similar to the following message:
Workaround:
1. Do one of the following steps:
• Create a fresh SSH key in the PEM format by using the -m PEM option. For example, ssh-keygen -m PEM
• Convert the existing SSH key by using the following command.
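A sketch of the conversion, assuming an RSA key at the default path (the -p option rewrites the key file in place and -m PEM selects the PEM format):
ssh-keygen -p -m PEM -f ~/.ssh/id_rsa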
Workaround:
1. Issue the following command on each node added in the installation toolkit cluster definition.
eval "$(ssh-agent)";
Workaround:
1. List all the scope files without a directory.
Could not get lock /var/lib/apt/lists/lock - open (11: Resource temporarily unavailable)
Unable to lock directory /var/lib/apt/lists/
Could not get lock /var/lib/dpkg/lock - open (11: Resource temporarily unavailable)
Unable to lock the administration directory (/var/lib/dpkg/), is another process using it?
This process might be running on multiple nodes. Therefore, you might need to issue this command
on each of these nodes. If the installation process failed after the creation of the cluster, you can use the
mmdsh command to identify the apt-get process on each node on which it is running.
3. Retry the installation toolkit setup. If the error persists, issue following commands and then try again:
rm /var/lib/apt/lists/lock
dpkg --configure -a
2. Proceed with the installation, deployment, or upgrade with the installation toolkit.
mmsdrrestore -p primaryServer
where primaryServer is the name of the primary GPFS cluster configuration server.
If the /var/mmfs/gen/mmsdrfs file is not present on the primary GPFS cluster configuration server, but
it is present on some other node in the cluster, restore the file by issuing these commands:
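mmsdrrestore -p remoteNode -F remoteFile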
where remoteNode is the node that has an up-to-date version of the /var/mmfs/gen/mmsdrfs file, and
remoteFile is the full path name of that file on that node.
One way to ensure that the latest version of the /var/mmfs/gen/mmsdrfs file is always available is to
use the mmsdrbackup user exit.
If you have made modifications to any of the user exits in /var/mmfs/etc, you need to restore them
before starting GPFS.
For additional information, see “Recovery from loss of GPFS cluster configuration data file” on page 355.
mmdelnode -f
Workaround
1. Run the following command to manually remove the conflicting RPM:
Authorization problems
This topic describes issues with running remote commands due to authorization problems in IBM Storage
Scale.
The ssh and scp commands are used by GPFS administration commands to perform operations on other
nodes. The ssh daemon (sshd) on the remote node must recognize the command being run and must
obtain authorization to invoke it.
Note: Use the ssh and scp commands that are shipped with the OpenSSH package supported by GPFS.
Refer to the IBM Storage Scale FAQ in IBM Documentation for the latest OpenSSH information.
For more information, see “Problems due to missing prerequisites” on page 339.
For the ssh and scp commands issued by GPFS administration commands to succeed, each node in the
cluster must have an .rhosts file in the home directory for the root user, with file permission set to 600.
This .rhosts file must list each of the nodes and the root user. If such an .rhosts file does not exist
on each node in the cluster, the ssh and scp commands issued by GPFS commands fail with permission
errors, causing the GPFS commands to fail in turn.
Connectivity problems
This topic describes the issues with running GPFS commands on remote nodes due to connectivity
problems.
Another reason why ssh may fail is that connectivity to a needed node has been lost. Error messages
from mmdsh may indicate that connectivity to such a node has been lost. Here is an example:
mmdelnode -N k145n04
Verifying GPFS is stopped on all affected nodes ...
mmdsh: 6027-1617 There are no available nodes on which to run the command.
mmdelnode: 6027-1271 Unexpected error from verifyDaemonInactive: mmcommon onall.
Return code: 1
If error messages indicate that connectivity to a node has been lost, use the ping command to verify
whether the node can still be reached:
ping k145n04
PING k145n04: (119.114.68.69): 56 data bytes
<Ctrl- C>
----k145n04 PING Statistics----
3 packets transmitted, 0 packets received, 100% packet loss
If connectivity has been lost, restore it, then reissue the GPFS command.
mmcommon showLocks
The mmcommon showLocks command displays information about the lock server, lock name, lock
holder, PID, and extended information. If a GPFS administration command is not responding, stopping
the command frees the lock. If another process has this PID, an error might have occurred that caused
the original GPFS command to die without freeing the lock, and a new process might have been assigned
the same PID. If this is the case, do not kill the process.
2. If any locks are held and you want to release them manually, from any node in the GPFS cluster issue
the command:
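The usual form is the following command (shown here as an assumption; a specific lock name might be required as an argument):
mmcommon freeLocks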
mmsdrrestore -p primaryServer
where primaryServer is the name of the primary GPFS cluster configuration server.
If the /var/mmfs/gen/mmsdrfs file is not present on the primary GPFS cluster configuration server, but
is present on some other node in the cluster, restore the file by issuing these commands:
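mmsdrrestore -p remoteNode -F remoteFile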
where remoteNode is the node that has an up-to-date version of the /var/mmfs/gen/mmsdrfs file and
remoteFile is the full path name of that file on that node.
2. The GPFS kernel modules, mmfslinux and tracedev, are built with a kernel version that differs from
that of the currently running Linux kernel. This situation can occur if the modules are built on another
node with a different kernel version and then copied to this node.
ps -e | grep mmfsd
The output of this command should list mmfsd as operational. For example:
If the output does not show this, the GPFS daemon needs to be started with the mmstartup
command.
3. If you did not specify the autoload option on the mmcrcluster or the mmchconfig command, you
need to manually start the daemon by issuing the mmstartup command.
ping nodename
to each node in the cluster. A properly working network and node reply to the ping with no
lost packets.
Query the network interface that GPFS is using with:
netstat -i
lsfs -v mmfs
If any of these commands produce unexpected results, this may be an indication of corrupted GPFS
cluster configuration data file information. Follow the procedures in “Information to be collected
before contacting the IBM Support Center” on page 555, and then contact the IBM Support Center.
7. GPFS requires a quorum of nodes to be active before any file system operations can be honored.
This requirement guarantees that a valid single token management domain exists for each GPFS file
system. Prior to the existence of a quorum, most requests are rejected with a message indicating that
quorum does not exist.
To identify which nodes in the cluster have daemons up or down, issue:
mmgetstate -L -a
If insufficient nodes are active to achieve quorum, go to any nodes not listed as active and perform
problem determination steps on these nodes. A quorum node indicates that it is part of a quorum by
writing an mmfsd ready message to the GPFS log. Remember that your system may have quorum
nodes and non-quorum nodes, and only quorum nodes are counted to achieve the quorum.
8. This step applies only to AIX nodes. Verify that the GPFS kernel extension is not having problems with its
shared segment by invoking:
cat /var/adm/ras/mmfs.log.latest
mmgetstate
mmlscluster
sshd: 0826-813 Permission is denied.
mmdsh: 6027-1615 k145n02 remote shell process had return code 1.
mmlscluster: 6027-1591 Attention: Unable to retrieve GPFS cluster files from node k145n02
sshd: 0826-813 Permission is denied.
mmdsh: 6027-1615 k145n01 remote shell process had return code 1.
mmlscluster: 6027-1592 Unable to retrieve GPFS cluster files from node k145n01
These messages indicate that ssh is not working properly on nodes k145n01 and k145n02.
If you encounter this type of failure, determine why ssh is not working on the identified node. Then fix
the problem.
4. Most problems encountered during file system creation fall into three classes:
• You did not create network shared disks which are required to build the file system.
• The creation operation cannot access the disk.
Follow the procedures for checking access to the disk. This can result from a number of factors
including those described in “NSD and underlying disk subsystem failures” on page 407.
• Unsuccessful attempt to communicate with the file system manager.
The file system creation runs on the file system manager node. If that node goes down, the mmcrfs
command may not succeed.
5. If the mmdelnode command was unsuccessful and you plan to permanently de-install GPFS from a
node, you should first remove the node from the cluster. If this is not done and you run the mmdelnode
command after the mmfs code is removed, the command fails and displays a message similar to this
example:
If this happens, power off the node and run the mmdelnode command again.
6. If you have successfully installed and are operating with the latest level of GPFS, but cannot run the
new functions available, it is probable that you have not issued the mmchfs -V full or mmchfs -V
compat command to change the version of the file system. This command must be issued for each of
your file systems.
In addition to mmchfs -V, you may need to run the mmmigratefs command. See the File system
format changes between versions of GPFS topic in the IBM Storage Scale: Administration Guide.
Quorum loss
Each GPFS cluster has a set of quorum nodes explicitly set by the cluster administrator.
These quorum nodes and the selected quorum algorithm determine the availability of file systems
owned by the cluster. For more information, see Quorum in IBM Storage Scale: Concepts, Planning, and
Installation Guide.
When quorum loss or loss of connectivity occurs, any nodes still running GPFS suspend the use of file
systems owned by the cluster experiencing the problem. This may result in GPFS access within the
suspended file system receiving ESTALE errors. Nodes continuing to function after suspending file system
access start contacting other nodes in the cluster in an attempt to rejoin or reform the quorum. If they
succeed in forming a quorum, access to the file system is restarted.
Normally, quorum loss or loss of connectivity occurs if a node goes down or becomes isolated from its
peers by a network failure. The expected response is to address the failing condition.
c. Use the mmlsquota -j command to check the quota limit of the fileset. For example, using the
fileset name found in the previous step, issue this command:
mmlsquota -j myFileset -e
The mmlsquota output is similar when checking the user and group quota. If usage is equal to or
approaching the hard limit, or if the grace period has expired, make sure that no quotas are lost by
checking in doubt values.
If quotas are exceeded in the in doubt category, run the mmcheckquota command. For more
information, see “The mmcheckquota command” on page 325.
Note: There is no way to force GPFS nodes to relinquish all their local shares in order to check for
lost quotas. This can only be determined by running the mmcheckquota command immediately after
mounting the file system, and before any allocations are made. In this case, the value in doubt is the
amount lost.
To display the latest quota usage information, use the -e option on either the mmlsquota or the
mmrepquota commands. Remember that the mmquotaon and mmquotaoff commands do not enable
and disable quota management. These commands merely control enforcement of quota limits. Usage
continues to be counted and recorded in the quota files regardless of enforcement.
Reduce quota usage by deleting or compressing files or moving them out of the file system. Consider
increasing the quota limit.
Windows issues
The topics that follow apply to Windows Server 2008.
bash-3.00$ ls -l -d ~
drwx------ 1 demyn Domain Users 0 Dec 5 11:53 /dev/fs/D/Users/demyn
bash-3.00$ ls -l -d ~/.ssh
drwx------ 1 demyn Domain Users 0 Oct 26 13:37 /dev/fs/D/Users/demyn/.ssh
bash-3.00$ ls -l ~/.ssh
total 11
drwx------ 1 demyn Domain Users 0 Oct 26 13:37 .
drwx------ 1 demyn Domain Users 0 Dec 5 11:53 ..
-rw-r--r-- 1 demyn Domain Users 603 Oct 26 13:37 authorized_keys2
-rw------- 1 demyn Domain Users 672 Oct 26 13:33 id_dsa
-rw-r--r-- 1 demyn Domain Users 603 Oct 26 13:33 id_dsa.pub
-rw-r--r-- 1 demyn Domain Users 2230 Nov 11 07:57 known_hosts
bash-3.00$
This issue typically occurs because of a conflict with the python2-cryptography package installed
on the system. The conflicting python2-cryptography package name contains the string ibm. For
example, python2-cryptography-x.x.x.x.ibm.el7.
• Check whether the python2-cryptography package that is installed on the system has ibm in the
package name.
rpm -q python2-cryptography
If the object protocol is enabled on your system and you want to continue using it, see Considerations
for upgrading from an operating system not supported in IBM Storage Scale 5.1.x.x in IBM Storage Scale:
Concepts, Planning, and Installation Guide.
Use the following workaround if:
• The object protocol is not installed, or you can uninstall the object protocol. You can check whether the
object protocol is enabled by using the mmces service list -a command.
• The correct python2-cryptography package is not installed.
Workaround:
1. Remove the spectrum-scale-object package if it exists.
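A hedged example of removing the package on an RPM-based system (yum is assumed as the package manager):
yum remove spectrum-scale-object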
After doing the workaround steps, verify that the correct python2-cryptography package is installed
on the system. The correct package has the default version without ibm in the name. For example,
python2-cryptography-x.x.x.x.el7.
rpm -q python2-cryptography
Workaround:
The workaround for this error is to refresh your browser cache.
[W] VERBS RDMA open error verbsPort <port> due to missing support for atomic operations for
device <device>
Workaround:
Check the description of the verbsRdmaWriteFlush configuration variable in the mmchconfig command
topic in IBM Storage Scale: Command and Programming Reference Guide for possible options.
The CCR check on the non-quorum node displays an output similar to this:
The following list provides descriptions for each CCR check item:
CCR_CLIENT_INIT
Verifies whether the CCR directory structure and files are complete and intact. It also verifies
whether the security layer that the CCR is using (GSKit) can be initialized successfully.
FC_CCR_AUTH_KEYS
Verifies that the CCR key file needed for authentication by the GSKit layer is available.
FC_CCR_PAXOS_CACHED and FC_CCR_PAXOS_12
Verify whether the CCR Paxos state files are available. On quorum nodes, these files are used during
CCR's consensus protocol. In addition, a cached copy on every node in the cluster is used to speed up
the process in certain cases.
With this command, the CCR client/server connection can be tested. If the server in the specified node
list does not echo the testString, then the connection between this client and server is not working. In
such scenarios, check whether port 1191 is blocked by a firewall, or whether an incorrect IP address
lookup is occurring due to inconsistent /etc/hosts entries.
• The CCR_DEBUG environment variable can be used in a CCR client command to print detailed console
output for debug purposes, as shown in the following example:
Mon Jun 25 22:23:36.298 2018: Close connection to 192.168.10.109 c5n109. Attempting reconnect.
Mon Jun 25 22:23:37.300 2018: Connecting to 192.168.10.109 c5n109
Mon Jun 25 22:23:37.398 2018: Close connection to 192.168.10.109 c5n109
Mon Jun 25 22:23:38.338 2018: Recovering nodes: 9.114.132.109
Mon Jun 25 22:23:38.722 2018: Recovered 1 nodes.
Nodes mounting file systems owned and served by other clusters may receive error messages similar to
this:
If a sufficient number of nodes fail, GPFS loses the quorum of nodes, which exhibits itself by messages
appearing in the GPFS log, similar to this:
When either of these cases occurs, perform problem determination on your network connectivity. Failing
components could be network hardware such as switches or host bus adapters.
This command runs several types of connectivity checks between each node and all the other nodes in
the group and reports the results on the console. Because a cluster does not exist yet, you must include
a configuration file File in which you list all the nodes that you want to test.
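A hedged sketch of such a pre-cluster check, assuming the mmnetverify command and an illustrative configuration file name:
mmnetverify connectivity --configuration-file /tmp/nodes.conf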
• To check for network outages in a cluster, you can run the following command:
This command runs several types of ping checks between each node and all the other nodes in the
cluster and reports the results on the console.
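For example, a cluster-wide ping check might look like this (a sketch; node lists default to the local cluster):
mmnetverify ping -N all --target-nodes all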
• Before you make a node a quorum node, you can run the following check to verify that other nodes can
communicate with the daemon:
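One possible form of this check, assuming the mmnetverify daemon-port operation (the candidate node name is illustrative):
mmnetverify daemon-port -N all --target-nodes newQuorumNode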
• To investigate a possible lag in large-data transfers between two nodes, you can run the following
command:
This command establishes a TCP connection from node2 to node3 and causes the two nodes to
exchange a series of large-sized data messages. If the bandwidth falls below the level that is specified,
the command generates an error. The output of the command to the console indicates the results of the
test.
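A hedged example, assuming the data-large operation and the --min-bandwidth option (node names and the threshold value are illustrative):
mmnetverify data-large -N node2 --target-nodes node3 --min-bandwidth 1000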
• To analyze a problem with connectivity between nodes, you can run the following command:
This command runs connectivity checks between each node and all the other nodes in the cluster, one
pair at a time, and writes the results of each test to the console and to the specified log file.
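For example (the log file path is illustrative):
mmnetverify connectivity -N all --target-nodes all --log-file /tmp/netverify.log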
starting ...
mounting ...
mounted ....
12. If quotas are enabled, check if there was an error while reading quota files. See “MMFS_QUOTA” on
page 276.
13. Verify the maxblocksize configuration parameter on all clusters involved. If maxblocksize is less
than the block size of the local or remote file system you are attempting to mount, you cannot mount
it.
14. If the file system has encryption rules, see “Mount failure for a file system with encryption rules” on
page 435.
15. To mount a file system on a remote cluster, ensure that the cluster that owns and serves the file
system and the remote cluster have proper authorization in place. The authorization between clusters
is set up with the mmauth command.
Authorization errors on AIX are similar to the following:
For more information about mounting a file system that is owned and served by another GPFS cluster,
see the Mounting a remote GPFS file system topic in the IBM Storage Scale: Administration Guide.
Mount failure due to client nodes joining before NSD servers are online
While mounting a file system, especially during automounting, if a client node joins the GPFS cluster and
attempts file system access before the file system's NSD servers are active, the mount fails. Use the
mmchconfig command to specify the amount of time for GPFS mount requests to wait for an NSD server
to join the cluster.
If a client node joins the GPFS cluster and attempts file system access before the file system's NSD
servers are active, the mount fails. This is especially true when automount is used. This situation can
occur during cluster startup, or any time that an NSD server is brought online with client nodes already
active and attempting to mount a file system served by the NSD server.
The file system mount failure produces a message similar to this:
Mon Jun 25 11:23:34 EST 2007: mmmount: Mounting file systems ...
No such device
Some file system data are inaccessible at this time.
Check error log for additional information.
After correcting the problem, the file system must be unmounted and then
mounted again to restore normal data access.
Failed to open fs1.
No such device
Some file system data are inaccessible at this time.
Cannot mount /dev/fs1 on /fs1: Missing file or filesystem
Two mmchconfig command options are used to specify the amount of time for GPFS mount requests to
wait for an NSD server to join the cluster:
nsdServerWaitTimeForMount
Specifies the number of seconds to wait for an NSD server to come up at GPFS cluster startup time,
after a quorum loss, or after an NSD server failure.
Valid values are between 0 and 1200 seconds. The default is 300. The interval for checking is
10 seconds. If nsdServerWaitTimeForMount is 0, nsdServerWaitTimeWindowOnMount has no
effect.
nsdServerWaitTimeWindowOnMount
Specifies a time window to determine if quorum is to be considered recently formed.
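For example, to lengthen both wait intervals (a sketch; the values are illustrative):
mmchconfig nsdServerWaitTimeForMount=600,nsdServerWaitTimeWindowOnMount=120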
the file system does not unmount until all processes are finished accessing it. If mmfsd is up, the
processes accessing the file system can be determined. See “The lsof command” on page 318. These
processes can be killed with the command:
If mmfsd is not operational, the lsof command is not able to determine which processes are still
accessing the file system.
For Linux nodes it is possible to use the /proc pseudo file system to determine current file access.
For each process currently running on the system, there is a subdirectory /proc/pid/fd, where pid
is the numeric process ID number. This subdirectory is populated with symbolic links pointing to the
files that this process has open. You can examine the contents of the fd subdirectory for all running
processes, manually or with the help of a simple script, to identify the processes that have open
files in GPFS file systems. Terminating all of these processes may allow the file system to unmount
successfully.
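A minimal sketch of such a script, assuming the file system is mounted at /gpfs/fs1 (the mount point is illustrative):
for pid in /proc/[0-9]*; do
  ls -l "$pid/fd" 2>/dev/null | grep -q '/gpfs/fs1' && echo "${pid#/proc/} has open files in /gpfs/fs1"
done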
To unmount a file system on a CES protocol node, first suspend the CES function on that node (for
example, by using the mmces node suspend command), and then force the unmount:
umount -f /fileSystem
4. If a file system that is mounted by a remote cluster needs to be unmounted, you can force the
unmount by issuing the command:
mmunmount Device -f -C RemoteClusterName
Error numbers specific to GPFS application calls when a file system has
been forced to unmount
There are error numbers to indicate that a file system is forced to unmount for GPFS application calls.
When a file system has been forced to unmount, GPFS may report these error numbers in the operating
system error log or return them to an application:
EPANIC = 666, A file system has been forcibly unmounted because of an error. This is most likely caused
by the failure of one or more disks that contain the last copy of metadata.
See “Operating system error logs” on page 274 for details.
EALL_UNAVAIL = 218, A replicated read or write failed because none of the replicas were available.
Multiple disks in multiple failure groups are unavailable. Follow the procedures in Chapter 25, “Disk
issues,” on page 407 for unavailable disks.
For Red Hat Enterprise Linux 5, verify the following line is in the default master map file (/etc/
auto.master):
/gpfs/automountdir program:/usr/lpp/mmfs/bin/mmdynamicmap
This is an autofs program map, and there is a single mount entry for all GPFS automounted file
systems. The symbolic link points to this directory, and access through the symbolic link triggers the
mounting of the target GPFS file system. To create this GPFS autofs mount, issue the mmcommon
startAutomounter command, or stop and restart GPFS using the mmshutdown and mmstartup
commands.
3. Verify that the automount daemon is running. Issue this command:
For Red Hat Enterprise Linux 5, verify that the autofs daemon is running. Issue this command:
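Typical checks (a sketch; the exact output varies by distribution):
ps -ef | grep automount
service autofs status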
To start the automount daemon, issue the mmcommon startAutomounter command, or stop and
restart GPFS using the mmshutdown and mmstartup commands.
Note: If automountdir is mounted (as in step 2) and the mmcommon startAutomounter command
is not able to bring up the automount daemon, manually unmount the automountdir before issuing
the mmcommon startAutomounter command again.
4. Verify that the mount command was issued to GPFS by examining the GPFS log. You should see
something like this:
5. Examine /var/log/messages for autofs error messages. The following is an example of what you
might see if the remote file system name does not exist.
6. After you have established that GPFS has received a mount request from autofs (Step “4” on page
386) and that mount request failed (Step “5” on page 386), issue a mount command for the GPFS file
system and follow the directions in “File system fails to mount” on page 377.
These are direct mount autofs mount entries. Each GPFS automount file system has an autofs
mount entry. These autofs direct mounts allow GPFS to mount on the GPFS mount point. To create
any missing GPFS autofs mounts, issue the mmcommon startAutomounter command, or stop and
restart GPFS using the mmshutdown and mmstartup commands.
3. Verify that the autofs daemon is running. Issue this command:
To start the automount daemon, issue the mmcommon startAutomounter command, or stop and
restart GPFS using the mmshutdown and mmstartup commands.
4. Verify that the mount command was issued to GPFS by examining the GPFS log. You should see
something like this:
5. Since the autofs daemon logs status using syslogd, examine the syslogd log file for status
information from automountd. Here is an example of a failed automount request:
6. After you have established that GPFS has received a mount request from autofs (Step “4” on page
387) and that mount request failed (Step “5” on page 387), issue a mount command for the GPFS file
system and follow the directions in “File system fails to mount” on page 377.
7. If automount fails for a non-GPFS file system and you are using file /etc/auto.master, use
file /etc/auto_master instead. Add the entries from /etc/auto.master to /etc/auto_master
and restart the automount daemon.
Remote file system I/O fails with the “Function not implemented” error
message when UID mapping is enabled
This topic describes the error messages that are displayed when a remote file system has an I/O failure,
and the course of action that you can take to correct the issue.
When user ID (UID) mapping in a multi-cluster environment is enabled, certain kinds of mapping
infrastructure configuration problems might result in I/O requests on a remote file system failing:
Remote file system does not mount due to differing GPFS cluster security
configurations
This topic describes the indications that a remote file system does not mount, and the courses of action
that you can take to correct the problem.
A mount command fails with a message similar to this:
The GPFS log on the cluster issuing the mount command should have entries similar to these:
The GPFS log file on the cluster that owns and serves the file system has an entry indicating the problem
as well, similar to this:
Mon Jun 25 16:32:21 2007: Kill accepted connection from 199.13.68.12 because security is
required, err 74
To resolve this problem, contact the administrator of the cluster that owns and serves the file system to
obtain the key, and register the key by using the mmremotecluster command.
The SHA digest field of the mmauth show and mmremotecluster commands may be used to determine
if there is a key mismatch, and on which cluster the key should be updated. For more information on the
SHA digest, see “The SHA digest” on page 329.
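For example, you can display the key digests on each side for comparison (a sketch; run mmauth show on the owning cluster and mmremotecluster show on the accessing cluster):
mmauth show all
mmremotecluster show all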
The remote cluster name does not match the cluster name supplied by the
mmremotecluster command
This topic describes the error messages that are displayed when the remote cluster name does not match
the cluster name that is provided by the mmremotecluster command, and the courses of action that you
can take to correct the problem.
A mount command fails with a message similar to this:
mmlscluster
GPFS: 6027-510 Cannot mount /dev/gpfs22 on /gpfs22: A remote host did not respond
within the timeout period.
The NSD disk does not have an NSD server specified, and the mounting
cluster does not have direct access to the disks
This topic describes the error messages that are displayed if the file system mount fails, and the courses
of action that you can take to correct the problem.
A file system mount fails with a message similar to this:
To resolve the problem, the cluster that owns and serves the file system must define one or more NSD
servers.
The mmchconfig cipherlist=AUTHONLY command must be run on both the cluster that owns and
controls the file system, and the cluster that is attempting to mount the file system.
Error numbers specific to GPFS application calls when file system manager
appointment fails
Certain error numbers and messages are displayed when the file system manager appointment fails.
When the appointment of a file system manager is unsuccessful after multiple attempts, GPFS may report
these error numbers in error logs, or return them to an application:
ENO_MGR = 212, The current file system manager failed and no new manager could be appointed.
This usually occurs when a large number of disks are unavailable or when there has been a major
network failure. Run mmlsdisk to determine whether disks have failed; if they have, take corrective
action by issuing the mmchdisk command.
Before a disk is added to or removed from a file system, a check is made that the GPFS configuration data
for the file system is in agreement with the on-disk data for the file system. The preceding message is
issued if this check was not successful. This may occur if an earlier GPFS disk command was unable to
complete successfully for some reason. Issue the mmcommon recoverfs command to bring the GPFS
configuration data into agreement with the on-disk data for the file system.
If running mmcommon recoverfs does not resolve the problem, follow the procedures in “Information to
be collected before contacting the IBM Support Center” on page 555, and then contact the IBM Support
Center.
A NO_SPACE error occurs when a file system is known to have adequate free
space
This topic describes why GPFS commands can display a NO_SPACE error even if a file system has free
space, and the course of action that you can take to correct the issue.
An ENOSPC (NO_SPACE) error can be returned even if a file system has remaining space. The
NO_SPACE error might occur even if the df command shows that the file system is not full.
The user might have a policy that writes data into a specific storage pool. When the user tries to create
a file in that storage pool, it returns the ENOSPC error if the storage pool is full. The user next issues the
df command, which indicates that the file system is not full, because the problem is limited to the one
storage pool in the user's policy. In order to see if a particular storage pool is full, the user must issue the
mmdf command.
The following is a sample scenario:
1. The user has a policy rule that says files whose name contains the word 'tmp' should be put into
storage pool sp1 in the file system fs1. This command displays the rule:
mmlspolicy fs1 -L
2. The user moves a file that has the word 'tmp' in its name from the /tmp directory to fs1, which means
that the data of tmpfile should be placed in storage pool sp1:
mv /tmp/tmpfile /fs1/
3. The user issues the df command, which indicates that the file system is not full:
df | grep fs1
This output indicates that the file system is only 51% full.
4. To query the storage usage for an individual storage pool, the user must issue the mmdf command.
mmdf fs1
                ============= ==================== ===================
(data)              280190976     139840000 ( 50%)        20184 ( 0%)
(metadata)          140095488     139840000 (100%)        19936 ( 0%)
                ============= ==================== ===================
(total)             280190976     139840000 ( 50%)        20184 ( 0%)
Inode Information
------------------
Number of used inodes: 74
Number of free inodes: 137142
Number of allocated inodes: 137216
Maximum number of inodes: 150016
In this case, the user sees that storage pool sp1 has 0% free space left and that is the reason for the
NO_SPACE error message.
Negative values occur in the 'predicted pool utilizations', when some files
are 'ill-placed'
This topic describes a scenario in which ill-placed files can cause GPFS to produce a 'Predicted Pool
Utilization' of a negative value, and the course of action that you can take to resolve the issue.
This is a hypothetical situation where ill-placed files can cause GPFS to produce a 'Predicted Pool
Utilization' of a negative value.
Suppose that 2 GB of data from a 5 GB file named abc, that is supposed to be in the system storage pool,
are actually located in another pool. This 2 GB of data is said to be 'ill-placed'. Also, suppose that 3 GB of
this file are in the system storage pool, and no other file is assigned to the system storage pool.
If you run the mmapplypolicy command to schedule file abc to be moved from the system storage pool
to a storage pool named YYY, the mmapplypolicy command does the following:
1. Starts with the 'Current pool utilization' for the system storage pool, which is 3 GB.
2. Subtracts 5 GB, the size of file abc.
3. Arrives at a 'Predicted Pool Utilization' of negative 2 GB.
The mmapplypolicy command does not know how much of an 'ill-placed' file is currently in the wrong
storage pool and how much is in the correct storage pool.
When there are ill-placed files in the system storage pool, the 'Predicted Pool Utilization' can be any
positive or negative value. The positive value can be capped by the LIMIT clause of the MIGRATE rule.
The 'Current Pool Utilizations' should always be between 0% and 100%.
On the other hand, the rm -r command deletes all the files that are contained in the filesets that
are linked under the specified directory. Use the mmunlinkfileset command to remove fileset
junctions.
2. Files and directories are moved from one fileset to another, or a hard link crosses a fileset boundary.
If the user is unaware of the locations of fileset junctions, the mv and ln commands might fail
unexpectedly. In most cases, the mv command automatically compensates for this failure and uses a
combination of cp and rm to accomplish the desired result. Use the mmlsfileset command to view
the locations of fileset junctions. Use the mmlsattr -L command to determine the fileset for any
given file.
3. Because a snapshot saves the contents of a fileset, deleting a fileset that is included in a snapshot
cannot completely remove the fileset.
The fileset is put into a 'deleted' state and continues to appear in the mmlsfileset command
output. Once the last snapshot that contains the fileset is deleted, the fileset is automatically
removed. The mmlsfileset --deleted command indicates deleted filesets and shows their names
in parentheses.
4. Deleting a large fileset might take some time and might be interrupted by other failures, such as disk
errors or system crashes.
When this occurs, the recovery action leaves the fileset in a 'being deleted' state. Such a fileset might
not be linked into the namespace. The corrective action is to finish the deletion by reissuing the fileset
delete command:
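For example (the device and fileset names are illustrative):
mmdelfileset fs1 myFileset -f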
The mmlsfileset command identifies filesets in this state by displaying a status of 'Deleting'.
5. If you unlink a fileset that has other filesets linked below it, any filesets that are linked
to it (that is, child filesets) become inaccessible. The child filesets remain linked to the parent and
become accessible again when the parent is relinked.
6. By default, the mmdelfileset command does not delete a fileset that is not empty.
To empty a fileset, first unlink all its immediate child filesets to remove their junctions from the fileset
to be deleted. Then, while the fileset itself is still linked, use the rm -rf or a similar command
to remove the rest of the contents of the fileset. Now the fileset can be unlinked and deleted.
Alternatively, the fileset to be deleted can be unlinked first and then the mmdelfileset command can
be used with the -f (force) option. This unlinks its child filesets, then deletes the files and directories
that are contained in the fileset.
7. When a small dependent fileset is deleted, it might be faster to use the rm -rf command instead of
the mmdelfileset command with the -f option.
Error numbers specific to GPFS application calls when data integrity may be
corrupted
If there is a possibility of the corruption of data integrity, GPFS displays specific error messages or returns
them to the application.
When there is the possibility of data corruption, GPFS may report these error numbers in the operating
system error log, or return them to an application:
EVALIDATE=214, Invalid checksum or other consistency check failure on disk data structure.
This indicates that internal checking has found an error in a metadata structure. The severity of the
error depends on which data structure is involved. The cause is usually GPFS software, disk
hardware, or other software between GPFS and the disk. Running mmfsck should repair the error. The
urgency of this depends on whether the error prevents access to some file or whether basic metadata
structures are involved.
mmlsnsd -m
mmlsnsd -d t65nsd4b -M
mmlsnsd -X -d "hd3n97;sdfnsd;hd5n98"
mmcrnsd -F StanzaFile -v no
A possible cause for the NSD creation error message is that a previous mmdelnsd command failed to
zero internal data structures on the disk, even though the disk is functioning correctly. To complete the
deletion, run the mmdelnsd command with the -p NSDId option. Do not take this step unless you are
sure that another cluster is not using this disk. The following command is an example:
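A sketch of the command form (NSDId is a placeholder for the ID reported for the disk; substitute the actual value rather than guessing it):
mmdelnsd -p NSDId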
Feb 16 13:11:18 host123 kernel: SCSI device sdu: 35466240 512-byte hdwr sectors (18159 MB)
Feb 16 13:11:18 host123 kernel: sdu: I/O error: dev 41:40, sector 0
Feb 16 13:11:18 host123 kernel: unable to read partition table
On AIX, consult “Operating system error logs” on page 274 for hardware configuration error log entries.
Accessible disk devices will generate error log entries similar to this example for a SSA device:
--------------------------------------------------------------------------
LABEL: SSA_DEVICE_ERROR
IDENTIFIER: FE9E9357
Description
DISK OPERATION ERROR
Probable Causes
DASD DEVICE
Failure Causes
DISK DRIVE
Recommended Actions
PERFORM PROBLEM DETERMINATION PROCEDURES
Detail Data
ERROR CODE
2310 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
---------------------------------------------------------------------------
---------------------------------------------------------------------------
LABEL: MMFS_DISKFAIL
IDENTIFIER: 9C6C05FA
Description
DISK FAILURE
Probable Causes
STORAGE SUBSYSTEM
DISK
Failure Causes
STORAGE SUBSYSTEM
DISK
Recommended Actions
CHECK POWER
RUN DIAGNOSTICS AGAINST THE FAILING DEVICE
Detail Data
EVENT CODE
1027755
VOLUME
fs3
RETURN CODE
19
PHYSICAL VOLUME
vp31n05
-----------------------------------------------------------------
mmumount fs1 -a
3. If you are replacing the disk, add the new disk to the file system:
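For example, using a disk stanza file that describes the replacement disk (the file name is illustrative):
mmadddisk fs1 -F /tmp/newdisk.stanza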
mmrestripefs fs1 -b
Note: Ensure there is sufficient space elsewhere in your file system for the data to be stored by using
the mmdf command.
mmlsnsd -m
The system displays any underlying physical device present on this node, which is backing the NSD. If the
underlying device is a logical volume, issue the following command to map from the logical volume to the
volume group.
lsvg -o | lsvg -i -l
lsvg gpfs1vg
Here the output shows that on each of the five nodes the volume group gpfs1vg is the same physical disk
(has the same pvid). The hdisk numbers vary, but the fact that they may be called different hdisk names
on different nodes has been accounted for in the GPFS product. This is an example of a properly defined
volume group.
If any of the pvids were different for the same volume group, this would indicate that the same volume
group name has been used when creating volume groups on different physical volumes. This will not work
for GPFS. A volume group name can be used only for the same physical volume shared among nodes in
a cluster. For more information, refer to AIX in IBM Documentation and search for operating system and
device management.
mmlsdisk fs1 -e
GPFS will mark disks down if there have been problems accessing the disk.
2. To prevent any I/O from going to the down disk, issue these commands immediately:
Note: If there are any GPFS file systems with pending I/O to the down disk, the I/O will time out if the
system administrator does not stop it.
To see if there are any threads that have been waiting a long time for I/O to complete, on all nodes
issue:
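A typical check for long I/O waiters (a sketch; run it on each node, with mmdiag as the supported interface):
mmdiag --waiters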
3. The next step is irreversible! Do not run this command unless data and metadata have been replicated.
This command scans file system metadata for disk addresses belonging to the disk in question, then
replaces them with a special "broken disk address" value, which might take a while.
CAUTION: Be extremely careful with using the -p option of mmdeldisk, because by design it
destroys references to data blocks, making affected blocks unavailable. This is a last-resort
tool, to be used when data loss might have already occurred, to salvage the remaining data,
which means it cannot take any precautions. If you are not absolutely certain about the state of
the file system and the impact of running this command, do not attempt to run it without first
contacting the IBM Support Center.
Replica mismatches
IBM Storage Scale includes logic that ensures data and metadata replicas always have identical content.
A replica mismatch is a condition in which two or more replicas of a data or metadata block differ from
each other. Replica mismatches might happen for the following reasons:
• Corrupted log files that are found by IBM Storage Scale (replica writes are protected by IBM Storage
Scale journals).
• Hardware issues such as a missed write or a redirected write that led to stale or corrupted disk blocks.
• Software bugs in administrative options for commands such as mmchdisk start and mmrestripefs
that fail to sync the replicas.
When a replicated block has mismatched replicas, the wrong replica block can be read and cause further
data integrity issues or application errors.
If the mismatch is in a metadata block and is found by IBM Storage Scale to be corrupted, it is flagged
in the system log as an FSSTRUCT error. The next replica block is then read. While this action does not
disrupt file system operations, it leaves the metadata block with an insufficient number of good replicas.
A failure of the disk that contains the good replica can lead to data or metadata loss. Alternatively, if the
replica block that is read contains valid but stale metadata, then it can lead to further corruption of the
data and metadata. For instance, if the block belongs to the block allocation map system file, then reading
a stale replica block of this file means that IBM Storage Scale sees a stale block allocation state. This
issue might allow IBM Storage Scale to double allocate blocks, which leads to corruption due to block
overwrite. You can identify such a corruption by looking for the FSErrDeallocBlock FSSTRUCT error in the
system log, which is logged at the time of the block's deallocation.
If the mismatch is in a data block, then IBM Storage Scale cannot determine whether the replica that is
read is corrupted since it does not have enough context to validate user data. Hence, applications might
receive corrupted data even when a good replica of the data block is available in the file system.
For these reasons, it is important to be able to repair replica mismatches as soon as they are detected.
You can detect replica mismatches with any of the following methods:
• With an online replica compare:
Note: This option detects data and directory block replica mismatches only.
Any of these methods can be used when file reads return stale or invalid data, or if FSErrDeallocBlock
FSSTRUCT errors are in the system logs.
If replica mismatches are detected, then the next step is to make the replicas consistent. The replicas can
be made consistent by choosing a reference replica block to copy over the other replicas of the block. If
they are metadata block replica mismatches, the reference replica block is chosen by IBM Storage Scale.
However, if they are data block replica mismatches, the reference replica should be chosen by the file
owner.
This command will report replica mismatches with output similar to the following:
…
Scanning user file metadata …
Inode 9824 [fileset 0, snapshot 0 ] has mismatch in replicated disk address 1:173184 3:135808
at block 207
Inode 10304 [fileset 0, snapshot 0 ] has mismatch in replicated disk address 2:145792 1:145920
at block 1
…
You can choose from two methods to resolve the data block replica mismatches. Both options require the
data block replica repair feature to be enabled with the following command:
mmchconfig readReplicaRuleEnabled=yes -i
This configuration option can be enabled and disabled dynamically, so you do not need to restart the
GPFS daemon. You might experience a small impact on performance for file reads due to enabling
the configuration option. It is advised to turn off this configuration option after all of the data block
replica mismatches are repaired. Also, it is advised to turn off this configuration option while performing
operations like restriping and rebalancing the file system. For more information about the two methods
to resolve the data block replica mismatches, see “Repairing data block replica mismatches with the
global replica selection rule” on page 417 and “Repairing data block replica mismatches with the file level
replica selection rule” on page 418.
Repairing data block replica mismatches with the global replica selection rule
Follow this procedure if the data block replica mismatches are due to one or more bad disks.
You can confirm if one or more disks are bad by looking at the frequency of disks that contain mismatched
replicas in the online replica compare operation output. Follow the steps to exclude and repair the data
block replica mismatches.
1. To exclude the bad disks from being read, run the following command:
mmchconfig diskReadExclusionList=<nsd1;nsd2;...> -i
Setting this configuration option prevents the read of data blocks from the specified disks when the
disks have one of the following statuses: ready, suspended, or replacement. If all of the replicas of a
data block are on read-excluded disks, then the data block is fetched from the disk that was specified
earlier in the diskReadExclusionList.
Where SnapPath is the path to the snapshot root directory, which contains the InodeNumber with
replica mismatches. If the replica mismatch is for a file in the active file system, then SnapPath would
be the path of the root directory of the active file system. For example:
Run this command on each of the inodes that were reported by the earlier online replica compare
operation.
4. To disable the diskReadExclusionList configuration option, run the following command:
mmchconfig diskReadExclusionList=DEFAULT -i
This method provides a fast way to exclude data block reads from disks with stale data. To exercise
more granular control over which data block replicas are read per file, see “Repairing data block replica
mismatches with the file level replica selection rule” on page 418.
Repairing data block replica mismatches with the file level replica selection
rule
Follow this procedure if you want to select the reference data block replica among mismatched replicas
on a per file basis.
This method allows for more granular control over which data block replicas are read for a file. Any
user who has permission to write on the file can use this method. Before you start, make sure that you
determine the list of the mismatched replicas of the file with one of the following commands:
You can figure out the correct data block replicas among the mismatched replicas of the file by setting
a file-specific replica selection rule. This rule is in the form of an extended attribute that is called
readReplicaRule, which is under the gpfs namespace. This rule causes a subsequent read of the file to
return the data block replicas as specified by the rule. You can then validate the file data by processing it
through an associated application.
Note: Setting this extended attribute invalidates existing caches of the file so that subsequent reads of
the file fetch the data from disk.
1. Set the gpfs.readReplicaRule extended attribute with one of the following methods:
3. You can also set the extended attribute by using an inode number instead of the file name:
The gpfs.readReplicaRule extended attribute can be set on any valid file (not a directory or soft link)
including clone parents, snapshot user files, and files with immutable or append-only flags. In most
cases, to set this extended attribute, you need permission to write on the file. However, if you own the
file, you can set the attribute even if you have only the permission to read the file. This attribute can be
set even on a read-only mounted file system.
This attribute is specific to a file, so it does not get copied during DMAPI backup, AFM, clone creation,
or snapshot copy-on-write operations. Similarly, this attribute cannot be restored from a snapshot.
If you do not have enough space to store the gpfs.readReplicaRule extended attribute, then you can
temporarily delete one or more of the existing user-defined extended attributes. After you repair the
replica mismatches in the file, you can delete the gpfs.readReplicaRule extended attribute and restore the
earlier user-defined attributes.
To save and restore all of the extended attributes for a file, run the following commands:
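A typical save step (a sketch; getfattr is a standard Linux tool, and the file path and dump location are illustrative):
getfattr -d -m - /fs1/myfile > /tmp/attr.save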
setfattr --restore=/tmp/attr.save
Note: It is possible to filter out all of the replicas of a data block by using the gpfs.readReplicaRule
extended attribute. In such a case, the block read fails with I/O error.
Files with the gpfs.readReplicaRule extended attribute might experience a small impact on read
performance because of parsing the rule string for every data block that is read from the disk. Thus,
it is advised to delete the gpfs.readReplicaRule extended attribute after you repair the data block replica
mismatches.
rule = block_sub_rule | file_sub_rule | (block_sub_rule, "; ", file_sub_rule);
Note: The block_sub_rule applies to the specified blocks only; the file_sub_rule applies to all of the
blocks in the file; sub_rules are evaluated left to right.
Note: disk_exclusion_list discards matching disk addresses from the replica set.
When a data block of a file is read from disk, all of the replica disk addresses of the data block are
filtered by either an existing gpfs.readReplicaRule extended attribute or an existing diskReadExclusionList
configuration option. If both are present, then the gpfs.readReplicaRule extended attribute is evaluated
first. If it fails to match the block that is being read, then the diskReadExclusionList configuration option
is applied instead. The data block is then read using only the filtered set of replica disk addresses. If the
read fails due to an I/O error or because the filtered set of replica disk addresses is empty, then the error
is returned to the application.
For more information, see “Example of using the gpfs.readReplicaRule string” on page 420.
mmchattr --set-attr gpfs.readReplicaRule="b=1:2 r=1,x; b=3 r=1; b=3 r=0; d=3,1" <filename>
The following data block disk address list is for a file with 2 DataReplicas and 3 MaxDataReplicas:
The replica rule picks the reference replica for each block that is read as follows:
If you have permission to read a file, you can check whether the gpfs.readReplicaRule extended attribute
is set in the file by one of the following methods:
Alternatively, a policy rule such as the following example can be used to show the gpfs.readReplicaRule
extended attribute:
DEFINE(DISPLAY_NULL,[COALESCE($1,'_NULL_')])
RULE EXTERNAL LIST 'files' EXEC ''
RULE LIST 'files' DIRECTORIES_PLUS SHOW(DISPLAY_NULL(XATTR('ATTR')))
The extended attribute can also be queried by using the inode number instead of the file name:
Note: This action requires root privilege.
To verify the effects of a gpfs.readReplicaRule string, you can dump the distribution of disk numbers for
each replica of each block of the file by using any of the following commands:
mmlsattr -D <FilePath>
Running one of those commands provides output similar to the following example:
A disk number that is prefixed with an asterisk indicates the data block replica that is read from the disk
for that block. By default, the first valid data block replica is always returned on read. An e at the end of
a row indicates that the gpfs.readReplicaRule selected an invalid replica and hence the read of this block
returns an error. You can change which data block replica is read by changing the readReplicaPolicy global
configuration option, the diskReadExclusionList global configuration option, or the gpfs.readReplicaRule
extended attribute. Thus, using a memory dump is a good way to check the effects of these settings.
mmrestripefile -c <Filename>
Note: To run this command, the file system must be read and write mounted. Additionally, you must have
permission to write on the file. This command is always audited to syslog for non-root users.
After the repair is completed, you can delete the gpfs.readReplicaRule extended attribute. Alternatively,
you can defer repair of the file and continue to do read and write operations on the file with the
gpfs.readReplicaRule extended attribute present.
The following steps summarize the workflow:
1. Identify the mismatched data block replicas in a file.
2. Ensure that the readReplicaRuleEnabled global configuration option is set to yes.
3. Write the gpfs.readReplicaRule extended attribute to select a replica index for each data block with
mismatched replicas.
4. Verify that the gpfs.readReplicaRule extended attribute selects the replicas as expected with the
mmlsattr -D command.
5. Validate the file by processing it through its associated application.
Note: The validation process should involve only reads of the file. Any attempt to write to the
blocks with mismatched replicas will overwrite all replicas. If the replica that is selected by the
gpfs.readReplicaRule extended attribute is incorrect, then writing to the block using the bad replica will
permanently corrupt the block.
6. If the file validation fails, then retry steps 2, 3, and 4 with a different replica index for the problem data
blocks.
7. After the file passes validation, repair the data block replica mismatches in the file with the
mmrestripefile -c command.
8. Delete the gpfs.readReplicaRule extended attribute.
You can rebalance the file system at the same time by issuing:
mmrestripefs fs1 -r
Optionally, use the -b flag instead of the -r flag to rebalance across all disks.
Note: Rebalancing of files is an I/O intensive and time consuming operation, and is important only for
file systems with large files that are mostly invariant. In many cases, normal file update and creation
will rebalance your file system over time, without the cost of the rebalancing.
3. Optionally, check the file system for metadata inconsistencies by issuing the offline version of mmfsck:
mmfsck fs1
Even if mmfsck succeeds, errors might have occurred. Check to verify that no files were lost. If files
containing user data were lost, you must restore the files from the backup media.
If mmfsck fails, sufficient metadata was lost and you need to recreate your file system and restore the
data from backup media.
Strict replication
Use the mmchfs -K no command to disable strict replication before you perform a disk action.
If data or metadata replication is enabled, and the status of an existing disk changes so that the disk
is no longer available for block allocation (if strict replication is enforced), you may receive an errno of
ENOSPC when you create or append data to an existing file. A disk becomes unavailable for new block
allocation if it is being deleted, replaced, or it has been suspended. If you need to delete, replace, or
suspend a disk, and you need to write new data while the disk is offline, you can disable strict replication
by issuing the mmchfs -K no command before you perform the disk action. However, data written while
replication is disabled will not be replicated properly. Therefore, after you perform the disk action, you
must re-enable strict replication by issuing the mmchfs -K command with the original value of the -K
option (always or whenpossible) and then run the mmrestripefs -r command. To determine if a
disk has strict replication enforced, issue the mmlsfs -K command.
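A hedged sketch of the complete sequence for a file system fs1 that was created with -K always (names are illustrative):
mmchfs fs1 -K no
# perform the disk action, for example with mmdeldisk or mmrpldisk
mmchfs fs1 -K always
mmrestripefs fs1 -r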
Note: A disk in a down state that has not been explicitly suspended is still available for block allocation,
and thus a spontaneous disk failure will not result in application I/O requests failing with ENOSPC. While
new blocks will be allocated on such a disk, nothing will actually be written to the disk until its availability
changes to up following an mmchdisk start command. Missing replica updates that took place while
the disk was down will be performed when mmchdisk start runs.
No replication
If no replication has been done and the system metadata has been lost, you might need to force the
unmount yourself. You can follow this course of action for manual recovery.
When there is no replication, the system metadata has been lost and the file system is basically
irrecoverable. You may be able to salvage some of the user data, but it will take work and time. A forced
unmount of the file system will probably already have occurred. If not, it probably will very soon if you try
to do any recovery work. You can manually force the unmount yourself:
1. Mount the file system read-only to salvage as much user data as possible. For example:
mount -o ro /dev/fs1
2. If you read a file in block-size chunks and get an EIO return code, that block of the file has been
lost. The rest of the file might have useful data to recover, or it can be erased. To save the file system
parameters for re-creation of the file system, record them first (for example, with the mmlsfs
command), and then delete the file system:
mmdelfs fs1
3. To repair the disks, see your disk vendor problem determination guide. Follow the problem
determination and repair actions specified.
4. Delete the affected NSDs. Issue:
mmdelnsd nsdname
5. Create a disk descriptor file for the disks to be used. This will include recreating NSDs for the new file
system.
6. Recreate the file system with either different parameters or the same as you used before. Use the disk
descriptor file.
7. Restore lost data from backups.
Error numbers specific to GPFS application calls when disk failure occurs
There are certain error numbers associated with GPFS application calls when disk failure occurs.
When a disk failure has occurred, GPFS may report these error numbers in the operating system error log,
or return them to an application:
EOFFLINE = 208, Operation failed because a disk is offline
This error is most commonly returned when an attempt to open a disk fails. Since GPFS will attempt
to continue operation with failed disks, this error is returned when the disk is first needed to complete
a command or an application request.
If needed, use the AIX chdev command to set reserve_policy and PR_key_value.
Note: GPFS manages reserve_policy and PR_key_value using reserve_policy=PR_shared when
Persistent Reserve support is enabled and reserve_policy=no_reserve when Persistent Reserve
is disabled.
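A hedged example of setting these attributes on AIX (the disk name and key value are hypothetical):
chdev -l hdisk5 -a reserve_policy=PR_shared -a PR_key_value=0x00006d01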
/usr/lpp/mmfs/bin/tsprreadkeys hdiskx
2. To check the AIX ODM status of a single disk on a node, issue the following command from a node that
has access to the disk:
/usr/lpp/mmfs/bin/tsprreadkeys sdp
If the registered key values all start with 0x00006d, which indicates that the PR registration was
issued by GPFS, proceed to the next step to verify the SCSI-3 PR reservation type. Otherwise, contact
your system administrator for information about clearing the disk state.
2. Display the reservation type on the disk:
/usr/lpp/mmfs/bin/tsprreadres sdp
yes:LU_SCOPE:WriteExclusive-AllRegistrants:0000000000000000
2. Verify that the specified HexValue has been registered to the disk:
/usr/lpp/mmfs/bin/tsprreadkeys sdp
/usr/lpp/mmfs/bin/tsprreadkeys sdp
/usr/lpp/mmfs/bin/tsprreadres sdp
no:::
mmlsdisk dmfs2 -M
# multipath -ll
mpathae (36005076304ffc0e50000000000000001) dm-30 IBM,2107900
[size=10G][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=8][active]
\_ 1:0:5:1 sdhr 134:16 [active][ready]
\_ 1:0:4:1 sdgl 132:16 [active][ready]
\_ 1:0:1:1 sdff 130:16 [active][ready]
\_ 1:0:0:1 sddz 128:16 [active][ready]
\_ 0:0:7:1 sdct 70:16 [active][ready]
\_ 0:0:6:1 sdbn 68:16 [active][ready]
\_ 0:0:5:1 sdah 66:16 [active][ready]
\_ 0:0:4:1 sdb 8:16 [active][ready]
The mmlsdisk output shows that I/O for NSD m0001 is being performed on disk /dev/sdb, but it should
show that I/O is being performed on the device-mapper multipath (DMM) /dev/dm-30. Disk /dev/sdb is
one of eight paths of the DMM /dev/dm-30 as shown from the multipath command.
This problem could occur for the following reasons:
• The previously installed user exit /var/mmfs/etc/nsddevices is missing. To correct this, restore
user exit /var/mmfs/etc/nsddevices and restart GPFS.
• The multipath device type does not match the GPFS known device type. For a list of known device
types, see /usr/lpp/mmfs/bin/mmdevdiscover. After you have determined the device type for
your multipath device, use the mmchconfig command to change the NSD disk to a known device type
and then restart GPFS.
The following output shows that device type dm-30 is dmm:
To change the NSD device type to a known device type, create a file that contains the NSD name and
device type pair (one per line), for example:
m0001 dmm
Then issue this command:
mmchconfig updateNsdType=/tmp/filename
Kernel panics with the message "GPFS deadman switch timer has
expired and there are still outstanding I/O requests"
This problem can be detected by an error log with a label of KERNEL_PANIC, and the PANIC MESSAGES
or a PANIC STRING.
For example:
GPFS Deadman Switch timer has expired, and there are still outstanding I/O requests
GPFS is designed to tolerate node failures through per-node metadata logging (journaling). The log file is
called the recovery log. In the event of a node failure, GPFS performs recovery by replaying the recovery
log for the failed node, thus restoring the file system to a consistent state and allowing other nodes to
continue working. Prior to replaying the recovery log, it is critical to ensure that the failed node has indeed
failed, as opposed to being active but unable to communicate with the rest of the cluster.
In the latter case, if the failed node has direct access (as opposed to accessing the disk with an NSD
server) to any disks that are a part of the GPFS file system, it is necessary to ensure that no I/O requests
submitted from this node complete once the recovery log replay has started. To accomplish this, GPFS
uses the disk lease mechanism. The disk leasing mechanism guarantees that a node does not submit any
more I/O requests once its disk lease has expired, and the surviving nodes use disk lease time out as a
guideline for starting recovery.
This situation is complicated by the possibility of 'hung I/O'. If an I/O request is submitted prior to the disk
lease expiration, but for some reason (for example, device driver malfunction) the I/O takes a long time
to complete, it is possible that it may complete after the start of the recovery log replay during recovery.
This situation would present a risk of file system corruption. In order to guard against such a contingency,
when I/O requests are being issued directly to the underlying disk device, GPFS initiates a kernel timer
that is referred to as the deadman switch timer. The deadman switch timer goes off in the event
of disk lease expiration, and checks whether there are any outstanding I/O requests. If any I/O is
pending, a kernel panic is initiated to prevent possible file system corruption.
Such a kernel panic is not an indication of a software defect in GPFS or the operating system kernel, but
rather it is a sign of:
1. Network problems (the node is unable to renew its disk lease).
2. Problems accessing the disk device (I/O requests take an abnormally long time to complete). See
“MMFS_LONGDISKIO” on page 276.
IOMMU disabled
ACS disabled
If you want to enable tracing of your CUDA application, adopt the following corresponding settings that
are available in /etc/cufile.json.
Note: These settings may impact GDS performance.
• Log level: Level of information to be logged.
• Log location: By default, the trace is written into the current working directory of the CUDA application.
Troubleshooting information in NVIDIA documentation is available at GPUDirect Storage Troubleshooting.
You can use the mmhealth node show GDS command to check the health status of the GDS
component. For more information about the various options that are available with mmhealth command,
see mmhealth command in IBM Storage Scale: Command and Programming Reference Guide.
Error recovery
CUDA retries failed GDS read and GDS write requests in the compatibility mode. As the retry is a regular
POSIX read() or write() system call, all GPFS limitations regarding error recovery apply in general.
Restriction counters
# mmdiag --gds
mmfslog
The GDS feature of IBM Storage Scale provides specific entries in the mmfslog file that indicate a
successful initialization.
If the IBM Storage Scale log file contains the following warning message:
[W] VERBS RDMA open error verbsPort <port> due to missing support for atomic operations for
device <device>
Check the description of the verbsRdmaWriteFlush configuration variable in the mmchconfig command
topic in IBM Storage Scale: Command and Programming Reference Guide for possible options.
Syslog
Detailed information about the NVIDIA driver registration and de-registration can be found in the syslog in
case of errors. The corresponding messages look similar to:
Traces
Specific GDS I/O traces can be generated by using the mmtracectl command. For more details, see
mmtracectl command in IBM Storage Scale: Command and Programming Reference Guide.
Support data
If all previous steps do not help and support needs to collect debug data, use the gpfs.snap command
to download all relevant files and diagnostic data to analyze the potential issues. For more details about
the various options that are available with the gpfs.snap command, see gpfs.snap command in IBM
Storage Scale: Command and Programming Reference Guide.
Common errors
1. RDMA is not enabled.
GPUDirect Storage (GDS) requires RDMA to be enabled. If RDMA is not enabled, an I/O error
(EIO=-5) occurs as shown in the following example:
# gdsio -f /ibm/gpfs0/gds/file.dat -x 0 -I 0 -s 1G -i 1m -d 0 -w 1
Error: IO failed stopping traffic, fd :33 ret:-5 errno :1
io failed :ret :-5 errno :1, file offset :0, block size :1048576
When such an error occurs, verify that the system is configured correctly. For more information,
see the Configuring GPUDirect Storage for IBM Storage Scale topic in IBM Storage Scale RAID:
Administration.
2. GPUDirect Storage is not supported on the current file.
# gdsio -f /ibm/gpfs0/gds/file.dat -x 0 -I 0 -s 1G -i 1m -d 0 -w 1
Error: IO failed stopping traffic, fd :27 ret:-5008 errno :17
io failed : GPUDirect Storage not supported on current file, file offset :0, block size
:1048576
When such an error occurs, verify that the system is configured correctly. For more information, see
the Configuring GPUDirect Storage for IBM Storage Scale topic in IBM Storage Scale: Administration
Guide.
Important: Ensure that the rdma_dev_addr_list configuration parameter has the correct value in
the /etc/cufile.json file.
Encryption issues
The topics that follow provide solutions for problems that might be encountered while setting up or using
encryption.
When mmapplypolicy is invoked to perform a key rewrap, the command may issue messages like the
following:
[E] Error on gpfs_enc_file_rewrap_key(/fs1m/sls/test4,KEY-d7bd45d8-9d8d-4b85-a803-e9b794ec0af2:hs21n56_new,KEY-40a0b68b-
c86d-4519-9e48-3714d3b71e20:js21n92)
Permission denied(13)
Authentication issues
This topic describes the authentication issues that you might experience while using file and object
protocols.
To check the active events for authentication, run the following command:
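For example (a sketch, assuming the AUTH component name that mmhealth uses for authentication monitoring):
mmhealth node show AUTH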
To check the current authentication configuration across the cluster, run the following command:
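A typical way to do this (mmuserauth is the command that manages protocol authentication):
mmuserauth service list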
Solution
Rectify the configuration by running the following command:
Authorization issues
You might receive an unexpected “access denied” error either for native access to file system or for using
the SMB or NFS protocols. Possible steps for troubleshooting the issue are described here.
Note: ACLs used in the object storage protocols are separate from the file system ACLs, and
troubleshooting in that area should be done differently. For more information, see “Object issues” on
page 462.
If the output does not report an NFSv4 ACL type in the first line, then consider changing the ACL
to the NFSv4 type. For more information on how to configure the file system for the recommended
NFSv4 ACL type for protocol usage, see the Authorizing file protocol users topic in the IBM Storage
Scale: Administration Guide. Also, review the ACL entries for permissions related to the observed “access
denied” issue.
/usr/lpp/mmfs/bin/wbinfo -a 'domainname\username'
id 'domainname\username'
If the cluster is configured with a different authentication method, then query the group membership of
the user:
id 'username'
If the user is a member of many groups, compare the number of group memberships with the limitations
that are listed in the IBM Storage Scale FAQ. For more information, see https://www.ibm.com/docs/en/
STXKQY/gpfsclustersfaq.html.
If a group is missing, check the membership of the user in the missing group in the authentication server.
Also, check the ID mapping configuration for that group and check whether the group has an ID mapping
that is configured and if it is in the correct range. You can query the configured ID mapping ranges by
using this command:
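For example, for file access (a typical invocation; the output includes the configured ID mapping ranges):
/usr/lpp/mmfs/bin/mmuserauth service list --data-access-method file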
If the expected groups are missing in the output from the ID command and the authentication method
is Active Directory with trusted domains, check the types of the groups in Active Directory. Not all group
types can be used in all Active Directory domains.
If the access issue is sporadic, repeat the test on all protocol nodes. Since authentication and ID mapping
are handled locally on each protocol node, a problem might affect only one protocol node, and hence only
protocol connections that are handled on that protocol node are affected.
To analyze the trace, extract it and look for the error code NT_STATUS_ACCESS_DENIED.
/usr/lpp/mmfs/bin/mmtracectl --start
/usr/lpp/mmfs/bin/mmtracectl --stop
Description
When the user tries to install the IBM Security Lifecycle Manager prerequisites, the system displays the
following error:
Cause
The system displays this error when the system packages are not upgraded.
Proposed workaround
• All system packages must be upgraded, except the kernel, which must remain at version 6.3 for
encryption to work correctly.
• Update all packages excluding kernel:
• Modify: /etc/yum.conf
[main]
…
exclude=kernel* redhat-release*
Description
When the user tries to install IBM Security Lifecycle Manager, the system displays the following errors:
Cause
The system displays this error when the system packages are not upgraded.
Proposed workaround
• All system packages must be upgraded, except the kernel, which must remain at version 6.3 for
encryption to work correctly.
• Run through the following checklist before installing IBM Security Lifecycle Manager:
NFS issues
This topic describes some of the possible problems that can be encountered when GPFS interacts with
NFS.
If you encounter server-side issues with NFS:
1. Identify which NFS server or CES node is being used.
2. Run the mmhealth command and review its output.
3. Check whether all required file systems are mounted on the node that is being used, including the CES
shared root.
4. Review /var/log/ganesha.log. Messages tagged as CRIT, MAJ, or EVENT are about the state of
the NFS server.
5. Use the ganesha_stats utility to monitor NFS performance.
This utility can capture statistics for the NFS server, for example, for all NFSv3 and NFSv4 operations
and for export-related and authentication-related operations. This utility is not cluster aware and
provides information only about the NFS server that is running on the local node.
When GPFS interacts with NFS, you can encounter the following problems:
• “NFS client with stale inode data” on page 443
• “NFSv4 ACL problems” on page 405
ping <server-ip>
ping <server-name>
The expected results are that the output indicates that the NFS service is running as in this
example:
Enabled services: SMB NFS
SMB is running, NFS is running
b. On the NFS server node, issue the following command:
rpcinfo -p
The expected result is that portmapper, mountd, and NFS are running as shown in the
following sample output.
iptables -L
Then, check whether any hosts or ports that are involved with the NFS connection are blocked
(denied).
If the client and the server are running in different subnets, then a firewall might be running on the
router.
4. Check whether the firewall is blocking NFS traffic on the client or router by using the appropriate
commands.
Solution
On the NFS server, specify an access type (for example, RW for Read and Write) for export. If
the export is already created, then you can change the access type by using the mmnfs export
change command. See the following example. The backslash (\) is a line continuation character.
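The following is a sketch of such a change, assuming an export named /mnt/gpfs0/nfs_share1:

mmnfs export change /mnt/gpfs0/nfs_share1 \
--nfschange "*(Access_Type=RW)"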
Verification
Verify that the access type is specified for NFS export by using the mmnfs export list command
on the NFS server. For example,
Path                  Delegations  Clients  Access_Type  Protocols  Transports  Squash          Anonymous_uid  Anonymous_gid  SecType  PrivilegedPort  Export_id  DefaultDelegations  Manage_Gids  NFS_Commit
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
/mnt/gpfs0/nfs_share1 none         *        RW           3,4        TCP         NO_ROOT_SQUASH  -2             -2             KRB5     FALSE           2          none                FALSE        FALSE
Solution
Verification
Verify the protocols that are specified for the export by using the mmnfs export list command. For
example,
Path                  Delegations  Clients  Access_Type  Protocols  Transports  Squash          Anonymous_uid  Anonymous_gid  SecType  PrivilegedPort  DefaultDelegations  Manage_Gids  NFS_Commit
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
/mnt/gpfs0/nfs_share1 none         *        RW           3,4        TCP         NO_ROOT_SQUASH  -2             -2             SYS      FALSE           none                FALSE        FALSE
mmlscluster --ces
mmces service list -a
2. Ensure that the firewall allows NFS traffic to pass through. To allow the NFS traffic, the CES
NFS service must be configured with explicit NFS ports so that discrete firewall rules can be
established. Issue the following command on the client.
3. Verify that the NFS client is allowed to mount the export. In NFS terms, a definition exists for this
client for the export to be mounted. Check NFS export details by using the following command.
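For example:
mmnfs export list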
Path                  Delegations  Clients  Access_Type  Protocols  Transports  Squash          Anonymous_uid  Anonymous_gid  SecType  PrivilegedPort  DefaultDelegations  Manage_Gids  NFS_Commit
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
/mnt/gpfs0/nfs_share1 none         *        RW           3,4        TCP         NO_ROOT_SQUASH  -2             -2             SYS      FALSE           none                FALSE        FALSE
showmount -e <CES_IP_ADDRESS>
Mount the server virtual file-system root / on an NFSv4 client. Navigate through the virtual file
system to the export.
If you have a remote cluster environment with an owning cluster and an accessing cluster, and
the accessing cluster exports the file system of the owning cluster through CES NFS, IP failback
might occur before the remote file systems are mounted. This action can cause I/O failures with
existing CES NFS client mounts and new mount request failures. To avoid I/O failures, stop and
start CES NFS on the recovered node after you run the mmstartup and mmmount <remote FS>
commands. Stop and restart the CES NFS by using the following commands.
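For example:
mmces service stop nfs
mmces service start nfs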
Solution
Restart CES NFS on the local CES node by using commands mmces service stop nfs and mmces
service start nfs.
ps -C gpfs.ganesha.nfsd
/usr/bin/ganesha_stats
ERROR: Can't talk to ganesha service on d-bus. Looks like Ganesh is down.
Solution
Restart CES NFS on the local CES node by using commands mmces service stop nfs and mmces
service start nfs.
ps -C rpc.statd
Solution
Restart CES NFS on the local CES node by using commands mmces service stop nfs and mmces
service start nfs.
Solution
Check to see whether portmapper is running and if portmapper (rpcbind) is configured to automatically
start on system startup.
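For example, on systemd-based distributions:
systemctl status rpcbind
systemctl enable rpcbind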
NFS client cannot mount NFS exports from all protocol nodes
Cause
The NFS client can mount NFS exports from some but not all protocol nodes. The exports are not seen
when a showmount is run against the protocol nodes where this problem surfaces.
Determination
Solution
On the CNFS shared directory, change the permission to 755 so that the shared directory is readable to all
users.
You might need to reboot a node, if this problem persists after the NFS service restart.
For more information about NFS events, see “Events” on page 559.
mmgetacl Path
umount <Path>
mount <mount_options> CES_IP_address:<export_path> <mount_point>
CES NFS log levels can be adjusted by the user to select the amount of logging done by the server. Every
increase in the log setting adds messages. The default setting of EVENT includes messages that are tagged
as EVENT, WARN, CRIT, MAJ, and FATAL, but does not show INFO, DEBUG, MID_DEBUG, or FULL_DEBUG messages:
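For example, to raise the log level for debugging (a sketch; revert to EVENT afterward, because higher log
levels can affect performance):
mmnfs config change LOG_LEVEL=DEBUG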
SMB issues
This topic describes SMB-related issues that you might come across while using the IBM Storage Scale
system.
mmlscluster --ces
This shows at a glance whether nodes are failed or whether they host public IP addresses. For
successful SMB operation, at least one CES node must be HEALTHY and host at least one IP address.
• To show which services are enabled, issue the following command:
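For example (typical invocations; the second command produces the per-node SMB state listing that
follows):
mmces service list -a
mmces state show SMB -a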
NODE SMB
prt001st001 HEALTHY
prt002st001 HEALTHY
prt003st001 HEALTHY
prt004st001 HEALTHY
prt005st001 HEALTHY
prt006st001 HEALTHY
prt007st001 HEALTHY
• To show the reason for a currently active (failed) state on all nodes, issue the following command:
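For example (a typical invocation):
mmces events active SMB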
In this case nothing is listed because all nodes are healthy and so there are no active events. If a node
was unhealthy it would look similar to this:
• To show the history of events generated by the monitoring framework, issue the following command:
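For example (a typical invocation):
mmces events list SMB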
• To retrieve monitoring state from health monitoring component, issue the following command:
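For example:
mmhealth node show SMB -N cesNodes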
/var/adm/ras/*
/var/log/messages
Cause
The system did not recognize the specified password.
Verification
Verify the password by running the following command on an IBM Storage Scale protocol node:
/usr/lpp/mmfs/bin/wbinfo -a '<domain>\<user>'
SMB client on Linux fails with the NT status password must change
error message
This topic describes how to verify and resolve an NT status password must change error on the
SMB client on Linux.
Description
The user is trying to access the SMB client on Linux and receives this error message:
NT_STATUS_PASSWORD_MUST_CHANGE
Cause
The specified password expired.
Verification
Verify the password by running the following command on an IBM Storage Scale protocol node:
/usr/lpp/mmfs/bin/wbinfo -a '<domain>\<user>'
The root causes for this error are the same as for “SMB client on Linux fails with an NT status logon
failure” on page 455.
Mount.Cifs on Linux fails with mount error (127) "Key has expired"
Description
The user is trying to access a CIFS share and receives the following error message:
key has expired
The root causes for this error are the same as for “SMB client on Linux fails with an NT status logon
failure” on page 455.
Solution
The root causes for this error are the same as that for the failure of SMB client on Linux. For more
information on the root cause, see “SMB client on Linux fails with an NT status logon failure” on page 455.
Net use on Windows fails with "System error 59" for some users
This topic describes how to resolve a "System error 59" when some users attempt to access /usr/lpp/
mmfs/bin/net use on Windows.
Description:
Additional symptoms include
NT_STATUS_INVALID_PARAMETER
errors in the log.smbd file when net use command was invoked on the Windows client for the user with
this problem.
Solution:
gpfs.snap
If GPFS, the network, or the file system is reported as DEGRADED, then investigate and fix the problem.
In addition, you can also check the /var/adm/ras/log.smbd log file on all protocol nodes.
An entry of vfs_gpfs_connect: SMB share fs1, path /ibm/fs1 not in GPFS file
system. statfs magic: 0x58465342 in the log file indicates that the SMB share path does not
point to a GPFS file system or that the file system is not mounted. If the file system is not mounted, then
you must mount the file system again on the affected nodes.
When CTDB points to a hot record in locking.tdb, use the net tdb locking command to determine the
file behind this record:
If this happens on the root directory of an SMB export, then a workaround can be to exclude that from
cross-node locking:
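The following is a sketch of such an exclusion, assuming an SMB export named smbexport; the
fileid:algorithm option shown here is an example and must be reviewed for your environment:
mmsmb export change smbexport --option "fileid:algorithm=fsname_norootdir"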
CTDB issues
CTDB is a database layer that manages SMB and Active Directory specific information and provides it
consistently across all CES nodes.
CTDB requires network connections to TCP port 4379 between all CES nodes. Internally, CTDB elects
a recovery master among all available CTDB nodes. The elected node then acquires a lock on a
recovery lock file in the CES shared root file system to ensure that no other CES node attempts a
recovery at the same time.
If a status is reported as DISCONNECTED, ensure that all the CES nodes are up and running and
network connections to TCP port 4379 are allowed.
If a status is reported as BANNED, check the log files.
2. Check the CTDB log files on all nodes:
CTDB logs in to the standard syslog. The default syslog file name varies among the Linux
distributions, for example:
/var/log/messages
/var/log/syslog
This usually indicates a communication problem between CTDB on different CES nodes. Check the
node local firewall settings, any network firewalls, and routing to ensure that connections to TCP port
4379 are possible between the CES nodes.
2. Start smbd.
3. After Samba is started, remove the authentication configuration and then re-create it.
Object issues
The following information describes some of the Object-related issues that you might come across while
using IBM Storage Scale.
Important:
• CES Swift Object protocol feature is not supported from IBM Storage Scale 5.1.9 onwards.
• IBM Storage Scale 5.1.8 is the last release that has CES Swift Object protocol.
Checklist 1
Refer to this checklist before you use an object service.
1. Check the cluster state by running the mmgetstate -a command.
The cluster state must be Active.
2. Check the status of the CES IP by running the mmlscluster -ces command.
The system displays all the CES nodes along with their assigned IP addresses.
3. Check the service states of the CES by running the mmces state show -a or mmhealth node
show ces -N cesnodes command.
The overall CES state and object service states must be Healthy.
4. Check the service listing of all the service states by running the mmces service list --verbose
command.
5. Check the authentication status by running the mmuserauth service check command.
6. Check the object auth listing by running the source $HOME/openrc ; openstack user list
command.
The command lists the users in the OpenStack Keystone service.
Checklist 2
Refer to this checklist before you use the keystone service.
1. Check if object authentication has been configured by running the mmuserauth service list
--data-access-method object command.
2. Check the state of object authentication by running the mmces state show AUTH_OBJ -a
command.
3. Check if the protocol node is serving the CES IP by running the mmlscluster --ces command.
4. Check if the object_database_node tag is present on one of the CES IP addresses by running the
mmces address list command.
5. Check if httpd is running on all the CES nodes and postgres is running on the node that has CES IP with
the object_database_node tag by running the mmces service list -v -a command.
6. Check if authentication configuration is correct on all nodes by running the mmuserauth service
check --data-access-method object -N cesNodes command.
7. If the mmuserauth service check command reports an error, run the mmuserauth service check --data-
access-method object --rectify -N <node> command, where <node> is the node on which the error
is reported.
Description
When you authenticate or run any create, update, or delete operation, the system displays one of the
following errors:
{"error": {"message": "An unexpected error prevented the server from fulfilling your request.",
"code": 500, "title": "Internal Server Error"}}
ERROR: openstack An unexpected error prevented the server from fulfilling your request.
(HTTP 500)(Request-ID: req-11399fd1-a601-4615-8f70-6ba275ec3cd6)
Cause
The system displays this error under one or all three of the following conditions:
• The authentication service is not running.
• The system is unable to reach the authentication server.
• The user credentials for Keystone have been changed or have expired.
Proposed workaround
• Finish the steps in Checklist 1.
• Make sure that the IP addresses of the Keystone endpoints are correct and reachable. If you are using a
local Keystone, check if the postgresql-obj service is running.
Description
When the user is authenticating the object service or running the create, update, retrieve, and delete
operations, the system displays the following error:
Error: {"error": {"message": "The request you have made requires authentication.",
"code": 401, "title": "Unauthorized"}}
Cause
The system displays this error under one or both of the following conditions:
• The password, user ID, or service ID entered is incorrect.
• The token being used has expired.
Proposed workaround
• Check your user ID and password. The user IDs in the system can be viewed in the OpenStack user list.
• Check to make sure a valid service ID is provided in the /etc/swift/proxy-server.conf file, in the
filter:authtoken section. Also, check if the password for the service ID is still valid. The service ID
can be viewed in the OpenStack service, project, and endpoint lists.
Description
When an unauthorized user is accessing an object resource, the system displays the following error:
Error: Error: HTTP/1.1 403 Forbidden
Content-Length: 73 Content-Type: text/html; charset=UTF-8 X-Trans-Id:
tx90ad4ac8da9242068d111-0056a88ff0 Date: Wed, 27 Jan 09:37:52 GMT
<html><h1>Forbidden</h1><p>Access was denied to this resource.</p>
Cause
The system displays this error under one or all of the following conditions:
• The user is not authorized by the system to access the resources for a certain operation.
• The endpoint, authentication URL, service ID, keystone version, or API version is incorrect.
Proposed workaround
• To gain authorization and access the resources, contact your system administrator.
• Check your service ID. The service ID can be viewed in the OpenStack service, project, and endpoint
lists.
Description
When the user is unable to connect to the object services, the system displays the following error:
Cause
The system displays this error because of one or both of the following conditions:
• The firewall is running.
• The firewall rules are configured incorrectly.
Proposed workaround
Set up the firewall rules correctly in your system.
For more information, see Installation prerequisites in IBM Storage Scale: Concepts, Planning, and
Installation Guide.
Description
While you perform a create, update, retrieve, or delete task, if you attempt to access a non-existent path,
the system displays the following error:
Cause
The system displays this error because the path you are creating does not exist.
Proposed workaround
Recreate the object or the container before you perform the GET operation.
Description
When the user is trying to create objects and containers for unified file and object access, the system
displays the 400 Bad request error.
Cause
The system displays this error under one or all five of the following conditions:
• The name of the container is longer than 255 characters.
• The name of the object is longer than 214 characters.
• The name of any container in the object hierarchy is longer than 214 characters.
• The path name of the object includes successive forward slashes (/).
• The name of the container and the object is a single period (.) or a double period (..).
Proposed workaround
Keep in mind the following constraints while creating objects and containers for unified file and object
access:
• Make the name of the container no more than 255 characters.
• Make the name of the object no more than 214 characters.
• Make the name of any container in the object hierarchy no more than 214 characters.
• Do not include multiple consecutive forward slashes (///) in the path name of the object.
• Do not make the name of the container or the object a single period (.) or a double period (..). However, a
single period or a double period can be part of the name of the container and the object.
Description
When object is configured with the AD/LDAP authentication and the bind password is being used for LDAP
communication, the system displays the following error:
[root@SSClusterNode3 ~]# openstack user list
ERROR: openstack An unexpected error prevented the server from fulfilling
your request. (HTTP 500) (Request-ID: req-d2ca694a-31e3-46cc-98b2-93556571aa7d)
Authorization Failure. Authorization failed: An unexpected error prevented
the server from fulfilling your request. (HTTP 500) (Request-ID: req-d6ccba54-
baea-4a42-930e-e9576466de3c)
Cause
The system displays this error when the Bind password has been changed on the AD/LDAP server.
Proposed workaround
1. Get the new password from the AD or LDAP server.
2. Run the following command to update the password and restart keystone on any protocol nodes:
mmobj config change --ccrfile keystone.conf --section ldap --property password --value
'<password>'
The value for <password> must be the new password that is obtained in Step 1.
Note: This command restarts Keystone on any protocol nodes.
The password used for running the keystone command has expired or is
incorrect
Refer to the following troubleshooting references and steps for resolving system errors when you are
using an expired or incorrect password for running the keystone command.
Description
When you try to run the keystone command by using a password that has expired or is incorrect, the
system displays the following error:
[root@specscale ~]# openstack user list
ERROR: openstack The request you have made requires authentication. (HTTP 401)
(Request-ID: req-9e8d91b6-0ad4-42a8-b0d4-797a08150cea)
Cause
The system displays the following error when you change the password but are still using the expired
password to access Keystone.
Proposed workaround
Use the correct password to access Keystone.
Description
When object authentication is configured with AD/LDAP and the user is trying to run the keystone
commands, the system displays the following error:
[root@specscale ~]# openstack user list
ERROR: openstack An unexpected error prevented the server from fulfilling your
request. (HTTP 500) (Request-ID: req-d3fe863e-da1f-4792-86cf-bd2f4b526023)
Cause
The system displays this error under one or all of the following conditions:
• Network issues make the LDAP server unreachable.
• The system firewall is running so the LDAP server is not reachable.
• The LDAP server is shut down.
Note:
When the LDAP server is not reachable, the keystone logs can be viewed in the /var/log/keystone
directory.
The following example is an LDAP error found in /var/log/keystone/keystone.log:
/var/log/keystone/keystone.log:2016-01-28 14:21:00.663 25720 TRACE
keystone.common.wsgi result = func(*args,**kwargs)2016-01-28 14:21:00.663 25720
TRACE keystone.common.wsgi SERVER_DOWN: {'desc': "Can't contact LDAP server"}.
Proposed workaround
• Check your network settings.
• Configure your firewall correctly.
• Repair the LDAP server.
Description
You might want to configure object authentication with Active Directory (AD) or Lightweight Directory
Access Protocol (LDAP) by using the TLS certificate for configuration. When you configure object
authentication with AD or LDAP, the system displays the following error:
Cause
The system displays this error because the TLS certificate has expired.
Description
You can configure the system with Active Directory (AD) or Lightweight Directory Access Protocol (LDAP)
and TLS. When you configure the system this way:
• The TLS CACERT expires after configuration.
• The user is trying to run the keystone command.
The system displays the following error:
Note:
The log files for this error can be viewed in /var/log/keystone/keystone.log.
Cause
The system displays this error because the TLS CACERT certificate has expired.
Proposed workaround
1. Obtain the updated TLS CACERT certificate on the system.
2. Rerun the object authentication command.
Note:
If you run the following command while doing the workaround steps, you might lose existing data:
--idmapdelete
Description
You can configure the system with AD or LDAP by using TLS. If the certificate on AD or LDAP expires, the
system displays the following error when the user is trying to run the Keystone commands:
Cause
The system displays this error because the TLS certificate on the LDAP server has expired.
Proposed workaround
Update the TLS certificate on the LDAP server.
Description
When object authentication is configured with SSL and you try to run the authentication commands, the
system displays the following error:
Cause
The system displays this error because the SSL certificate has expired. The user may have used the same
certificate earlier for keystone configuration, but now the certificate has expired.
Proposed workaround
1. Remove the object authentication.
2. Reconfigure the authentication with the new SSL certificate.
Note:
Do not use the --idmapdelete option while you remove and reconfigure the authentication.
Description
When the authentication type is Active Directory (AD) or Lightweight Directory Access Protocol (LDAP),
users are not listed in the OpenStack user list.
Cause
The system displays this error under one or both of the following conditions:
• Only the users under the specified user distinguished name (DN) are visible to Keystone.
• The users do not have the specified object class.
The error code signature does not match when using the S3 protocol
Refer to the following troubleshooting references and steps for resolving system errors when the error
code signature does not match.
Description
When there is an error code signature mismatch, the system displays the following error:
<?xml version="1.0" encoding="UTF-8"?><Error> <Code>SignatureDoesNotMatch</
Code> <Message>The request signature we calculated does not match the
signature you provided. Check your key and signing method.</Message>
<RequestId>tx48ae6acd398044b5b1ebd-005637c767</RequestId></Error>
Cause
The system displays this error when the specified user ID does not exist and the user ID does not have
the defined credentials or has not assigned a role to the account.
Proposed workaround
• For role assignments, review the output of these commands to identify the role assignment for the
affected user:
– openstack user list
– openstack role assignment list
– openstack role list
– openstack project list
• For credential issues, review the credentials assigned to that user ID:
– openstack credential list
– openstack credential show <ID>
Cause
The default setting of the daemons_use_tty SELinux Boolean prevents the output of the
swift-object-info command from being displayed.
Proposed workaround
Allow daemons to use TTY (teletypewriter) by running the following command:
setsebool -P daemons_use_tty 1
Swift PUT returns the 202 error and S3 PUT returns the 500 error due to the
missing time synchronization
Refer to the following troubleshooting references and steps for resolving system errors when Swift PUT
returns the 202 error and S3 PUT returns the 500 error due to the missing time synchronization.
Description
The swift object servers require monotonically increasing timestamps on the PUT requests. If the time
between all the nodes is not synchronized, the PUT request can be rejected, resulting in the object server
returning a 409 status code that is turned into 202 in the proxy-server. When the s3api middleware
receives the 202 code, it returns a 500 to the client. When enabling DEBUG logging, the system displays
the following message:
From the object server:
Feb 9 14:41:09 prt001st001 object-server: 10.0.5.6
- - [09/Feb/2016:21:41:09 +0000] "PUT /z1device119/14886/
AUTH_bfd953e691c4481d8fa0249173870a56/mycontainers12/myobjects407"
From the proxy server:
Feb 9 14:14:10 prt003st001 proxy-server: Object PUT returning
202 for 409: 1455052450.83619 <= '409 (1455052458.12105)' (txn:
txf7611c330872416aabcc1-0056ba56a2) (client_ip:
If S3 is used, the following error is displayed from Swift3:
Feb 9 14:25:52 prt005st001 proxy-server: 500 Internal Server Error:
#012Traceback (most recent call last):#012 File "/usr/lib/python2.7/
site-packages/swift3/middleware.py", line 81, in __call__#012 resp =
self.handle_request(req)#012 File "/usr/lib/python2.7/site-packages/swift3/
middleware.py", line 104, in handle_request#012 res = getattr(controller,
req.method)(req)#012 File "/usr/lib/python2.7/site-packages/swift3/controllers/
obj.py", line 97, in PUT#012 resp = req.get_response(self.app)#012
File "/usr/lib/python2.7/site-packages/swift3/request.py", line 825, in
get_response#012 headers, body, query)#012 File "/usr/lib/python2.7/site-
packages/swift3/request.py", line 805, in get_acl_response#012 app,
method, container, obj, headers, body, query)#012 File "/usr/lib/python2.7/
site-packages/swift3/request.py", line 669, in _get_response#012 raise
InternalError('unexpected status code %d' % status)#012InternalError: 500
Internal Server Error (txn: tx40d4ff7ca5b94b1bb6881-0056ba5960) (client_ip:
10.0.5.1) Feb 9 14:25:52 prt005st001 proxy-server: 500 Internal
Server Error: #012Traceback (most recent call last):#012 File "/usr/lib/
python2.7/site-packages/swift3/middleware.py", line 81, in __call__#012 resp
= self.handle_request(req)#012 File "/usr/lib/python2.7/site-packages/swift3/
middleware.py", line 104, in handle_request#012 res = getattr(controller,
req.method)(req)#012 File "/usr/lib/python2.7/site-packages/swift3/controllers/
obj.py", line 97, in PUT#012 resp = req.get_response(self.app)#012
File "/usr/lib/python2.7/site-packages/swift3/request.py", line 825, in
get_response#012 headers, body, query)#012 File "/usr/lib/python2.7/site-
packages/swift3/request.py", line 805, in get_acl_response#012 app,
method, container, obj, headers, body, query)#012 File "/usr/lib/python2.7/
site-packages/swift3/request.py", line 669, in _get_response#012 raise
InternalError('unexpected status code %d' % status)#012InternalError: 500
Cause
The system displays these errors when the time is not synchronized.
Proposed workaround
• To check whether this problem is occurring, run the following command:
mmdsh date
• Enable the NTPD service on all protocol nodes and synchronize the time from a Network Time Protocol
(NTP) server.
Description
The accurate container listing for a unified file or an object access-enabled container is not displayed on
the system.
Cause
This error occurs under one or both of the following conditions:
• The listing takes a long time to update and display because the ibmobjectizer interval is too long.
• Objectization is not supported for the files that you create on the file system.
Proposed workaround
Tune the ibmobjectizer interval configuration by running the following command to set up the
objectization interval:
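The following is a sketch of such a command, assuming the objectization_interval property in the
spectrum-scale-objectizer.conf CCR file; 2400 seconds correspond to the 40-minute interval that is
described next:

mmobj config change --ccrfile spectrum-scale-objectizer.conf \
--section DEFAULT --property objectization_interval --value 2400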
This command sets an interval of 40 minutes between the completion of an objectization cycle and the
start of the next cycle.
Description
When you enable object by using installation toolkit, the system displays the following error:
Proposed workaround
Run the spectrumscale config obj command with the mandatory arguments.
Description
When you configure the authentication by using the installation toolkit, the system displays the following
error:
2016-02-16 13:48:07,799 [ FATAL ] <nodename> failure whilst: Configuring object
authentication (SS98)
Cause
The system displays this error under one or both of the following conditions:
• Only the users under the specified user DN are visible to Keystone.
• The users do not have the specified object class.
Proposed workaround
You can change the object authentication or modify the AD or LDAP for anyone who has the specified
object class.
Description
When the user configures authentication by using installation toolkit, the system displays the following
error:
02-16 13:48:07,799 [ FATAL ] <nodename> failure whilst: Configuring object
authentication (SS98)
Cause
The system displays this error under one or all three of the following conditions:
• IBM Storage Scale for the object storage program is running.
• Parameters that are provided in the configuration.txt and authconfig.txt files are incorrect.
• The system is unable to connect to the authentication server with the given credentials or network
issues.
Proposed workaround
1. Shut down IBM Storage Scale for the object storage program before continuing.
2. Check the connectivity of protocol nodes with the authentication server by using valid credentials.
...
6027-435 [N] The file system descriptor quorum has been overridden.
6027-490 [N] The descriptor replica on disk gpfs23nsd has been excluded.
6027-490 [N] The descriptor replica on disk gpfs24nsd has been excluded.
...
For more information on node override, see the section on Quorum in the IBM Storage Scale: Concepts,
Planning, and Installation Guide.
For PPRC and FlashCopy®-based configurations, more problem determination information can be
collected from the ESS log file. Refer to this information and the appropriate ESS documentation when
you work with various types of disk subsystem-related failures. For instance, if users are unable to
perform a PPRC failover (or failback) task successfully or unable to generate a FlashCopy of a disk volume,
they should consult the subsystem log and the appropriate ESS documentation. For more information, see
the following topics:
• IBM TotalStorage™ Enterprise Storage Server® Web Interface User's Guide (publibfp.boulder.ibm.com/
epubs/pdf/f2bui05.pdf).
In such scenarios, check the network connectivity between the peer GPFS clusters and verify their
remote shell setup. This command requires full TCP/IP connectivity between the two sites, and all
nodes must be able to communicate by using ssh or rsh without the use of a password.
Problem identification
On the node, issue an Operating System command such as top or dstat to verify whether the system
level resource utilization is higher than 90%. The following example shows the sample output for the
dstat command:
# dstat 1 10
Problem identification
On the node, issue the mmdiag --waiters command to check whether any long waiters are present.
The following example shows long waiters that are contributed by the slow disk, dm-14:
0x7FF074003530 waiting 25.103752000 seconds, WritebehindWorkerThread: for I/O completion on disk dm-14
0x7FF088002580 waiting 30.025134000 seconds, WritebehindWorkerThread: for I/O completion on disk dm-14
# ifconfig ib0
Problem identification
Verify whether all the layers of the IBM Storage Scale cluster are sized properly to meet the necessary
performance requirements. The things to be considered in the IBM Storage Scale cluster include:
• The servers
• The network connectivity and the number of connections between the NSD client and servers
• The I/O connectivity and number of connections between the servers to the storage controller or
subsystem
• The storage controller
• The disk type and the number of disks in the storage subsystem
In addition, get the optimal values for the low-level system components used in the IBM Storage Scale
stack from the vendor, and verify whether these components are set to their optimal value. The low-level
components must be tuned according to the vendor specifications for better performance.
Problem identification
The mmlsconfig command can be used to display and verify the configuration values for an IBM Storage
Scale cluster.
Issue the mmdiag --config command on the newly added GPFS nodes to verify whether the
configuration parameter values for the new nodes are same as values for the existing nodes. If the
newly added nodes have special roles or higher capability, then the configuration values must be adjusted
accordingly.
Certain applications like SAS benefit from a larger GPFS page pool. The GPFS page pool is used to cache
user file data and file system metadata. The default size of the GPFS page pool is 1 GiB in GPFS version
3.5 and higher. For SAS application, a minimum of 4 GiB page pool size is recommended. When new SAS
application nodes are added to the IBM Storage Scale cluster, ensure that the pagepool attribute is set
to at least 4 GiB. If left to its default value, the pagepool attribute is set to 1 GiB. This negatively impacts
the application performance.
[c25m3n03-ib,c25m3n04-ib]
pagepool 2G
If you add new application nodes c25m3n05-ib and c25m3n06-ib to the cluster, the pagepool
attribute and other GPFS parameter values for the new node must be set according to the corresponding
parameter values for the existing nodes c25m3n03-ib and c25m3n04-ib. Therefore, the pagepool
attribute on these new nodes must also be set to 2G by using the mmchconfig command.
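For example, for the new nodes in this sample:
mmchconfig pagepool=2G -N c25m3n05-ib,c25m3n06-ib -i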
Note: The -i option specifies that the changes take effect immediately and are permanent. This option is
valid only for certain attributes. For more information, see the mmchconfig command in the IBM Storage
Scale: Command and Programming Reference Guide.
Issue the mmlsconfig command to verify whether all the nodes have similar values. The following
sample output shows that all the nodes have pagepool attribute set to 2G:
[c25m3n03-ib,c25m3n04-ib,c25m3n05-ib,c25m3n06-ib]
pagepool 2G
Note: If the pagepool attribute is set to a custom value (2G for this example), then the pagepool
attribute value is listed when you issue the mmlsconfig command. If the pagepool attribute is set to a
default value (1G) then this will be listed when you issue the mmlsconfig pagepool command.
On the new node, issue the mmdiag --config command to verify that the new values are in effect. The
sample output displays that the pagepool attribute value has been effectively set to 2G for the nodes
c25m3n03-ib, c25m3n04-ib,c25m3n05-ib, c25m3n06-ib:
! pagepool 2147483648
Note: The exclamation mark (!) in front of the parameter denotes that the value of this parameter was set
by the user and is not the default value for the parameter.
Problem identification
On the GPFS node, issue the mmlsqos <fs> command and check the other and maintenance class
settings. In the sample output below, the maintenance class IOPS for datapool1 storage pool is set
to 200 IOPS, and the other class IOPS for datapool2 storage pool is set to 400 IOPS. This IOPS value
might be low for an environment with high performing storage subsystem.
# mmlsqos gpfs1b
Problem identification
Issue the mmlsnsd command, and verify that the primary NSD server allocated to a file system is evenly
distributed.
Note: The primary server is the first server listed under the NSD server column for a particular file
system.
On the NSD client, issue the mmlsdisk <fs> -m command to ensure that the NSD client I/O is
distributed evenly across all the NSD servers.
In the following sample output, all the NSDs are assigned to the same primary server c80f1m5n03ib0.
# mmlsnsd
In this case, all the NSD client I/O for the gpfs2 file system are processed by the single NSD server
c80f1m5n03ib0, instead of being equally distributed across both the NSD servers c80f1m5n02ib0 and
c80f1m5n03ib0. This can be verified by issuing the mmlsdisk <fs> -m command on the NSD client,
as shown in the following sample output:
# mmlsdisk gpfs2 -m
The NSD client I/O is also evenly distributed across the two NSD servers, as seen in the following sample
output:
# mmlsdisk gpfs2 -m
Problem identification
Issue the mmlsfs command to verify the block allocation type that is in effect on the smaller and larger
setup file system.
In the sample output below, the Block allocation type for the gpfs2 file system is set to scatter.
Problem identification
In IBM Storage Scale, the system-defined node class “nsdnodes” contains all the NSD server nodes in
the IBM Storage Scale cluster. Issue the mmgetstate -N nsdnodes command to verify the state of the
GPFS daemon. The GPFS file system performance might degrade if one or more NSD servers are in the
down or arbitrating or unknown state.
The following example displays two nodes: one in active state and the other in down state
Problem identification
The mmlsnsd command displays information about the currently defined disks in a cluster. In the
following sample output, the NSD client is configured to perform file system I/O on the primary NSD
server c25m3n07-ib for odd-numbered NSDs like DMD_NSD01, DMD_NSD03. In this case, c25m3n08-ib
acts as a secondary server.
The NSD client is configured to perform file system I/O on the NSD server c25m3n08-ib for even-
numbered NSDs like DMD_NSD02,DMD_NSD04. In this case, c25m3n08-ib is the primary server, while
c25m3n07-ib acts as the secondary server.
Issue the mmlsnsd command to display the NSD server information for the disks in a file system. The
following sample output shows the various disks in the gpfs1b file system and the NSD servers that are
supposed to act as primary and secondary servers for these disks.
# mmlsnsd
However, the mmlsdisk <fsdevice> -m command that is issued on the NSD client indicates that the
NSD client is currently performing all the file system I/O on a single NSD server, c25m3n07-ib.
NSD-Name Primary-NSD-Server
DMD_NSD01 c25m3n07-ib
DMD_NSD02 c25m3n08-ib
DMD_NSD03 c25m3n07-ib
DMD_NSD04 c25m3n08-ib
DMD_NSD05 c25m3n07-ib
DMD_NSD06 c25m3n08-ib
DMD_NSD07 c25m3n07-ib
DMD_NSD08 c25m3n08-ib
DMD_NSD09 c25m3n07-ib
DMD_NSD10 c25m3n08-ib
Problem identification
On the GPFS node, issue the mmdf <fs> command to determine the available space.
# mmdf gpfs1b
                =============        ====================   ===================
(total)         17560944640          17323003904 ( 99%)     157504 ( 0%)
Inode Information
-----------------
Number of used inodes: 4048
Number of free inodes: 497712
Number of allocated inodes: 501760
Maximum number of inodes: 17149440
The UNIX command df also can be used to determine the use percentage (Use%) of a file system. The
following sample output displays a file system with 2% capacity used.
# df -h
# mmdf gpfs1b
                =============        ====================   ===================
(total)         17560944640          17340739584 ( 99%)     128832 ( 0%)
Inode Information
-----------------
Number of used inodes: 4075
Number of free inodes: 497685
Number of allocated inodes: 501760
Maximum number of inodes: 17149440
CAUTION: Exercise extreme caution when you delete files. Ensure that the files are no longer
required for any purpose or are backed up before you delete them.
Problem identification
Issue the mmlsconfig | grep verbsRdma command to verify whether VERBS RDMA is enabled on the
IBM Storage Scale cluster.
# mmlsconfig | grep verbsRdma
verbsRdma enable
If VERBS RDMA is enabled, check whether the status of VERBS RDMA on a node is Started by running
the mmfsadm test verbs status command.
# mmfsadm test verbs status
Issue the mmfsadm test verbs conn command to verify whether the NSD client node is
communicating with all the NSD servers that use VERBS RDMA. In the following sample output, the
NSD client node has VERBS RDMA communication active on only one of the two NSD servers.
# mmfsadm test verbs conn
Problem resolution
Resolve any low-level InfiniBand RDMA issue like loose InfiniBand cables or InfiniBand fabric issues.
When the low-level RDMA issues are resolved, issue system commands like ibstat or ibv_devinfo to
verify whether the InfiniBand port state is active. The following system output displays the output
for an ibstat command issued. In the sample output, the port state for Port 1 is Active, while that for
Port 2 is Down.
# ibstat
CA 'mlx5_0'
CA type: MT4113
Number of ports: 2
Firmware version: 10.100.6440
Hardware version: 0
Node GUID: 0xe41d2d03001fa210
System image GUID: 0xe41d2d03001fa210
Port 1:
State: Active
Physical state: LinkUp
Rate: 56
Base lid: 29
LMC: 0
SM lid: 1
Capability mask: 0x26516848
Port GUID: 0xe41d2d03001fa210
Link layer: InfiniBand
Port 2:
State: Down
Physical state: Disabled
Rate: 10
Base lid: 65535
LMC: 0
SM lid: 0
Capability mask: 0x26516848
Port GUID: 0xe41d2d03001fa218
Link layer: InfiniBand
Restart GPFS on the node and check whether the status of VERBS RDMA on a node is Started by running
the mmfsadm test verbs status command.
In the following sample output, the NSD client (c25m3n03-ib) and the two NSD servers all show VERBS
RDMA status as started.
Perform a large I/O activity on the NSD client, and issue the mmfsadm test verbs conn command to
verify whether the NSD client node is communicating with all the NSD servers that use VERBS RDMA.
In the following sample output, the NSD client node has VERBS RDMA communication active on all the
active NSD servers.
# mmfsadm test verbs conn
Problem identification
Check the GPFS log file /var/adm/ras/mmfs.log.latest on the file system manager node (identified by
using the mmlsmgr command) to verify whether any GPFS maintenance operations are in progress.
The following sample output shows that the mmrestripefs operation was initiated on Jan 19 at
14:32:41, and the operation was successfully completed at 14:45:42. The I/O performance of the
application is impacted during this time frame due to the execution of the mmrestripefs command.
# mmlsqos gpfs1a
QOS config:: disabled
QOS status:: throttling inactive, monitoring inactive
You can use the mmchqos command to allocate appropriate maintenance IOPS to the IBM Storage Scale
system. For example, consider that the storage system has 100 K IOPS. If you want to allocate 1000
IOPS to the long running GPFS maintenance operations for the system storage pool, use the mmchqos
command to enable the QoS feature, and allocate the IOPS as shown:
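The following is a sketch of such an allocation for the gpfs1a file system; the IOPS values are examples
and must be adjusted to your storage system:

mmchqos gpfs1a --enable pool=system,maintenance=1000IOPS,other=unlimited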
Verify the QoS setting and values on a file system by using the mmlsqos command.
# mmlsqos gpfs1a
QOS config:: enabled --
Note: Allocating a small share of IOPS, for example 1000 IOPS, to the long running GPFS maintenance
operations can increase the maintenance command execution times. So, depending on the operation's
needs, the IOPS assigned to the ‘other’ and ‘maintenance’ class must be adjusted by using the mmchqos
command. This balances the application as well as the I/O requirements for the GPFS maintenance
operation.
For more information on setting the QoS for I/O operations, see the mmlsqos command section in the
IBM Storage Scale: Command and Programming Reference Guide and Setting the Quality of Service for I/O
operations (QoS) section in the IBM Storage Scale: Administration Guide.
Problem identification
Check the GPFS log file /var/adm/ras/mmfs.log.latest on the file system manager node (identified by
using the mmlsmgr command) to verify whether any GPFS maintenance operations are being invoked
frequently by a cron job or other cluster management software, such as Nagios.
In the sample output below, the mmdf command is being invoked periodically every 3-4 seconds.
Problem identification
Issue the mmlsconfig command and verify whether GPFS tracing is configured. The following sample
output displays a cluster in which tracing is configured:
# mmlsconfig | grep trace
trace all 4 tm 2 thread 1 mutex 1 vnode 2 ksvfs 3 klockl 2 io 3 pgalloc 1 mb 1 lock 2 fsck 3
tracedevOverwriteBufferSize 1073741824
tracedevWriteMode overwrite 268435456
Issue the # ps -aux | grep lxtrace | grep mmfs command to determine whether GPFS tracing
process is running on a node. The following sample output shows that GPFS tracing process is running on
the node:
# ps -aux | grep lxtrace | grep mmfs
Problem identification
When file system replication is enabled and set to 2, the effective write performance becomes 50% of the
raw write performance, because every write operation results in two internal write operations due to
replication. Similarly, when file system replication is enabled and set to 3, the effective write performance
becomes approximately 33% of the raw write performance, because every write operation results in
three internal write operations.
Issue the mmlsfs command and verify the default number of metadata and data replicas enabled on the
file system. In the following sample output the metadata and data replication on the file system is set to
2:
Issue the mmlsattr command to check whether replication is enabled at the file level:
# mmlsattr -L largefile.foo | grep replication
rule 'non-replicate-log-files' SET POOL 'SNCdata' REPLICATE (1) where lower(NAME) like
'%.log'
rule 'default' SET POOL 'SNCdata'
2. Install the placement policy on the file system by using the following command:
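For example, assuming that the rules are saved in a file named policyfile:
mmchpolicy <fs> policyfile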
Note: You can test the placement policy before installing it by using the following command:
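For example:
mmchpolicy <fs> policyfile -I test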
3. Remount the file system on all the nodes for the policy to take effect, by using the following
commands:
• mmumount <fs> -N all
• mmmount <fs> -N all
4. Issue the mmlspolicy <fs> -L command to verify whether the output is as shown:
rule 'non-replicate-log-files' SET POOL 'SNCdata' REPLICATE (1) where lower(NAME) like
'%.log'
rule 'default' SET POOL 'SNCdata'
Problem identification
Updating a file that has a snapshot might create unnecessary load on a system because each application
update or write operation goes through the following steps:
1. Read the original data block pertaining to the file region that must be updated.
2. Write the data block read in the step 1 above to the corresponding snapshot location.
3. Perform the application write or update operation on the desired file region.
Inode Information
-----------------
Number of used inodes: 4244
Number of free inodes: 157036
Number of allocated inodes: 161280
Maximum number of inodes: 512000
GPFS operations that involve allocation of data and metadata blocks (that is, file creation and writes)
will slow down significantly if the number of free blocks drops below 5% of the total number. Free
up some space by deleting some files or snapshots (keeping in mind that deleting a file will not
necessarily result in any disk space being freed up when snapshots are present). Another possible
cause of a performance loss is the lack of free inodes. Issue the mmchfs command to increase the
number of inodes for the file system so that at least 5% are free (see the example command after the
following messages). If the file system is approaching these limits, you may notice the following error
messages:
6027-533 [W]
Inode space inodeSpace in file system fileSystem is approaching the limit for the maximum number
of inodes.
operating system error log entry
Jul 19 12:51:49 node1 mmfs: Error=MMFS_SYSTEM_WARNING, ID=0x4DC797C6, Tag=3690419:
File system warning. Volume fs1. Reason: File system fs1 is approaching the limit for the maximum
number of inodes/files.
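For example, to raise the inode limit (a sketch; choose a limit that leaves at least 5% of the inodes free):
mmchfs <fs> --inode-limit <number>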
2. If automated deadlock detection and deadlock data collection are enabled, look in the latest GPFS log
file to determine if the system detected the deadlock and collected the appropriate debug data. Look
in /var/adm/ras/mmfs.log.latest for messages similar to the following:
Thu Feb 13 14:58:09.524 2014: [A] Deadlock detected: 2014-02-13 14:52:59: waiting 309.888 seconds on node
p7fbn12: SyncHandlerThread 65327: on LkObjCondvar, reason 'waiting for RO lock'
Thu Feb 13 14:58:09.525 2014: [I] Forwarding debug data collection request to cluster manager p7fbn11 of
cluster cluster1.gpfs.net
Thu Feb 13 14:58:09.524 2014: [I] Calling User Exit Script gpfsDebugDataCollection: event
deadlockDebugData,
Async command /usr/lpp/mmfs/bin/mmcommon.
Thu Feb 13 14:58:10.625 2014: [N] sdrServ: Received deadlock notification from 192.168.117.21
Thu Feb 13 14:58:10.626 2014: [N] GPFS will attempt to collect debug data on this node.
mmtrace: move /tmp/mmfs/lxtrace.trc.p7fbn12.recycle.cpu0
/tmp/mmfs/trcfile.140213.14.58.10.deadlock.p7fbn12.recycle.cpu0
mmtrace: formatting /tmp/mmfs/trcfile.140213.14.58.10.deadlock.p7fbn12.recycle to
/tmp/mmfs/trcrpt.140213.14.58.10.deadlock.p7fbn12.gz
This example shows that deadlock debug data was automatically collected in /tmp/mmfs. If deadlock
debug data was not automatically collected, it would need to be manually collected.
To determine which nodes have the longest waiting threads, issue this command on each node:
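For example (a typical invocation):
/usr/lpp/mmfs/bin/mmdiag --waiters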
For all nodes that have threads waiting longer than waitTimeInSeconds seconds, issue:
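For example (note the cautions that follow):
/usr/lpp/mmfs/bin/mmfsadm dump all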
Notes:
a. Each node can potentially dump more than 200 MB of data.
b. Run the mmfsadm dump all command only on nodes that you are sure the threads are really
hung. An mmfsadm dump all command can follow pointers that are changing and cause the node
to crash.
3. If the deadlock situation cannot be corrected, follow the instructions in “Additional information to
collect for delays and deadlocks” on page 556, then contact the IBM Support Center.
gpfs.snap -N GUI_MGMT_SERVERS
Collecting logs and dumps through the gpfs.snap command also collects the GPFS logs. So, manually
getting the logs from the folder /var/log/cnlog/mgtsrv is quicker and provides only the data that is
required to search for the details of the GUI issue.
3. Issue the following command to check that the installed packages do not have any conflict in versions.
For example,
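On an RPM-based system, a typical invocation is:
rpm -qa | grep postgres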
If there is a version conflict in the installed packages, the output displays the details as shown in the
following example:
postgresql13-contrib-13.6-5.25.1.s390x
postgresql13-server-13.6-5.25.1.s390x
postgresql-server-14-10.6.2.noarch
postgresql13-13.6-5.25.1.s390x
postgresql-contrib-14-10.6.2.noarch
postgresql-14-10.6.2.noarch
postgresql13-contrib-13.6-5.25.1.s390x
postgresql13-server-13.6-5.25.1.s390x
postgresql-server-13-8.30.noarch
postgresql13-13.6-5.25.1.s390x
postgresql-contrib-13-8.30.noarch
postgresql-13-8.30.noarch
4. If the GUI service failed to validate the checkpoint record then issue the following command to reset
the transaction logs.
# pg_resetwal -f /usr/local/var/postgres/
Note: If the pg_resetwal: error: cannot be executed by "root" error occurs then issue
the command in sudo mode. For example,
su postgres
# rm -rf /var/lib/pgsql/data
# su postgres -c 'initdb -D /var/lib/pgsql/data'
# systemctl start postgresql
6. Issue the following command to check if the service status is in running state.
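For example:
systemctl status gpfsgui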
Job for gpfsgui.service failed because the control process exited with error code
When you upgrade from Ubuntu 16.x to Ubuntu 18.x, the PostgreSQL database server might be running
with cluster versions (9.x and 10.x) with the default set to version 9.x. In this scenario, after manual
installation or upgrade of the IBM Storage Scale management GUI, the GUI restarts once before it starts
to run successfully. Systemd reports a startup error of the gpfsgui.service unit. The IBM Storage
Scale GUI clears the database and creates it on the PostgreSQL 10.x instance. This error message can be
ignored.
There can be more lines in the output as given in the following example. The GUI does a self-check on
443 and is automatically redirected to 47443:
Note:
• The IBM Storage Scale GUI WebSphere® Java process no longer runs as root but as a user named
scalemgmt. The GUI process now runs on port 47443 and 47080 and uses iptables rules to forward
port 443 to 47443 and 80 to 47080.
• The port 4444 is used by the GUI CLI to interact with the GUI back-end service. Other ports that are
listed here are used by Java internally.
Note: After migrating from release 4.2.0.x or later to 4.2.1 or later, you might see the pmcollector service
critical error on GUI nodes. In this case, restart the pmcollector service by running the systemctl
restart pmcollector command on all GUI nodes.
SwiftProxy
MCStoreGPFSStats
Transparent Cloud Tiering MCStoreIcstoreStats Cloud gateway nodes
MCStoreLWEStats
DiskFree All nodes
Capacity GPFSFilesetQuota Only a single node
GPFSDiskCap Only a single node
The IBM Storage Scale GUI lists all sensors in the Services > Performance Monitoring > Sensors page.
You can use this view to enable sensors and set appropriate periods and restrictions for sensors. If the
configured values are different from recommended values, such sensors are highlighted with a warning
symbol.
You can query the data displayed in the performance charts through CLI as well. For more information
on how to query performance data displayed in GUI, see “Querying performance data shown in the GUI
through CLI” on page 167.
NTP failure
The performance monitoring fails if the clock is not properly synchronized in the cluster. Issue the ntpq
-c peers command to verify the NTP state.
This can occur when a large number of keys are unnecessarily collected. You can check the total
number of keys by issuing the following command:
To resolve, you need to delete the obsolete or expired keys by issuing the following command.
If there are a large number of keys that are queued for deletion, the command may fail to respond. As an
alternative, issue the following command.
The command waits up to one hour for the processing to complete. If you use Docker or a similar
technology that creates short-lived network devices or mount points, those entities can be ignored by
using a filter.
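The following is a sketch of such a filter, assuming the DiskFree sensor and Docker mount points; the
sensor name and the regular expression are examples and must be adapted:
mmperfmon config update DiskFree.filter='mountPoint=/var/lib/docker.*'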
Related concepts
“Performance monitoring using IBM Storage Scale GUI” on page 158
The IBM Storage Scale GUI provides a graphical representation of the status and historical trends of the
key performance indicators. The manner in which information is displayed on the GUI helps users to
make quick and effective decisions.
“Performance issues” on page 479
The performance issues might occur because of the system components or configuration or maintenance
issues.
/usr/lpp/mmfs/gui/cli/runtask <task_name>
Note: Many file system-related tasks require the corresponding file system to be mounted on the GUI
node to collect data.
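For example, to run the refresh task that collects health events (HEALTH_STATES is one of the task names from the refresh task table that follows):
/usr/lpp/mmfs/gui/cli/runtask HEALTH_STATES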
For some stale GUI events, to complete the recovery procedure you must run the following commands on
every GUI node that is configured for the cluster:
1. Run this command: systemctl restart gpfsgui
Refresh task | Frequency | Collected information | Prerequisite: file system must be mounted | Invoked by event | CLI commands used
AFM_FILESET_STATE | 60 | The AFM fileset status | Yes | Any event for component AFM | mmafmctl getstate -Y
AFM_NODE_MAPPING | 720 | The AFM target map definitions | No | On execution of mmafmconfig | mmafmconfig show -Y
CES_SERVICE_STATE | 1h | CES service state in Monitoring > Nodes page | Yes | | mmces service list -N cesNodes -Y
CONNECTION_STATUS | 10 min | Connections status in Monitoring > Nodes page | | | Nodes reachable through SSH
DISKS | 1h | NSD list in Monitoring > NSDs | Yes | | mmsdrquery, mmlsnsd, and mmlsdisk
FILESYSTEMS | 1h | List of file systems in Files > File Systems | Yes | Yes | mmsdrquery, mmlsfs, and mmlssnapdir
HEALTH_STATES | 10 min | Health events in Monitoring > Events | Yes | | mmhealth node show {component} -v -N {nodes} -Y
MASTER_GUI_ELECTION | 1m | Checks if all GUIs in the cluster are running and elects a new master GUI if needed | No | | HTTP call to other GUIs
NFS_SERVICE | 1h | NFS settings in Settings > NFS Service | Yes | | mmces service list and mmcesnfslscfg
QUOTA | 2:15 AM | Quotas in Files > Quota; fileset capacity in Monitoring > Capacity | Yes | Yes | mmrepquota and mmlsdefaultquota
REMOTE_HEALTH_STATES | 15 m | The health states of remote clusters | No | | REST API call to remote GUIs
STORAGE_POOL | 1h | Pool properties in Files > File Systems | Yes | | mmlspool <device> all -L -Y
Table 64. Troubleshooting details for capacity data display issues in GUI
GUI page | Solution
Files > File Systems and Storage > Pools | Verify whether the GPFSPool sensor is enabled on at least one node and ensure that the file system is mounted on this node. The health subsystem might have enabled this sensor already. The default period for the GPFSPool sensor is 300 seconds (5 minutes).
Files > Filesets does not display fileset capacity details | In this case, the quota is not enabled for the file system that hosts this fileset. Go to the Files > Quotas page and enable quotas for the corresponding file system. By default, quotas are disabled for all file systems.
GUI automatically logs off the users when using Google Chrome or
Mozilla Firefox
If the GUI is accessed through the Google Chrome or Mozilla Firefox browser and the tab is in the
background for more than 30 minutes, the user gets logged out of the GUI.
This issue is reported if no timeout is specified on the Services > GUI > Preferences page. If a timeout
was specified, the GUI session expires when there is no user activity for that period of time, regardless of
the active browser tab.
Note: This issue is reported on Google Chrome version 57 or later, and Mozilla Firefox version 58 or later.
Why is a fileset in the Unmounted or Disconnected state when parallel I/O is set up?
Filesets that are using a mapping target go to the Disconnected mode if the NFS server of the Primary
gateway is unreachable, even if the NFS servers of all participating gateways are reachable. The NFS
server of the Primary gateway must be checked to fix this problem.
How do I activate an inactive fileset?
The mmafmctl prefetch command without options, which procures prefetch statistics, activates an
inactive fileset.
How do I reactivate a fileset in the Dropped state?
The mmafmctl prefetch command without options, which procures prefetch statistics, activates a
fileset in the Dropped state.
How do I cleanly unmount the home file system if there are caches that use the GPFS protocol as the
backend?
To have a clean unmount of the home file system, the file system must first be unmounted on the cache
cluster where it is remotely mounted, and then the home file system must be unmounted. Unmounting
the remote file system from all nodes in the cluster might not be possible until the relevant cache
cluster is unlinked or the local file system is unmounted.
A force unmount, shutdown, or crash of the remote cluster results in a panic of the remote file system at
the cache cluster, and the queue is dropped. The next access to the fileset runs the recovery. However,
this should not affect the cache cluster.
What should be done if the df command hangs on the cache cluster?
On RHEL 7.0 or later, df does not support hidden NFS mounts. Because AFM uses regular NFS mounts
on the gateway nodes, this change causes commands like df to hang if the secondary gets disconnected.
The following workaround allows NFS mounts to continue to be hidden: remove the /etc/mtab symlink,
create a new /etc/mtab file, and copy /proc/mounts to the /etc/mtab file during startup (see the
sketch after this table). With this solution, the mtab file might go out of synchronization with
/proc/mounts.
Why are setuid or setgid bits in a single-writer cache reset at home after data is appended?
The setuid or setgid bits in a single-writer cache are reset at home after data is appended to files on
which those bits were previously set and synced. This is because over NFS, a write operation to a setuid
file resets the setuid bit.
How can I traverse a directory that is not cached?
On a fileset whose metadata in all subdirectories is not cached, any application that optimizes by
assuming that directories contain two fewer subdirectories than their hard link count does not traverse
the last subdirectory. One such example is find; on Linux, a workaround for this is to use find
-noleaf to correctly traverse a directory that has not been cached.
What extended attribute size is supported?
For an operating system in the gateway whose Linux kernel version is below 2.6.32, the NFS max rsize
is 32K, so AFM does not support an extended attribute size of more than 32K on that gateway.
What should I do when my file system or fileset is getting full?
The .ptrash directory is present in cache and home. In some cases, where there is a conflict that AFM
cannot resolve automatically, the file is moved to .ptrash at cache or home. In cache, the .ptrash
directory gets cleaned up when eviction is triggered. At home, it is not cleared automatically. When the
administrator is looking to clear some space, the .ptrash directory must be cleaned up first.
How do I restore an unmounted AFM fileset that uses the GPFS™ protocol as the backend?
If the NSD mount on the gateway node is unresponsive, AFM does not synchronize data with home. The
file system might be unmounted at the gateway node. A message AFM: Remote filesystem remotefs
is panicked due to unresponsive messages on fileset
<fileset_name>,re-mount the filesystem after it becomes
responsive. mmcommon preunmount invoked. File system:
fs1 Reason: SGPanic is written to mmfs.log. After the home is responsive, you must restore the
NSD mount on the gateway node (see the remount sketch after the AFM DR questions that follow).
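A minimal sketch of the mtab workaround described above, as it might be run from a startup script (the file paths are those named in the answer):
# rm /etc/mtab
# cp /proc/mounts /etc/mtab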
Why is a fileset in the Unmounted or Disconnected state when parallel I/O is set up?
Filesets that are using a mapping target go to the Disconnected mode if the NFS server of the MDS is
unreachable, even if the NFS servers of all participating gateways are reachable. The NFS server of the
MDS must be checked to fix this problem.
How do I cleanly unmount the secondary file system if there are caches that use the GPFS protocol as
the backend?
To have a clean unmount of the secondary file system, the file system should first be unmounted on
the primary cluster where it has been remotely mounted, and then the secondary file system should
be unmounted. It might not be possible to unmount the remote file system from all nodes in the cluster
until the relevant primary is unlinked or the local file system is unmounted.
A force unmount, shutdown, or crash of the remote cluster results in a panic of the remote file system at
the primary cluster, and the queue gets dropped; the next access to the fileset runs recovery. However,
this should not affect the primary cluster.
What does the NeedsResync state imply?
The NeedsResync state does not necessarily mean a problem. If this state occurs during a conversion
or recovery, the problem gets fixed automatically in the subsequent recovery. You can monitor the
mmafmctl $fsname getstate output to check whether its queue number is changing, and also check
the GPFS logs for any errors, such as unmounted.
Is there a single command to delete all RPO snapshots from a primary fileset?
No. All RPOs need to be manually deleted.
Suppose there are more than two RPO snapshots on the primary. Where did these snapshots come from?
Check the queue. Check if recovery happened in the recent past. The extra snapshots will get deleted
during subsequent RPO cycles.
How do I restore an unmounted AFM DR fileset that uses the GPFS™ protocol as the backend?
If the NSD mount on the gateway node is unresponsive, AFM DR does not synchronize data with the
secondary. The file system might be unmounted at the gateway node. A message AFM: Remote
filesystem remotefs is panicked due
to unresponsive messages on fileset
<fileset_name>,re-mount the filesystem
after it becomes responsive. mmcommon
preunmount invoked. File system: fs1
Reason: SGPanic is written to mmfs.log. After the secondary is responsive, you must restore the
NSD mount on the gateway node (see the remount sketch that follows).
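A minimal sketch of restoring such a mount on a gateway node after the home or secondary becomes responsive; the file system name remotefs is taken from the quoted message, and the node name gateway1 is a placeholder:
# mmmount remotefs -N gateway1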
Table 67. Common questions in AFM to cloud object storage with their resolution
Question Answer or Resolution
What can be done if the AFM to cloud object storage relationship is in the Unmounted state?
The relationship is in the Unmounted state because the buckets on the cloud object storage are not
accessible, or the gateway can connect to the endpoint but cannot see the buckets. Check that the
buckets have the correct configuration and keys.
Use the mmafmctl filesystem getstate command to check the fileset state (see the sketch after
this table).
The fileset state is transient. When the network issue or bucket issues are resolved, the state becomes
Active. You do not need to use any commands to change the state.
Why is an error message displayed while creating the AFM to cloud object storage relationship?
The No keys for bucket <Bucket_Name> is set for server <Server_name> error message
is displayed while creating the AFM to cloud object storage relationship. This error occurs because the
correct keys are not set for a bucket, or no keys are set. Set the keys correctly for the bucket before
creating the relationship.
Why are operations requeued on the gateway node?
An IBM Storage Scale cluster supports special characters, but when a cloud object storage does not
support object or bucket names with special characters, those operations are requeued. Each cloud
object storage provider has its own limitations on bucket and object names. For more information
about the limitations, see the documentation that is provided by the cloud object storage provider.
What must be done when the AFM to cloud object storage relationship is disconnected?
The primary gateway cannot connect to a cloud object storage endpoint. Check the endpoint
configuration and the network connection between the cluster and the cloud object storage.
What can be done when the fileset space limit is approaching?
Callbacks such as lowDiskSpace or noDiskSpace can be set to confirm that the fileset or file system
is approaching its space limit. Allocate more storage to the pools that are defined.
What can be done when messages in the queue are requeued because the cloud object storage is full?
When the provisioned space on a cloud object storage is full, messages can be requeued on a gateway
node. Provide more space on the cloud object storage and run the mmafmctl resumeRequeued
command so that the requeued messages are run again.
What can be done if there are waiters on the mmafmtransfer command but everything looks normal?
When objects are synchronized to a cloud object storage, mmafmtransfer command-related waiters
can be seen, especially for large objects, or when the application is creating or accessing multiple
objects at a fast pace.
A read seems to be stuck or in flight for a long time. What should be done?
Check the status of the fileset by using the mmafmctl getstate command to see whether the fileset
is in the Unmounted state. Check for network errors.
How does the ls command reclaim the inodes after metadata eviction?
When metadata is evicted, any operation that requires metadata, for example the ls command,
reclaims the metadata.
Migration/Recall failures
If a migration or recall fails, retry the policy or CLI command that failed, up to two times, after clearing
the condition that caused the failure. This works because the Transparent cloud tiering service is
idempotent.
Starting or stopping Transparent cloud tiering service fails with the Transparent
cloud tiering seems to be in startup phase message
This problem typically occurs if the Gateway service is killed manually by using the kill command,
without a graceful shutdown by using the mmcloudgateway service stop command.
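A minimal sketch of a graceful stop and restart of the service on one node; the node name tctnode1 is a placeholder:
# mmcloudgateway service stop -N tctnode1
# mmcloudgateway service start -N tctnode1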
Adding a cloud account to configure IBM Cloud Object Storage fails with the
following error: 56: Cloud Account Validation failed. Invalid credential for Cloud
Storage Provider. Details: Endpoint URL Validation Failed, invalid username or
password.
Ensure that the appropriate user role is set through IBM Cloud® Object Storage dsNet Manager GUI.
HTTP Error 401 Unauthorized exception while you configure a cloud account
This issue happens when the time between the object storage server and the Gateway node is not synced
up. Sync up the time with an NTP server and retry the operation.
Account creation command fails after a long wait and IBM Cloud Object Storage
displays an error message saying that the vault cannot be created; but the vault is
created
When you look at the IBM Cloud Object Storage manager UI, you see that the vault exists. This problem
can occur if Transparent cloud tiering does not receive a successful return code from IBM Cloud Object
Storage for the vault creation request.
The most common reason for this problem is that the threshold setting on the vault template is incorrect.
If you have 6 IBM Cloud Object Storage slicestors and the write threshold is 6, then IBM Cloud Object
Storage expects that all the slicestors are healthy. Check the IBM Cloud Object Storage manager UI. If any
slicestors are in a warning or error state, update the threshold of the vault template.
Account creation command fails with error MCSTG00065E, but the data vault and
the metadata vault exist
The full error message for this error is as follows:
But the data vault and the metadata vault are visible on the IBM Cloud Object Storage UI.
This error can occur if the metadata vault was created but its name index is disabled. To resolve this
problem, do one of the following actions:
• Enter the command again with a new vault name and vault template.
• Delete the vault on the IBM Cloud Object Storage UI and run the command again with the correct
--metadata-location.
Note: It is a good practice to disable the name index of the data vault. The name index of the metadata
vault must be enabled.
gpfs.snap: An Error was detected on node XYZ while invoking a request to collect
the snap file for Transparent cloud tiering: (return code: 137).
If the gpfs.snap command fails with this error, increase the value of the timeout parameter by using the
gpfs.snap --timeout Seconds option.
Note: If the Transparent cloud tiering log collection fails after the default timeout period expires, you can
increase the timeout value and collect the TCT logs. The default timeout is 300 seconds (or 5 minutes).
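For example, to double the default timeout (the value 600 is an illustrative choice):
# gpfs.snap --timeout 600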
Migration fails with error: MCSTG00008E: Unable to get fcntl lock on inode. Another
MCStore request is running against this inode.
This happens because some other application might have the file open while Cloud services are trying
to migrate it.
Connect: No route to host Cannot connect to the Transparent Cloud Tiering service.
Please check that the service is running and that it is reachable over the network.
Could not establish a connection to the MCStore server
If this error is observed during any data command, it is due to an abrupt shutdown of Cloud services
on one of the nodes. This happens when Cloud services is not stopped on a node explicitly by using the
mmcloudgateway service stop command, but the node loses power or the IBM Storage Scale
daemon is taken down. The node IP address is then still considered an active Cloud services node, and
the data commands that are routed to it fail with this error.
See Completing the upgrade to a new level of IBM Storage Scale in the IBM Storage Scale: Concepts,
Planning, and Installation Guide to get the cluster and file system to enable new functionality.
creating EC2 EIP: AddressLimitExceeded: The maximum number of addresses has been reached."}
iamInstanceProfile issue
Cluster creation fails with the error:
"Failed to connect to the host via ssh: Connection timed out during banner exchange"
4. When cloudkit is executed from hosts that are on premises and behind NAT, the IP address
displayed in the default option might result in connectivity problems. It is recommended to cross-
verify the host public IP during the inputs section.
• If an IBM Storage Scale related problem is suspected, collect data by running a gpfs.snap. Upload this
gpfs.snap to the IBM Storage Scale support ticket that is opened.
Failed to connect to the host via ssh: kex_exchange_identification: Connection closed by remote
host\", \"unreachable
waiting for EC2 Instance create: timeout while waiting for state to become 'running' (last
state: 'pending', timeout: 10m0s)
Fix: Verify if the instance has finished the boot-up. Cleanup this instance (irrespective of the state) and
retry the cluster creation.
KMS (Key Management Service) key issue
Cluster creation fails with the error when EBS encryption is selected:
Fix: Check if the user executing the cloudkit has read access to the provided KMS key.
"Failed to connect to the host via ssh: Connection timed out during banner exchange"
"Failed to connect to the host via ssh: Connection timed out during banner exchange"
"Cannot upgrade node 10.0.1.199 due to packages dependent on GPFS. If these are known
external dependencies, you can choose to override by setting the environment variable
\"SSFEATURE_OVERRIDE_EXT_PKG_DEPS=true\" environment variable. Instead if you would like to
continue an upgrade on all other nodes using the install toolkit, please remove this node from
the cluster definition via: spectrumscale node delete 10.0.1.199 and then re-run spectrumscale
upgrade. Otherwise, either remove the dependent packages manually or manually upgrade GPFS on
this node."}
Fix: Make sure to use a self-extracting package that matches the IBM Storage Scale edition deployed
on that cluster, and rerun the cloudkit create cluster command.
Error waiting to create Router: Error waiting for Creating Router: Quota 'ROUTERS' exceeded.
Limit: 20.0 globally.
4. If the problem still persists after rerunning the command, delete the existing cluster and create a
new cluster.
• If an IBM Storage Scale related problem is suspected, collect data by running a gpfs.snap. Upload this
gpfs.snap to the IBM Storage Scale support ticket that is opened.
• Plan your network infrastructure to ensure reliable communication between the installer node and the
cloud. In jump host-based connectivity, it might take a little longer for SSH to reach the node; if there
is a network drop, it is recommended to rerun the command.
Plug-in load errors
Cluster creation fails while the terraform plug-in is loading and the following error message appears:
Fix: Manually clean the cloud resources and contact IBM Support.
"Failed to connect to the host via ssh: Connection timed out during banner exchange"
"Failed to connect to the host via ssh: Connection timed out during banner exchange"
Start NSD
The Start NSD DMP assists in starting NSDs that are not functioning.
The following are the corresponding event details and the proposed solution:
• Event ID: disk_down
• Problem: The availability of an NSD is changed to “down”.
• Solution: Recover the NSD.
The DMP provides the option to start the NSDs that are not functioning. If multiple NSDs are down, you
can select whether to recover only one NSD or all of them.
/usr/lpp/mmfs/bin/mmstartup -N <Node>
/usr/lpp/mmfs/gui/bin-sudo/sync_node_time <nodeName>
For example,
Issue the mmperfmon config show command to verify whether the NFS sensor is configured properly.
Issue the mmperfmon config show command to verify whether the SMB sensor is configured properly.
The DMP issues the following command to mount the file system on several nodes if automatic mount
is not enabled:
The DMP issues the following command to mount the file system on certain nodes if automatic mount
is not enabled on those nodes:
Note: Only the users with Storage Administrator, System Administrator, Security Administrator, and
Administrator user roles can launch this DMP.
For example: /usr/lpp/mmfs/bin/mmchfileset r1_FS testFileset --inode-limit 2048
Component: gpfs. Events: gpfs_pagepool_small / gpfs_pagepool_ok
Condition: The actively used GPFS pagepool setting (mmdiag --config | grep pagepool) is lower
than or equal to 1 GB.
• To change the value and make it effective immediately, use the following command, where <value> is
a value higher than 1 GB:
mmchconfig pagepool=<value> -i
• To change the value and make it effective after the next GPFS recycle, use the following command:
mmchconfig pagepool=<value>
Component: AFM. Events: afm_sensors_inactive / afm_sensors_active
Check: Verify that the node has a gateway designation and a perfmon designation by using the
mmlscluster command.
Condition: The period for at least one of the following AFM sensors is set to 0: GPFSAFM, GPFSAFMFS,
GPFSAFMFSET.
• To change the period when the sensors are defined in the perfmon configuration file, use the following
command, where <sensor_name> is one of the AFM sensors GPFSAFM, GPFSAFMFS, or GPFSAFMFSET,
and <interval> is the time in seconds that the sensor waits before it gathers the sensor's metrics
again:
mmperfmon config update <sensor_name>.period=<interval>
• To change the period when the sensors are not defined in the perfmon configuration file, create a
sensors file with the following input and add it with the command that follows:
sensors = {
name = <sensor_name>
period = <interval>
type = "Generic"
}
mmperfmon config add --sensors <path_to_tmp_cfg_file>
Component: NFS. Events: nfs_sensors_inactive / nfs_sensors_active
Check: Verify that the node is NFS enabled and has a perfmon designation by using the mmlscluster
command.
Condition: The NFS sensor NFSIO has a period of 0.
• To change the period when the sensor is defined in the perfmon configuration file, use the following
command, where <sensor_name> is the NFS sensor NFSIO and <interval> is the time in seconds
that the sensor waits before it gathers the sensor's metrics again:
mmperfmon config update <sensor_name>.period=<interval>
• To change the period when the sensor is not defined in the perfmon configuration file, use the
mmperfmon config add --sensors command with a sensors file, as shown for the AFM sensors.
Component: SMB. Events: smb_sensors_inactive / smb_sensors_active
Check: Verify that the node is SMB enabled and has a perfmon designation by using the mmlscluster
command.
Condition: The period of at least one of the following SMB sensors is set to 0: SMBStats,
SMBGlobalStats.
• To change the period when the sensors are defined in the perfmon configuration file, use the following
command, where <sensor_name> is one of the SMB sensors SMBStats or SMBGlobalStats, and
<interval> is the time in seconds that the sensor waits before it gathers the sensor's metrics again:
mmperfmon config update <sensor_name>.period=<interval>
• To change the period when the sensors are not defined in the perfmon configuration file, use the
mmperfmon config add --sensors command with a sensors file, as shown for the AFM sensors.
Component: gpfs. Events: gpfs_maxfilestocache_small / gpfs_maxfilestocache_ok
Check: Verify that the node is in the cesNodes node class by using the mmlsnodeclass --all
command.
Condition: The actively used GPFS maxFilesToCache setting (mmdiag --config | grep
maxFilesToCache) has a value smaller than or equal to 100,000.
• To change the value, use the following command, where <value> is a value higher than 100,000:
mmchconfig maxFilesToCache=<value>; mmshutdown; mmstartup
• To ignore the event, use the following command:
mmhealth event hide gpfs_maxfilestocache_small
Component: gpfs. Events: gpfs_maxstatcache_high / gpfs_maxstatcache_ok
Check: Verify that the node is a Linux node.
Condition: The actively used GPFS maxStatCache value (mmdiag --config | grep
maxStatCache) is higher than 0.
• To change the value, use the following command:
mmchconfig maxStatCache=0; mmshutdown; mmstartup
• To ignore the event, use the following command:
mmhealth event hide gpfs_maxstatcache_high
Component: gpfs. Events: callhome_not_enabled / callhome_enabled
Check: Verify that the node is the cluster manager by using the mmlsmgr -c command.
Condition: Call home is not enabled on the cluster.
• To install call home, install the gpfs.callhome-ecc-client-{version-
number}.noarch.rpm package for the ECCClient on the potential call home nodes.
• To configure the call home packages that are installed but not configured:
1. Issue the mmcallhome capability enable command to initialize the configuration.
2. Issue the mmcallhome info change command to add personal information.
3. Issue the mmcallhome proxy command to include a proxy if needed.
4. Issue the mmcallhome group add or mmcallhome group auto command to create call home
groups.
• To enable call home once the call home package is installed and the groups are configured, issue the
mmcallhome capability enable command.
For information on tip events, see “Event type and monitoring status for system health” on page 18.
Note: Because the TIP state is checked only once every hour, it might take up to an hour for the change
to be reflected in the output of the mmhealth command.
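For example, to review which tip events are currently hidden on a node and to unhide one of them (the event name is taken from the preceding table):
# mmhealth event list hidden
# mmhealth event unhide gpfs_maxstatcache_high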
Automatic recovery
IBM Storage Scale recovers itself from certain issues without manual intervention.
The following automatic recovery options are available in the system:
• Failover of CES IP addresses to recover from node failures. That is, if any important service or protocol
service is broken on a node, the system changes the status of that node to Failed and moves the public
IPs to healthy nodes in the cluster.
A failover gets triggered due to the following conditions:
1. If the IBM Storage Scale monitoring service detects a critical problem in any of the CES components
such as NFS, SMB, or OBJ, then the CES state is set to FAILED and it triggers a failover.
2. If the IBM Storage Scale daemon detects a problem with the node or cluster such as expel node, or
quorum loss, then it runs callbacks and a failover is triggered.
3. The CES framework also triggers a failover during the distribution of IP addresses as specified in the
distribution policy.
• If there are any errors with the SMB and Object protocol services, the system restarts the corresponding
daemons. If restarting the protocol service daemons does not resolve the issue and the maximum retry
count is reached, the system changes the status of the node to Failed. The protocol service restarts are
logged in the event log. Issue the mmhealth node eventlog command to view the details of such
events.
If the system detects multiple problems simultaneously, then it starts the recovery procedure, such as
automatic restart, and addresses the issue of the highest priority event first. After the recovery actions
are completed for the highest priority event, the system health is monitored again and then the recovery
actions for the next priority event are started. Similarly, issues for all the events are handled based on
their priority.
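To review which recovery actions were logged, the event log can be limited to a recent time window, for example the last day:
# mmhealth node eventlog --day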
Upgrade recovery
Use this information to recover from a failed upgrade.
A failed upgrade might leave a cluster with multiple code levels in place. It is important to analyze
the console output to determine which nodes or components were upgraded before the failure and
which node or component was in the process of being upgraded when the failure occurred.
Once the problem has been isolated, a healthy cluster state must be achieved prior to continuing the
upgrade. Use the mmhealth command in addition to the mmces state show -a command to verify
that all services are up. It might be necessary to manually start services that were down when the
upgrade failed. Starting the services manually helps achieve a state in which all components are healthy
prior to continuing the upgrade.
For more information about verifying service status, see mmhealth command and mmces state show
command in IBM Storage Scale: Command and Programming Reference Guide.
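A minimal verification sketch that uses the two commands named above:
# mmhealth cluster show
# mmces state show -a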
# mmlscluster
The entire content of /var/mmfs/ is deleted on the node node-23 to simulate this case; the
mmgetstate command on the node to be recovered then returns the following output:
# mmgetstate
mmgetstate: This node does not belong to a GPFS cluster.
mmgetstate: Command failed. Examine previous error messages to determine cause.
# mmgetstate -a
Run the mmccr check command on the node to be recovered as shown in the following example:
In this case, you can recover this node by using the mmsdrrestore command with the -p option. The
-p option must specify a healthy quorum node from which the necessary files can be transferred. The
mmsdrrestore command must run on the node to be recovered as shown in the following example:
# mmsdrrestore -p node-21
genkeyData1
Immediately after the mmsdrrestore command completes, the mmgetstate command still reports
that the GPFS is down. However, you can start the GPFS now on the recovered node. The mmgetstate
command then shows GPFS as active as shown in the following example:
# mmgetstate
The output of the mmccr check command on the recovered node shows a healthy status, as shown in
the following example:
# mmgetstate -a
mmgetstate: [E] The command was unable to reach the CCR service on the majority of quorum
nodes to form CCR quorum. Ensure the CCR service (mmfsd or mmsdrserv daemon) is running on
all quorum nodes and the communication port is not blocked by a firewall.
mmgetstate: Command failed. Examine previous error messages to determine cause.
2. Issue the mmlscluster --noinit command to identify the quorum nodes in the cluster as shown in
the following example:
# mmlscluster --noinit
3. Issue the ping command to verify whether the lost quorum nodes are reachable:
# ping -c 1 node-22.localnet.com
PING node-22.localnet.com (10.0.100.22) 56(84) bytes of data.
From node-21.localnet.com (10.0.100.21) icmp_seq=1 Destination Host Unreachable
# ping -c 1 node-23.localnet.com
PING node-23.localnet.com (10.0.100.23) 56(84) bytes of data.
From node-21.localnet.com (10.0.100.21) icmp_seq=1 Destination Host Unreachable
4. Issue the mmccr check on the remaining quorum node to get the details of the missing quorum
nodes and a quorum loss (809) of the CCR server, which is running on the local node:
5. Issue the mmchnode command with the --force option to force the system to reduce the number
of quorum nodes to the still available quorum nodes. This command takes a while and expects a
confirmation to proceed.
The --force option forces GPFS to continue running normally by using the copy of the CCR state
found on the only remaining quorum node. Because CCR no longer has quorum, GPFS cannot verify
whether it is the most recent version of the CCR state. If the other two quorum nodes failed while a GPFS
After the command returned successfully, the cluster is back to a working state because CCR is able to
reach quorum without the quorum nodes that are no longer available. The failed nodes are still in the
list of cluster nodes.
6. Issue the mmdelnode command as shown in the following example to remove the failed nodes:
# mmlscluster
# mmgetstate -a
# mmlscluster
You can also use the mmhealth node show command instead of the mmlscluster --noinit
command to get the list of quorum nodes. The mmhealth node show command provides the status
of the IBM Storage Scale components as shown in the following example:
In addition, the mmhealth node show <COMPONENT> -v --unhealthy command lists more details
about the specified component. You can find the IP addresses of the unavailable quorum nodes in the
command output:
2. Issue the mmsdrrestore command with the --ccr-repair option to repair CCR. A sample output is
as follows:
# mmsdrrestore --ccr-repair
mmsdrrestore: Checking CCR on all quorum nodes ...
mmsdrrestore: Invoking CCR restore in dry run mode ...
ccrrestore: +++ DRY RUN: CCR state on quorum nodes will not be restored +++
ccrrestore: 1/8: Test tool chain successful
ccrrestore: 2/8: Setup local working directories successful
ccrrestore: 3/8: Copy Paxos state files from quorum nodes successful
ccrrestore: 4/8: Getting most recent Paxos state file successful
ccrrestore: 5/8: Get cksum of files in committed directory successful
ccrrestore: 6/8: WARNING: Intact ccr.nodes file with version 5 missing in committed
directory
ccrrestore: 6/8: INFORMATION: Intact ccr.disks found (file id: 2 version: 1)
ccrrestore: 6/8: INFORMATION: Intact mmLockFileDB found (file id: 3 version: 1)
ccrrestore: 6/8: INFORMATION: Intact genKeyData found (file id: 4 version: 1)
ccrrestore: 6/8: INFORMATION: Intact genKeyDataNew found (file id: 5 version: 2)
ccrrestore: 6/8: INFORMATION: Intact mmsdrfs found (file id: 6 version: 23)
ccrrestore: 6/8: INFORMATION: Intact mmsysmon.json found (file id: 7 version: 1)
ccrrestore: 6/8: Parsing committed file list successful
ccrrestore: 7/8: Pulling committed files from quorum nodes successful
ccrrestore: 8/8: File name: 'ccr.nodes' file state: UPDATED remark: 'OLD (v5,
((n1,e6),103), f20ea9e3)'
ccrrestore: 8/8: File name: 'ccr.disks' file state: MATCHING remark: 'none'
ccrrestore: 8/8: File name: 'mmLockFileDB' file state: MATCHING remark: 'none'
ccrrestore: 8/8: File name: 'genKeyData' file state: MATCHING remark: 'none'
ccrrestore: 8/8: File name: 'genKeyDataNew' file state: MATCHING remark: 'none'
ccrrestore: 8/8: File name: 'mmsdrfs' file state: MATCHING remark: 'none'
ccrrestore: 8/8: File name: 'mmsysmon.json' file state: MATCHING remark: 'none'
ccrrestore: 8/8: Patching Paxos state successful
mmsdrrestore: Review the dry run report above to see what will be changed and decide if you
want to continue the restore or not. Do you want to continue? (yes/no) yes
ccrrestore: 1/14: Test tool chain successful
ccrrestore: 2/14: Test GPFS shutdown successful
ccrrestore: 3/14: Setup local working directories successful
ccrrestore: 4/14: Archiving CCR directories on quorum nodes successful
ccrrestore: 5/14: Kill GPFS mmsdrserv daemon successful
ccrrestore: 6/14: Copy Paxos state files from quorum nodes successful
ccrrestore: 7/14: Getting most recent Paxos state file successful
ccrrestore: 8/14: Get cksum of files in committed directory successful
ccrrestore: 9/14: WARNING: Intact ccr.nodes file with version 5 missing in committed
directory
ccrrestore: 9/14: INFORMATION: Intact ccr.disks found (file id: 2 version: 1)
ccrrestore: 9/14: INFORMATION: Intact mmLockFileDB found (file id: 3 version: 1)
ccrrestore: 9/14: INFORMATION: Intact genKeyData found (file id: 4 version: 1)
ccrrestore: 9/14: INFORMATION: Intact genKeyDataNew found (file id: 5 version: 2)
ccrrestore: 9/14: INFORMATION: Intact mmsdrfs found (file id: 6 version: 23)
ccrrestore: 9/14: INFORMATION: Intact mmsysmon.json found (file id: 7 version: 1)
ccrrestore: 9/14: Parsing committed file list successful
ccrrestore: 10/14: Pulling committed files from quorum nodes successful
ccrrestore: 11/14: File name: 'ccr.nodes' file state: UPDATED remark: 'OLD (v5,
((n1,e6),103), f20ea9e3)'
ccrrestore: 11/14: File name: 'ccr.disks' file state: MATCHING remark: 'none'
ccrrestore: 11/14: File name: 'mmLockFileDB' file state: MATCHING remark: 'none'
ccrrestore: 11/14: File name: 'genKeyData' file state: MATCHING remark: 'none'
ccrrestore: 11/14: File name: 'genKeyDataNew' file state: MATCHING remark: 'none'
ccrrestore: 11/14: File name: 'mmsdrfs' file state: MATCHING remark: 'none'
ccrrestore: 11/14: File name: 'mmsysmon.json' file state: MATCHING remark: 'none'
ccrrestore: 11/14: Patching Paxos state successful
ccrrestore: 12/14: Pushing CCR files successful
3. Issue the mmccr check command as shown in the following example to check the status of the CCR:
Important: The CCR restore script recovers the CCR from the fragments of CCR configuration files
that are available on the cluster nodes. The recovered CCR might contain the details of an old
cluster configuration. If a recent backup is available, it might be better to use that backup, even if
mmsdrrestore --ccr-repair is able to restore from the available fragments.
Note: In this example, the entire /var/mmfs directory is deleted on all nodes in the cluster to
simulate this case.
2. Issue the mmsdrrestore command with -F and -a options as shown in the following example to
restore backup:
# mmsdrrestore -F /root/CCRbackup_20210708115924.tar.gz -a
# mmsdrrestore -p node-21
genkeyData1
# mmsdrrestore -p node-21
genkeyData1
4. Verify the GPFS state at the cluster level by using the mmgetstate command as shown in the
following example:
# mmgetstate -a
Note: Based on the age of the CCR backup file that is used, the cluster might be recovered to an old
cluster configuration. It is recommended to take regular backups of the CCR.
mmccr check -Y -e
In the following example, the next-to-last line of the output indicates that one or more files are corrupted
or lost in the CCR committed directory of the current node:
# mmccr check -Y -e
mmccr::HEADER:version:reserved:reserved:NodeId:CheckMnemonic:ErrorCode:ErrorMsg:
ListOfFailedEntities:ListOfSucceedEntities:Severity:
mmccr::0:1:::1:CCR_CLIENT_INIT:0:::/var/mmfs/ccr,/var/mmfs/ccr/committed,/var/mmfs/ccr/
ccr.nodes,
Security,/var/mmfs/ccr/ccr.disks:OK:
mmccr::0:1:::1:FC_CCR_AUTH_KEYS:0:::/var/mmfs/ssl/authorized_ccr_keys:OK:
mmccr::0:1:::1:FC_CCR_PAXOS_CACHED:0:::/var/mmfs/ccr/cached,/var/mmfs/ccr/cached/ccr.paxos:OK:
mmccr::0:1:::1:FC_CCR_PAXOS_12:0:::/var/mmfs/ccr/ccr.paxos.1,/var/mmfs/ccr/ccr.paxos.2:OK:
mmccr::0:1:::1:PC_LOCAL_SERVER:0:::c80f5m5n01.gpfs.net:OK:
In the following example, the next-to-last line indicates that none of the files in the CCR committed
directory of the current node are corrupted or lost:
# mmccr check -Y -e
mmccr::HEADER:version:reserved:reserved:NodeId:CheckMnemonic:ErrorCode:ErrorMsg:
ListOfFailedEntities:ListOfSucceedEntities:Severity:
mmccr::0:1:::1:CCR_CLIENT_INIT:0:::/var/mmfs/ccr,/var/mmfs/ccr/committed,/var/mmfs/ccr/
ccr.nodes,
Security,/var/mmfs/ccr/ccr.disks:OK:
mmccr::0:1:::1:FC_CCR_AUTH_KEYS:0:::/var/mmfs/ssl/authorized_ccr_keys:OK:
mmccr::0:1:::1:FC_CCR_PAXOS_CACHED:0:::/var/mmfs/ccr/cached,/var/mmfs/ccr/cached/ccr.paxos:OK:
mmccr::0:1:::1:FC_CCR_PAXOS_12:0:::/var/mmfs/ccr/ccr.paxos.1,/var/mmfs/ccr/ccr.paxos.2:OK:
mmccr::0:1:::1:PC_LOCAL_SERVER:0:::c80f5m5n01.gpfs.net:OK:
mmccr::0:1:::1:PC_IP_ADDR_LOOKUP:0:::c80f5m5n01.gpfs.net,0.000:OK:
mmccr::0:1:::1:PC_QUORUM_NODES:0:::192.168.80.181,192.168.80.182:OK:
mmccr::0:1:::1:FC_COMMITTED_DIR:0::0:7:OK:
mmccr::0:1:::1:TC_TIEBREAKER_DISKS:0:::1:OK:
errpt -a
• On a Linux node, create a tar file of all the entries in the /var/log/messages file from all
nodes in the cluster or the nodes that experienced the failure. For example, issue the following
command to create a tar file that includes all nodes in the cluster:
• On a Windows node, use the Export List... dialog in the Event Viewer to save the event log to a
file.
b. A master GPFS log file that is merged and chronologically sorted for the date of the failure (see
“Creating a master GPFS log file” on page 258).
c. If the cluster was configured to store dumps, then collect any internal GPFS dumps written to that
directory relating to the time of the failure. The default directory is /tmp/mmfs.
d. On a failing Linux node, gather the name and version of all installed software packages by issuing
this command:
rpm -qa
e. On a failing AIX node, gather the name, most recent level, state, and description of all installed
software packages by issuing this command:
lslpp -l
f. To gather file system attributes for all of the failing file systems, issue:
mmlsfs Device
g. To gather the current configuration and state of the disks for all of the failing file systems, issue:
mmlsdisk Device
gpfs.snap --deadlock
If the cluster size is large or the maxFilesToCache setting is high (greater than 1M), then issue the
following command:
mmtracectl --start
5. Recreate the problem when possible or wait for the assert to be triggered again.
6. Once the assert is encountered on the node, turn off the trace facility by issuing:
mmtracectl --off
If traces were started on multiple clusters, then issue the mmtracectl --off command immediately
on all clusters.
7. Collect gpfs.snap output:
gpfs.snap
When you contact the IBM Support Center, the following will occur:
1. You will be asked for the information you collected in “Information to be collected before
contacting the IBM Support Center” on page 555.
Events
The recorded events are stored in the local database on each node. The user can get a list of recorded
events by using the mmhealth node eventlog command. Users can use the mmhealth node
show or mmhealth cluster show commands to display the active events in the node and cluster
respectively.
The recorded events can also be displayed through the GUI.
When you upgrade to IBM Storage Scale 5.0.5.3 or a later version, the nodes where no sqlite3 package
is installed have their RAS event logs converted to a new database format to prevent known issues. The
old RAS event log is emptied automatically. You can verify that the event log is emptied either by using the
mmhealth node eventlog command or in the IBM Storage Scale GUI.
Note: The event logs are updated only the first time IBM Storage Scale is upgraded to version 5.0.5.3 or
higher.
The following sections list the RAS events that are applicable to various components of the IBM Storage
Scale system:
AFM events
The following table lists the events that are created for the AFM component.
Table 72. Events for the AFM component
Description: The AFM cache fileset is not connected to its home server.
Cause: Shows that the connectivity between the AFM Gateway and the
mapped home server is lost.
User Action: The user action is based on the source of the disconnectivity.
Check the settings on both home and cache sites and correct the
connectivity issues. The state automatically changes to ACTIVE state after
solving the issues.
User Action: There are many reasons that can cause the cache to go to
DROPPED state. For more information, see the Monitoring fileset states for
AFM (DR) section in the IBM Storage Scale: Problem Determination Guide.
afm_cache_expired INFO ERROR no Message: Fileset {0} in {1} mode is now in the EXPIRED state.
User Action: Check the network connectivity to the home server as well as
the home server availability.
afm_cache_inactive STATE_CHANGE INFO no Message: The AFM cache fileset {0} is in the INACTIVE state.
Cause: N/A
afm_cache_recovery STATE_CHANGE WARNING no Message: The AFM cache fileset {0} in {1} mode is in the RECOVERY state.
Description: In this state, the AFM cache fileset recovers from a previous
failure and identifies changes that need to be synchronized to its home
server.
User Action: This state automatically changes back to ACTIVE when the
recovery is finished.
afm_cache_unmounted STATE_CHANGE ERROR no Message: The AFM cache fileset {0} is in unmounted state.
Cause: The AFM cache fileset is in this state when either the home server's
NFS-mount is not accessible, home server's exports are not exported
properly, or home server's export does not exist.
User Action: Resolve issues on the home server's site. Afterwards, this
state changes automatically.
afm_cache_up STATE_CHANGE INFO no Message: An 'Active' or 'Dirty' status is expected in the mmdiag --afm
command output, and the output shows that the cache is in a HEALTHY
state.
Cause: N/A
afm_cmd_requeued STATE_CHANGE WARNING no Message: Messages are requeued on the AFM fileset {0}. Details: {1}.
User Action: It is usually a transient state. Track this event. If the problem
remains, then contact IBM Support.
afm_event_connected STATE_CHANGE INFO no Message: The AFM node {0} has regained connection to the home site.
Details: {1}.
Cause: N/A
afm_event_disconnected STATE_CHANGE ERROR no Message: The AFM node {0} has lost connection to the home site. Fileset
{1}. Details: {2}.
User Action: Check the network connectivity to the home server as well as
the home server availability.
afm_failback_complete STATE_CHANGE WARNING no Message: The AFM cache fileset {0} in {1} mode is in the
FailbackCompleted state.
Cause: The independent writer failback is finished and needs further user
actions.
afm_failback_needed STATE_CHANGE ERROR no Message: The AFM cache fileset {0} in {1} mode is in the NeedFailback
state.
afm_failback_running STATE_CHANGE WARNING no Message: The AFM cache fileset {0} in {1} mode is in the
FailbackInProgress state.
User Action: No user action is needed at this point. After completion, the
state automatically changes to the FailbackCompleted state.
afm_failover_running STATE_CHANGE WARNING no Message: The AFM cache fileset {0} is in FailoverInProgress state.
User Action: No user action is needed at this point. The cache state is
moved automatically to the ACTIVE state when the failover is completed.
Cause: N/A
Cause: N/A
Cause: N/A
afm_fileset_expired INFO WARNING no Message: The contents of the AFM cache fileset {0} are expired.
Cause: The contents of a fileset expire either as a result of the fileset being
disconnected for the expiration timeout value or when the fileset is marked
as expired using the AFM administration commands. This event is triggered
through an AFM callback.
User Action: Check why the fileset is disconnected to refresh the contents.
afm_fileset_found INFO_ADD_ENTITY INFO no Message: The AFM fileset {0} was found.
Cause: An AFM fileset was detected through the appearance of the fileset
in the mmdiag --afm command output.
Cause: N/A
afm_fileset_unexpired INFO INFO no Message: The contents of the AFM cache fileset {0} are unexpired.
Description: The contents of the AFM cache filesets did not expire
and are available for operations. This event is triggered when the home is
reconnected and cache contents are available, or when the administrator runs
the mmafmctl unexpire command on the cache fileset. This event is
triggered through an AFM callback.
Cause: N/A
Cause: N/A
afm_fileset_unmounted STATE_CHANGE ERROR no Message: The AFM fileset {0} was unmounted because the remote side is
not reachable. Details: {1}.
Cause: After 300 seconds, the cache retries to connect to home, and it
moves to the Active state. If AFM is using the native GPFS protocol as
target, the cache state is moved to the Unmounted state because the local
mount of the remote file system is not accessible.
User Action: Remount the remote file system on the local cache cluster.
afm_fileset_vanished INFO_DELETE_ENTITY INFO no Message: The AFM fileset {0} has vanished.
Cause: The AFM fileset is not in use anymore. This is detected through the
absence of the fileset in the mmdiag --afm command output.
afm_flush_only INFO INFO no Message: The AFM cache fileset {0} is in the FlushOnly state.
Description: Indicates that operations are queued, but have not started to
flush to the home server.
Cause: N/A
afm_home_connected STATE_CHANGE INFO no Message: The AFM fileset {0} has regained connection to the home site.
Details: {1}.
Cause: N/A
afm_home_disconnected STATE_CHANGE ERROR no Message: The AFM fileset {0} has lost connection to the home site. Details:
{1}.
User Action: Check the network connectivity to the home server as well as
the home server availability.
Description: Clear TIPS events from AFM. The .pconflict directory is clean.
Cause: N/A
afm_pconflicts_storage TIP TIP no Message: The fileset {0} .pconflicts directory contains user data. Examine
and remove unused files to free storage.
afm_prim_init_fail STATE_CHANGE ERROR no Message: The AFM cache fileset {0} is in the PrimInitFail state.
Cause: This rare state appears if the initial creation of psnap0 on the
primary cache fileset failed.
User Action: Check whether the fileset is available and exported to be used
as primary. The gateway node should be able to access this mount and the
primary ID should be setup on the secondary gateway. You may try running
the mmafmctl convertToPrimary command on the primary fileset again.
afm_prim_init_running STATE_CHANGE WARNING no Message: The AFM primary cache fileset {0} is in the PrimInitProg state.
Cause: This AFM cache fileset is a primary fileset and synchronizing the
content of psnap0 to the secondary AFM cache fileset.
User Action: This state changes back to 'Active' automatically when the
synchronization is finished.
Description: Clear TIPS events from AFM. The .ptrash directory is clean.
Cause: N/A
afm_ptrash_storage TIP TIP no Message: The fileset {0} .ptrash directory contains user data. Examine and
remove unused files to free storage.
afm_queue_dropped STATE_CHANGE ERROR no Message: The AFM cache fileset {0} encountered an error synchronizing
with its remote cluster. Details: {1}.
Cause: This event occurs when a queue is dropped on the gateway node.
afm_queue_only STATE_CHANGE INFO no Message: The changes of AFM cache fileset {0} in {1} mode are not flushed
yet to home.
Cause: N/A
afm_recovery_failed STATE_CHANGE ERROR no Message: AFM recovery on fileset {0} failed with error {1}.
User Action: Recovery is retried on next access after the recovery retry
interval. Alternatively, you can manually resolve known problems and
recover the fileset.
afm_recovery_finished STATE_CHANGE INFO no Message: A recovery process ended for the AFM cache fileset {0}.
Cause: N/A
afm_recovery_running STATE_CHANGE WARNING no Message: AFM fileset {0} is triggered for recovery start.
User Action: The cache fileset state moves to the healthy state when
recovery is complete. Monitor this event.
afm_resync_needed STATE_CHANGE WARNING no Message: The AFM cache fileset {0} in {1} mode is in the NeedsResync
state.
Cause: The AFM cache fileset detects some accidental corruption of data
on the home server.
afm_rpo_miss STATE_CHANGE_EXTERNAL WARNING no Message: The AFM recovery point objective (RPO) is missed for {id}.
Description: The primary fileset is triggering an RPO snapshot, which is
expected to complete within a specified interval (RPO). This time interval is
exceeded.
afm_sensors_active TIP INFO no Message: The AFM perfmon sensor {0} is active.
Description: The AFM perfmon sensors are active. This event's monitor is
running only once an hour.
Cause: The value of the AFM perfmon sensors' period attribute is greater
than 0.
afm_sensors_inactive TIP TIP no Message: The following AFM perfmon sensor {0} is inactive.
Description: The AFM perfmon sensors are inactive. This event's monitor is
running only once an hour.
afm_sensors_not_configured TIP TIP no Message: The AFM perfmon sensor {0} is not configured.
Description: The AFM perfmon sensor does not exist in the mmperfmon
config show command output.
User Action: Include the sensors into the perfmon configuration by using
the mmperfmon config add --sensors SensorFile command.
An example for the configuration file can be found in the mmperfmon
command page.
Authentication events
The following table lists the events that are created for the AUTH component.
Table 73. Events for the Auth component
ad_smb_nfs_ready STATE_CHANGE INFO no Message: SMB and NFS monitoring has started.
ad_smb_not_yet_ready STATE_CHANGE WARNING no Message: AD authentication is configured, but the SMB monitoring is not
yet ready.
Cause: SMB monitoring has not started yet, but NFS is ready to process
requests.
User Action: Check as to why the SMB or CTDB is not yet running. This
problem might be caused by a temporary issue.
ads_cfg_entry_warn INFO WARNING no Message: {0} returned unknown result for item {1} query.
Cause: An internal error occurred while querying the external DNS server.
ads_down STATE_CHANGE ERROR FTDC upload Message: The external ADS server is unresponsive.
Description: The external ADS server is unresponsive.
Cause: The local node is unable to connect to any Active Directory Service
(ADS) server.
User Action: Verify the network connection and check whether the ADS
server is operational.
ads_failed STATE_CHANGE ERROR FTDC upload Message: The local winbindd service is unresponsive.
Description: The local winbindd service is unresponsive.
Cause: The local winbindd service, which is needed for ADS, is not
responding to ping requests.
User Action: Restart the winbindd service. If the service restart is not
successful, then perform the winbindd troubleshooting.
Cause: N/A
ads_warn INFO WARNING no Message: The external ADS server monitoring returned an unknown result.
Cause: An internal error occurred while monitoring the external ADS server.
User Action: Check the nameserver configuration for the missing AD-
specific settings.
dns_query_proto_fail TIP WARNING no Message: The {0} query failed for UDP and TCP.
dns_query_proto_ok TIP INFO no Message: The {0} query succeeded with UDP and TCP protocols.
ldap_down STATE_CHANGE ERROR no Message: The external LDAP server {0} is unresponsive.
User Action: Verify the network connection and check whether the LDAP
server is operational.
ldap_up STATE_CHANGE INFO no Message: The external LDAP server {0} is up.
Cause: N/A
nis_down STATE_CHANGE ERROR no Message: The external NIS server {0} is unresponsive.
User Action: Verify the network connection and check whether the NIS
server is operational.
User Action: Restart ypbind daemon. If the restart is not successful, then
perform ypbind troubleshooting.
nis_up STATE_CHANGE INFO no Message: The external NIS server {0} is up.
Cause: N/A
nis_warn INFO WARNING no Message: The external NIS monitoring returned unknown result.
sssd_restart INFO INFO no Message: The SSSD process is not running. Trying to start the SSSD
process.
Cause: N/A
sssd_warn INFO WARNING no Message: The SSSD process monitoring returned unknown result.
wnbd_restart INFO INFO no Message: WINBINDD process was not running. Trying to start the
WINBINDD process.
Cause: N/A
wnbd_warn INFO WARNING no Message: The WINBINDD process monitoring returned unknown result.
yp_restart INFO INFO no Message: The YPBIND process was not running. Trying to start the YPBIND
process.
Cause: N/A
yp_warn INFO WARNING no Message: The YPBIND process monitoring returned unknown result.
callhome_customer_info_disabled TIP INFO no Message: The required customer information for call home is not checked.
Description: The required customer information for call home is not
checked.
callhome_customer_info_filled TIP INFO no Message: All required customer information for call home was provided.
Description: All required customer information for call home was provided.
Cause: All required customer information for call home was provided by
using the mmcallhome info change command.
callhome_customer_info_missing TIP TIP no Message: The required customer information is not provided: {0}.
Description: Some of the required customer information was not provided,
but is required. For more information, see the Configuring call home to
enable manual and automated data upload section in the IBM Storage
Scale: Administration Guide.
User Action: Run the mmcallhome info change command to collect the
required information about the customer for call home capability.
callhome_hcalerts_ccr_failed STATE_CHANGE ERROR no Message: The data for the last health check monitoring cannot be updated
in the CCR.
Description: The data for the last health check monitoring cannot be
updated in the CCR.
Cause: The data for the last health check monitoring cannot be updated in
the Cluster Configuration Repository (CCR).
User Action: Ensure that your cluster has a quorum. If the cluster has
a quorum, then repair the CCR by following the steps mentioned in the
Repair of cluster configuration information when no CCR backup is available
section in the IBM Storage Scale: Problem Determination Guide.
callhome_hcalerts_disabled STATE_CHANGE INFO no Message: The health check monitoring feature is disabled.
User Action: To enable health check monitoring, you must enable call
home by using the mmcallhome capability enable command and set
'monitors_enabled = true' in the mmsysmonitor.conf file.
callhome_hcalerts_failed STATE_CHANGE ERROR no Message: The last health check monitoring was not successfully
processed.
Cause: The last health check monitoring was not successfully processed.
User Action: Check connectivity to the IBM ECuRep server, which includes
cabling, firewall, and proxy. For more information, check the output
of the mmcallhome status list --task sendfile --verbose
command.
callhome_hcalerts_noop STATE_CHANGE INFO no Message: No health check monitoring operation is performed on this node.
callhome_hcalerts_ok STATE_CHANGE INFO no Message: Call Home health check monitoring was successfully performed.
callhome_heartbeat_collection_failed STATE_CHANGE ERROR no Message: The data for the last call home heartbeat cannot be collected.
Description: The data for the last call home heartbeat cannot be collected.
Cause: The data for the last call home heartbeat cannot be collected.
User Action: Check whether you have enough free space in the
dataStructureDump.
User Action: To enable heartbeats, you must enable the call home
capability by using the mmcallhome capability enable command.
callhome_heartbeat_failed STATE_CHANGE ERROR no Message: The last call home heartbeat was not successfully sent.
Description: The last call home heartbeat was not successfully sent.
Cause: The last call home heartbeat was not successfully sent.
User Action: Check connectivity to the IBM ECuRep server, which includes
cabling, firewall, and proxy. For more information, check the output
of the mmcallhome status list --task sendfile --verbose
command.
callhome_heartbeat_ok STATE_CHANGE INFO no Message: Call Home heartbeats are successfully sent.
callhome_ptfupdates_ccr_failed STATE_CHANGE ERROR no Message: The data for the last ptf update check cannot be updated in the CCR.
Description: The data for the last ptf update check cannot be updated in
the CCR.
Cause: The data for the last ptf update check cannot be updated in the
CCR.
User Action: Ensure that your cluster has a quorum. If the cluster has
a quorum, then repair the CCR by following the steps mentioned in the
Repair of cluster configuration information when no CCR backup is available
section of the IBM Storage Scale: Problem Determination Guide.
callhome_ptfupdates_disabled STATE_CHANGE INFO no Message: The ptf update check feature is disabled.
Description: The ptf update check feature is disabled.
User Action: To enable ptf update, you must enable call home by using the
mmcallhome capability enable command and set 'monitors_enabled
= true' in the mmsysmonitor.conf file.
callhome_ptfupdates_failed STATE_CHANGE ERROR no Message: The last ptf update check was not successfully processed.
Description: The last ptf update check was not successfully processed.
Cause: The last ptf update check was not successfully processed.
User Action: Check connectivity to the IBM ECuRep server, which includes
cabling, firewall, and proxy. For more information, check the output
of the mmcallhome status list --task sendfile --verbose
command.
callhome_ptfupdates_noop STATE_CHANGE INFO no Message: No Call Home ptf update check operation is performed on this node.
Cause: The call home ptf update check is performed exclusively on the first
call home master node when it is not running in the cloud native storage
architecture.
callhome_ptfupdates_ok STATE_CHANGE INFO no Message: Call Home ptf update check was successfully performed.
callhome_sudouser_defined STATE_CHANGE INFO no Message: The sudo user variable is properly set up.
callhome_sudouser_not_exists STATE_CHANGE ERROR no Message: The sudo user '{0}' does not exist on this node.
Description: The sudo user does not exist on this node.
User Action: Create the sudo user on this node or specify a new sudo user
by using the mmchcluster --sudo-user <userName> command.
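For example, with a placeholder user name:
  mmchcluster --sudo-user gpfsadmin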
callhome_sudouser_not_needed STATE_CHANGE INFO no Message: Not monitoring sudo user configuration variable, since sudo wrappers are not being used.
callhome_sudouser_perm_missing STATE_CHANGE ERROR no Message: The sudo user is missing a recursive execute permission for the dataStructureDump: {0}.
Cause: The sudo user, specified in the IBM Storage Scale configuration,
cannot read or write call home directories in dataStructureDump.
User Action: Ensure that the sudo user, specified in the IBM
Storage Scale settings, has the recursive execute permission for the
dataStructureDump.
callhome_sudouser_perm_not_needed STATE_CHANGE INFO no Message: Not monitoring sudo user permissions, since sudo wrappers are not being used.
callhome_sudouser_perm_ok STATE_CHANGE INFO no Message: The sudo user has correct permissions for the
dataStructureDump: {0}.
Cause: The sudo user, specified in the IBM Storage Scale configuration, can
read and write call home directories in the dataStructureDump.
callhome_sudouser_undefined STATE_CHANGE ERROR no Message: The sudo user variable is not set up in the IBM Storage Scale configuration.
Description: The sudo user variable is not set up in the IBM Storage Scale
configuration.
User Action: Specify a valid non-root sudo user by using the mmchcluster
--sudo-user <userName> command.
ces_bond_degraded STATE_CHANGE WARNING no Message: Some secondaries of the CES-bond {0} went down.
ces_bond_down STATE_CHANGE ERROR no Message: All secondaries of the CES-bond {0} are down.
ces_bond_up STATE_CHANGE INFO no Message: All secondaries of the CES bond {0} are working as expected.
Cause: N/A
ces_ips_hostable TIP INFO no Message: All declared CES-IPs could be hosted on this node.
Cause: N/A
ces_ips_not_hostable TIP TIP no Message: One or more CES IPs cannot be hosted on this node (no interface).
User Action: Check whether interfaces are active and the CES group
assignment is correct. For more information, run the mmces address
list --full-list command.
Description: The CES monitor daemon cannot run the ces load monitor
successfully.
Description: The CES monitor daemon runs the ces load monitor successfully.
Cause: N/A
ces_many_tx_errors STATE_CHANGE ERROR FTDC upload Message: CES NIC {0} had many TX errors since the last monitoring cycle.
Description: This CES-related NIC had many TX errors since the last
monitoring cycle.
Cause: The /proc/net/dev lists more TX errors for this adapter since the
last monitoring cycle.
ces_monitord_down STATE_CHANGE WARNING no Message: The CES-IP background monitor is not running. CES-IPs cannot
be configured.
User Action: If the CES-IP background monitor stops without any known
reason, check the local /var file system. Restart it by using the mmces
node resume --start command.
Cause: N/A
ces_monitord_warn INFO WARNING no Message: The IBM Storage Scale CES IP assignment monitor
(mmcesmonitord) alive check cannot be executed, which can be a timeout
issue.
ces_network_affine_ips_not_defined STATE_CHANGE WARNING no Message: No CES IP addresses can be applied on this node. Check group membership of node and IP addresses.
User Action: Use the mmces address add command to add CES IP addresses
to the global pool or to a group for which this node is a member.
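For example, the following adds a placeholder address to the global pool and another one to a placeholder group:
  mmces address add --ces-ip 192.0.2.10
  mmces address add --ces-group groupA --ces-ip 192.0.2.11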
ces_network_connectivity_down STATE_CHANGE ERROR no Message: CES NIC {0} cannot connect to the gateway.
Description: This CES-related NIC cannot connect to the gateway.
User Action: Check the network configuration of the network adapter, path
to the gateway, and gateway itself.
ces_network_connectivity_up STATE_CHANGE INFO no Message: CES NIC {0} can connect to the gateway and responds to the sent
connections-checking packets.
Cause: N/A
User Action: Enable this network adapter or check for problems in system
logs.
ces_network_ips_down STATE_CHANGE WARNING no Message: No CES IPs were assigned to this node.
Cause: No network adapters have the CES-relevant IPs, which makes the
node unavailable for the CES clients.
User Action: If CES is FAILED, then analyze the reason. If there are not
enough IPs in the CES pool for this node, then extend the pool.
ces_network_ips_not_assignable STATE_CHANGE ERROR FTDC upload Message: No NICs are set up for CES.
Description: No network adapters are properly configured for CES.
Cause: There are no network adapters with a static IP, matching any of the
IPs from the CES pool.
User Action: Set up the static IPs and netmasks of the CES NICs in the
network interface configuration scripts, or add new matching CES IPs to the
pool. The static IPs must not be aliased.
User Action: Run the mmces address add command to add CES IP
addresses. Check the group membership of IP addresses and nodes.
ces_network_ips_up STATE_CHANGE INFO no Message: CES-relevant IPs are served by found NICs.
Cause: N/A
ces_network_link_down STATE_CHANGE ERROR no Message: Physical link of the CES NIC {0} is down.
Cause: The LOWER_UP flag is not set for this NIC in the output of ip a.
ces_network_link_up STATE_CHANGE INFO no Message: Physical link of the CES NIC {0} is up.
Cause: N/A
Description: The CES monitor daemon could not run the ces network monitor
successfully.
Cause: N/A
Cause: N/A
ces_no_tx_errors STATE_CHANGE INFO no Message: CES NIC {0} had no or a tiny number of TX errors.
Cause: N/A
dir_sharedroot_perm_ok STATE_CHANGE INFO no Message: The permissions of the sharedroot directory are correct: {0}.
Cause: N/A
dir_sharedroot_perm_problem STATE_CHANGE WARNING no Message: The permissions of the sharedroot directory are not sufficient {0}.
Cause: The cesSharedRoot directory did not have 'rx' permissions for
'group' and 'others'.
User Action: Provide 'rx' permissions for 'group' and 'others' for the
cesSharedRoot directory.
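For example, assuming the cesSharedRoot directory is /gpfs/fs0/ces (a placeholder path), the following command grants the required permissions:
  chmod g+rx,o+rx /gpfs/fs0/ces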
handle_network_problem_info INFO_EXTERNAL INFO no Message: Handle network problem - Problem: {0}, Argument: {1}.
Description: Information about network-related reconfigurations. For
example, enable or disable IPs, and assign or unassign IPs.
move_cesip_from INFO_EXTERNAL INFO no Message: Address {0} was moved from this node to {1}.
move_cesip_to INFO_EXTERNAL INFO no Message: Address {0} was moved from {1} to this node.
move_cesips_info INFO_EXTERNAL INFO no Message: A move request for IP addresses was executed. Reason {0}.
ces_ips_warn INFO WARNING no Message: The IBM Storage Scale CES IP assignment monitor cannot be
executed, which can be a timeout issue.
ces_ips_all_unassigned STATE_CHANGE ERROR no Message: All {0} declared CES IPs are unassigned.
ces_ips_assigned STATE_CHANGE INFO no Message: All {0} expected CES IPs are assigned.
ces_ips_unassigned STATE_CHANGE WARNING no Message: {0} of {1} declared CES IPs are unassigned.
cluster_state_manager_resend INFO INFO no Message: The CSM requests resending all information.
Description: The CSM requests resending all information.
cluster_state_manager_reset INFO INFO no Message: Clear memory of cluster state manager for this node.
Description: A reset request for the monitor state manager was received.
Cause: A reset request for the monitor state manager was received.
component_state_change INFO INFO no Message: The state of component {0} changed to {1}.
entity_state_change INFO INFO no Message: The state of {0} {1} of the component {2} changed to {3}.
eventlog_cleared INFO INFO no Message: On the node {0}, the eventlog was cleared.
Description: The user cleared the eventlog with the mmhealth node
eventlog --clearDB command. This command also clears the events
of the mmces events list command.
heartbeat_missing STATE_CHANGE ERROR no Message: CSM is missing a heartbeat from the node {0}.
Cause: The specified cluster node did not send a heartbeat to the Cluster
State Manager (CSM).
heartbeat_missing_server_unreachable STATE_CHANGE ERROR no Message: CSM is missing a heartbeat from node {0}, which might be due to the node, the network, or the processes being down.
node_state_change INFO INFO no Message: The state of this node is changed to {0}.
User Action: Run the mmces node resume command to stop the node
from being suspended.
service_added INFO INFO no Message: On the node {0}, the {1} monitor was started.
service_no_pod_data STATE_CHANGE WARNING no Message: A request to {id} did not yield expected health data.
Cause: The service is running in a different POD and does not respond to
requests regarding its health state.
User Action: Check that all pods are running in the container environment.
The event can be manually cleared by using the mmhealth event
resolve service_no_pod_data <id> command.
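For example, with a placeholder identifier:
  mmhealth event resolve service_no_pod_data pod1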
service_pod_data STATE_CHANGE INFO no Message: The request to {id} did return health data as expected.
service_removed INFO INFO no Message: On the node {0} the {1} monitor was removed.
service_reset STATE_CHANGE INFO no Message: The service {0} on node {1} was reconfigured, and its events were
cleared.
service_running STATE_CHANGE INFO no Message: The service {0} is running on node {1}.
service_stopped STATE_CHANGE INFO no Message: The service {0} is stopped on node {1}.
singleton_sensor_off INFO INFO no Message: The singleton sensors of pmsensors are turned off.
singleton_sensor_on INFO INFO no Message: The singleton sensors of pmsensors are turned on.
webhook_url_abort INFO WARNING no Message: Webhook URL {0} was disabled because a fatal runtime
error was encountered. For more information, see the monitoring logs
in /var/adm/ras/mmsysmonitor.log.
User Action: Check that the webhook URL is reachable and re-enable the
URL by using the mmhealth config webhook add command.
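For example, with a placeholder URL:
  mmhealth config webhook add https://events.example.com/scale-webhook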
webhook_url_communication INFO INFO no Message: Webhook URL {0} was not able to receive event information.
Description: The system health framework was not able to send event
information to a configured webhook URL.
Cause: The system health framework was not able to send event
information.
webhook_url_disabled INFO WARNING no Message: Webhook URL {0} was disabled as too many failures occurred.
User Action: Check that the webhook URL is reachable and re-enable the
URL by using the mmhealth config webhook add command.
webhook_url_reset INFO INFO no Message: Webhook URL {0} communication was set back to a HEALTHY
state.
Description: The system health framework set this webhook URL status
back to a HEALTHY state after being disabled because of repeated failures.
Cause: The system health framework set this webhook URL status back to
a HEALTHY state.
webhook_url_restored INFO INFO no Message: Webhook URL {0} communication was restored and event
information was successfully sent.
Cause: The system health framework was able to send event information
to the webhook URL.
webhook_url_ssl_validation INFO WARNING no Message: Communication to webhook URL {} was established, but server-
side certificate validation failed and was disabled. Check the HTTPS server
configuration to ensure that this disabling is the intended behavior.
User Action: Check that the webhook URL has a valid SSL certificate
and re-enable the URL by using the mmhealth config webhook add
command.
User Action: If the recovering state is unexpected, then refer to the Disk
issues section in the IBM Storage Scale: Problem Determination Guide.
Cause: A disk is in unrecovered state. The metadata scan might have failed.
User Action: If the unrecovered state is unexpected, then refer to the Disk
issues section in the IBM Storage Scale: Problem Determination Guide.
User Action: If the down state is unexpected, then refer to the Disk issues
section in the IBM Storage Scale: Problem Determination Guide. The failed
disk might be a descriptor disk.
disk_down_change STATE_CHANGE INFO no Message: Disk {0} is reported as down because the configuration changed.
FS={1}, reason code={2}.
Cause: An IBM Storage Scale callback event reported that a disk is in the
down state because the configuration was changed.
User Action: If the down state is unexpected, then see the Disk issues
section in the IBM Storage Scale: Problem Determination Guide.
disk_down_del STATE_CHANGE INFO no Message: Disk {0} is reported as down because it was deleted. FS={1},
reason code={2}.
Cause: An IBM Storage Scale callback event reported that a disk is in the
down state because it was deleted.
User Action: If the down state is unexpected, then see the Disk issues
section in the IBM Storage Scale: Problem Determination Guide.
disk_down_io STATE_CHANGE ERROR no Message: Disk {0} is reported as down because of an I/O issue. FS={1},
reason code={2}.
Cause: An IBM Storage Scale callback event reported that a disk is in the
down state because of an I/O issue.
User Action: If the down state is unexpected, then see the Disk issues
section in the IBM Storage Scale: Problem Determination Guide.
disk_down_rpl STATE_CHANGE INFO no Message: Disk {0} is reported as down because it was replaced. FS={1},
reason code={2}.
Cause: An IBM Storage Scale callback event reported that a disk is in the
down state because it was replaced.
User Action: If the down state is unexpected, then see the Disk issues
section in the IBM Storage Scale: Problem Determination Guide.
disk_down_unexpected STATE_CHANGE ERROR no Message: Disk {0} is reported as unexpectedly down. FS={1},
reason code={2}.
Cause: An IBM Storage Scale callback event reported a disk in the down
state for an unexpected reason.
User Action: If the down state is unexpected, then refer to the Disk issues
section in the IBM Storage Scale: Problem Determination Guide.
disk_down_unknown STATE_CHANGE WARNING no Message: Disk {0} is reported as down for an unknown reason. FS={1},
reason code={2}.
Description: A disk is reported as down for an unknown reason. The disk
was probably stopped.
Cause: An IBM Storage Scale callback event reported a disk in the down
state for an unknown reason. The disk was probably stopped or suspended.
User Action: If the down state is unexpected, then refer to the Disk issues
section in the IBM Storage Scale: Problem Determination Guide.
disk_failed_cb INFO_EXTERNAL INFO no Message: Disk {0} is reported as failed. FS={1}. Affected NSD servers are
notified about the disk_down state.
User Action: If the failure state is unexpected, then see the Disk issues
section in the IBM Storage Scale: Problem Determination Guide.
disk_fs_desc_missing STATE_CHANGE INFO no Message: Device {0} has no desc disks assigned in failure group(s) {1}.
disk_fs_desc_ok STATE_CHANGE INFO no Message: Device {0} descriptor disks identified for all failure groups.
Description: GPFS device has descriptor disks identified for all failure
groups as reported by the mmlsdisk command.
Cause: GPFS device has descriptor disks for all failure groups.
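For example, for a placeholder file system fs1, the following command lists the disks together with the failure groups that hold file system descriptor copies:
  mmlsdisk fs1 -L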
disk_io_err_cb STATE_CHANGE ERROR no Message: Disk {0} is reported as I/O error. node={1}.
Cause: An IBM Storage Scale callback event reported a disk I/O error.
User Action: For more information, see the Disk issues section in the IBM
Storage Scale: Problem Determination Guide.
Cause: A disk is not in use for an IBM Storage Scale file system, which can
be a valid situation.
Enclosure events
The following table lists the events that are created for the Enclosure component.
Table 79. Events for the enclosure component
adapter_bios_notavail STATE_CHANGE WARNING no Message: The BIOS level of adapter {0} is not available.
Description: The BIOS level of the adapter is not available. A BIOS update
might solve the problem.
adapter_bios_ok STATE_CHANGE INFO no Message: The BIOS level of adapter {0} is correct.
Cause: N/A
adapter_bios_wrong STATE_CHANGE WARNING no Message: The BIOS level of adapter {0} is wrong.
Description: The BIOS level of the adapter is not correct. The BIOS
firmware needs an update.
adapter_firmware_notavail STATE_CHANGE WARNING no Message: The firmware level of adapter {0} is not available.
adapter_firmware_ok STATE_CHANGE INFO no Message: The firmware level of adapter {0} is correct.
Cause: N/A
adapter_firmware_wrong STATE_CHANGE WARNING no Message: The firmware level of adapter {0} is wrong.
Description: The firmware level of the adapter is not correct. The adapter
firmware needs an update.
User Action: For more information, see the IBM Storage Scale: Problem
Determination Guide of the relevant system. Follow the maintenance
procedures for updating the adapter firmware.
current_failed STATE_CHANGE ERROR no Message: Current sensor {0} measured wrong current.
Cause: N/A
current_warn STATE_CHANGE WARNING no Message: Current sensor {0} might be facing an issue.
User Action: For more information, see the IBM Storage Scale: Problem
Determination Guide of the relevant system.
Cause: N/A
User Action: Verify that all DIMMs in all canisters have the same
specification (size, speed, etc). The event can be manually cleared by using
the mmhealth event resolve dimm_config_mismatch command.
Cause: N/A
User Action: Install the enclosure door. Verify the door state by using the
mmlsenclosure all -L command. For more help, contact IBM support.
User Action: Close the enclosure door. Verify the door state by using the
mmlsenclosure all -L command.
Cause: N/A
User Action: For more information, see the IBM Storage Scale: Problem
Determination Guide of the relevant system.
Cause: N/A
drive_firmware_notavail STATE_CHANGE WARNING no Message: The firmware level of drive {0} is not available.
User Action: For more information, see the IBM Storage Scale: Problem
Determination Guide of the relevant system. Follow the maintenance
procedures for updating the drive firmware. If the issue persists, then
contact IBM support.
drive_firmware_ok STATE_CHANGE INFO no Message: The firmware level of drive {0} is correct.
Cause: N/A
drive_firmware_wrong STATE_CHANGE WARNING no Message: The firmware level of drive {0} is wrong.
Description: The firmware level of the drive is not correct. The drive
firmware needs an update.
User Action: For more information, see the IBM Storage Scale: Problem
Determination Guide of the relevant system. Follow the maintenance
procedures for updating the drive firmware. If the issue persists, then
contact IBM support.
Cause: N/A
enclosure_firmware_notavail STATE_CHANGE WARNING no Message: The firmware level of enclosure {0} is not available.
enclosure_firmware_ok STATE_CHANGE INFO no Message: The firmware level of enclosure {0} is correct.
Cause: N/A
enclosure_firmware_unknown STATE_CHANGE WARNING no Message: The firmware level of enclosure {0} is unknown.
enclosure_firmware_wrong STATE_CHANGE WARNING no Message: The firmware level of enclosure {0} is wrong.
Cause: N/A
Cause: N/A
Cause: A GNR enclosure, which was previously listed in the IBM Storage
Scale configuration, is no longer found.
User Action: Check whether the ESM is installed and operational. For more
information, see the IBM Storage Scale: Problem Determination Guide of the
relevant system.
User Action: For more information, see the IBM Storage Scale: Problem
Determination Guide of the relevant system.
Cause: N/A
User Action: For more information, see the IBM Storage Scale: Problem
Determination Guide of the relevant system.
Cause: N/A
User Action: Check the enclosure. Insert or replace fan. If the problem
remains, then contact IBM support.
User Action: Replace the fan. Contact IBM support for a service action.
Cause: N/A
fan_speed_high STATE_CHANGE WARNING service ticket Message: Fan {0} speed is too high.
Description: Fan speed is out of tolerance because it is too high.
User Action: For more information, check the enclosure cooling module
LEDs for fan faults.
fan_speed_low STATE_CHANGE WARNING service ticket Message: Fan {0} speed is too low.
Description: Fan speed is out of tolerance because it is too low.
User Action: For more information, check the enclosure cooling module
LEDs for fan faults.
no_enclosure_data STATE_CHANGE WARNING no Message: Enclosure data and enclosure state information cannot be
queried.
power_high_current STATE_CHANGE WARNING service ticket Message: Power supply {0} reports high current.
Description: The DC power supply current is greater than the threshold.
power_high_voltage STATE_CHANGE WARNING service ticket Message: Power supply {0} reports high voltage.
Description: The DC power supply voltage is greater than the threshold.
Cause: The hardware monitor reports that power is not being supplied to
the power supply.
User Action: Check whether the power supply is installed and operational.
For more information, see the IBM Storage Scale: Problem Determination
Guide of the relevant system.
power_supply_config_mismatch STATE_CHANGE_EXTERNAL ERROR service ticket Message: Enclosure has an inconsistent power supply configuration.
Description: Inconsistent power supply configuration.
Cause: The power supplies in the enclosure do not fit to each other.
User Action: Verify that all power supplies in all canisters have the same
specification. The event can be manually cleared by using the mmhealth
event resolve power_supply_config_mismatch command.
power_supply_config_ok STATE_CHANGE_EXTERNAL INFO no Message: Enclosure has a correct power supply configuration.
Description: The power supply configuration is OK.
Cause: N/A
Cause: The hardware monitor reports that a power supply has failed.
User Action: For more information, see the IBM Storage Scale: Problem
Determination Guide of the relevant system.
Cause: The hardware monitor reports that the power supply is turned off.
User Action: Make sure that the power supply continues to get power, for
example, that the power cable is plugged in. If the problem persists, see the
IBM Storage Scale: Problem Determination Guide of the relevant system.
Cause: N/A
Cause: The hardware monitor reports that a power supply is switched off.
The requested-on bit is off, which means that the power supply was not
switched on manually or the requested-on bit was not set.
User Action: For more information, see the IBM Storage Scale: Problem
Determination Guide of the relevant system.
Cause: N/A
temp_bus_failed STATE_CHANGE WARNING service ticket Message: Temperature sensor {0} I2C bus has failed.
Description: Temperature sensor I2C bus has failed.
temp_high_critical STATE_CHANGE WARNING no Message: Temperature sensor {0} measured a high temperature value.
temp_high_warn STATE_CHANGE WARNING no Message: Temperature sensor {0} has measured a high temperature value.
temp_low_critical STATE_CHANGE WARNING no Message: Temperature sensor {0} has measured a temperature that is less
than the low critical value.
temp_low_warn STATE_CHANGE WARNING no Message: Temperature sensor {0} has measured a temperature that is less
than the low warning value.
temp_sensor_failed STATE_CHANGE WARNING service ticket Message: Temperature sensor {0} has failed.
Description: A temperature sensor might be broken.
Cause: N/A
voltage_bus_failed STATE_CHANGE WARNING service ticket Message: Voltage sensor {0} communication with the I2C bus has failed.
Description: The voltage sensor cannot communicate with the I2C bus.
voltage_high_critical STATE_CHANGE WARNING no Message: Voltage sensor {0} measured a high voltage value.
Description: The voltage has exceeded the actual high critical threshold for
at least one sensor.
voltage_high_warn STATE_CHANGE WARNING no Message: Voltage sensor {0} has measured a high voltage value.
Description: The voltage has exceeded the actual high warning threshold
for at least one sensor.
voltage_low_critical STATE_CHANGE WARNING no Message: Voltage sensor {0} has measured a critical low voltage value.
Description: The voltage has fallen under the lower critical threshold.
voltage_low_warn STATE_CHANGE WARNING no Message: Voltage sensor {0} has measured a low voltage value.
Description: The voltage has fallen under the lower warning threshold.
Cause: N/A
Encryption events
The following table lists the events that are created for the Encryption component.
Table 80. Events for the Encryption component
encryption_configured INFO_ADD_ENTITY INFO no Message: New encryption provider for {id} is configured.
Cause: N/A
Cause: N/A
rkmconf_backend_err STATE_CHANGE ERROR no Message: RKM backend server {0} returned an unrecoverable error {1}.
User Action: Ensure that the specification of the backend key management
server in the RKM instance is correct and the key server is running on the
specified host. The event can be manually cleared by using the mmhealth
event resolve rkmconf_backend_err <event id> command.
rkmconf_backenddown_err STATE_CHANGE ERROR no Message: The RKM backend server {0} cannot be reached.
User Action: Ensure that the specification of the backend key management
server in the RKM instance is correct and the key server is running
on the specified host. The event can be manually cleared by using
the mmhealth event resolve rkmconf_backenddown_err <event
id> command.
Cause: The client or server certificate for the key server expired.
User Action: Follow the documented procedure to update the key server
and/or RKM configuration with a new client or server certificate. The
event can be manually cleared by using the mmhealth event resolve
rkmconf_certexp_err command.
Cause: N/A
Cause: The client or server certificate for the key server approaches its
expiration time.
User Action: Follow the documented procedure to update the key server
and/or RKM configuration with a new client or server certificate. The
event can be manually cleared by using the mmhealth event resolve
rkmconf_ccertexp_warn command.
rkmconf_certwarn_ok STATE_CHANGE INFO no Message: No certificates that are approaching the expiration time are
encountered.
Cause: N/A
User Action: Ensure that the content of the RKM configuration file
conforms with the documented format (regular setup), or that the
arguments that are provided to the mmkeyserv command conform to the
documentation (simplified setup). The event can be manually cleared by
using the mmhealth event resolve rkmconf_configuration_err
command.
Cause: N/A
rkmconf_filenotfound_err STATE_CHANGE ERROR no Message: The mmfsd daemon is not able to read the RKM configuration
file.
Cause: The file does not exist or its content is not valid.
rkmconf_fileopen_err STATE_CHANGE ERROR no Message: Cannot open RKM configuration file for reading {0}.
Cause: The RKM configuration file exists but cannot be opened for reading.
User Action: Check that, as root, you can open the RKM configuration
file with a text editor. The event can be manually cleared by using the
mmhealth event resolve rkmconf_fileopen_err command.
rkmconf_fileread_err STATE_CHANGE ERROR no Message: Cannot read RKM configuration file {0}.
User Action: Check that, as root, you can open the RKM configuration
file with a text editor. The event can be manually cleared by using the
mmhealth event resolve rkmconf_fileread_err command.
rkmconf_getkey_err STATE_CHANGE ERROR no Message: MEK {0} is not available from RKM backend server {1}.
Cause: Failed to retrieve the MEK from the RKM backend servers.
User Action: Ensure that the MEK specified by the UUID provided is
available from the RKM specified by using the mmkeyserv key show
command. The event can be manually cleared by using the mmhealth
event resolve rkmconf_getkey_err <event id> command.
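For example, one possible invocation in the simplified setup, with placeholder server and tenant names:
  mmkeyserv key show --server keyserver01.example.com --tenant devG1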
Cause: The RKM instance configuration is not correct. One of the attributes
is not valid or out of range.
Cause: The keystore file for the key management server is not accessible
or its content is not valid, or the ownership and/or permissions are too
permissive.
User Action: Ensure that the content of the keystore file conforms with
the documented format and that only root can read and write the file. The
event can be manually cleared by using the mmhealth event resolve
rkmconf_keystore_err command.
rkmconf_ok STATE_CHANGE INFO no Message: The RKM backend configuration is correct and working as
expected.
Cause: N/A
rkmconf_permission_err STATE_CHANGE ERROR no Message: Incorrect ownership and/or file system permissions for RKM
configuration file {0}.
Cause: The RKM configuration file was created with incorrect file system
permissions.
rkm_duplicate STATE_CHANGE ERROR no Message: RKM.conf contains duplicate RKM IDs {id}.
User Action: Verify that the rkmid is unique in all the stanzas in all the
RKM.conf files.
Cause: The client or server certificate for the key server expired.
User Action: Follow the documented procedure to update the key server
and/or RKM configuration with a new client or server certificate.
rkm_keyring STATE_CHANGE ERROR no Message: Could not open keyring file: {id}.
Description: The RKM client is not able to open the keyring file.
Cause: The RKM client is not able to open the keyring file.
User Action: Ensure that the content of the keystore file conforms with the
documented format and that only root can read and write the file.
User Action: Verify that a key with the specified label exists in the client
keystore.
Cause: N/A
User Action: Verify that the passphrase for the client keystore is correct.
auditc_auditlogfile STATE_CHANGE ERROR no Message: Unable to open or append to the auditLog {1} files for file system
{0}.
Cause: N/A
User Action: Check whether the audited file system is mounted on the
node.
auditc_auth_failed STATE_CHANGE ERROR no Message: Authentication error encountered in audit consumer for group {1}
for file system {0}.
Cause: N/A
auditc_brokerconnect STATE_CHANGE ERROR no Message: Unable to connect to Kafka broker server {1} for file system {0}.
Cause: N/A
auditc_compress STATE_CHANGE WARNING no Message: Could not compress for audit log file {1}.
Cause: N/A
auditc_createkafkahandle STATE_CHANGE ERROR no Message: Failed to create audit consumer Kafka handle for file system {0}.
Cause: N/A
auditc_err STATE_CHANGE ERROR no Message: Error encountered in audit consumer for file system {0}.
Cause: N/A
auditc_flush_auditlogfile STATE_CHANGE ERROR no Message: Unable to flush the auditLog {1} files for file system {0}.
Cause: N/A
User Action: Check whether the file system is mounted on the node.
auditc_flush_errlogfile STATE_CHANGE ERROR no Message: Unable to flush the errorLog file for file system {0}.
Cause: N/A
User Action: Check whether the file system is mounted on the node.
auditc_found INFO_ADD_ENTITY INFO no Message: Audit consumer for file system {0} was found.
Cause: N/A
auditc_initlockauditfile STATE_CHANGE ERROR no Message: Failed to indicate to systemctl on successful consumer startup
sequence for file system {0}.
Cause: N/A
auditc_loadkafkalib STATE_CHANGE ERROR no Message: Unable to initialize file audit consumer for file system {0}. Failed
to load librdkafka library.
Cause: N/A
User Action: Check the installation of librdkafka libraries and retry the
mmaudit command.
auditc_mmauditlog STATE_CHANGE ERROR no Message: Unable to append to file {1} for file system {0}.
Cause: N/A
User Action: Check that the audited file system is mounted on the node.
Ensure that the file system to be audited is in a HEALTHY state, and then
retry by using the mmaudit disable/enable command.
Cause: N/A
auditc_offsetfetch STATE_CHANGE ERROR no Message: Failed to fetch topic ({1}) offset for file system {0}.
Cause: N/A
auditc_offsetstore STATE_CHANGE ERROR no Message: Failed to store an offset for file system {0}.
Cause: N/A
auditc_ok STATE_CHANGE INFO no Message: File Audit consumer for file system {0} is running.
Cause: N/A
auditc_service_failed STATE_CHANGE ERROR no Message: File audit consumer {1} for file system {0} is not running.
Cause: N/A
auditc_service_ok STATE_CHANGE INFO no Message: File audit consumer service for file system {0} is running.
Cause: N/A
auditc_setconfig STATE_CHANGE ERROR no Message: Failed to set configuration for audit consumer for file system {0}
and group {1}.
Cause: N/A
auditc_setimmutablity STATE_CHANGE WARNING no Message: Could not set immutability on for auditLogFile {1}.
Cause: N/A
auditc_topicsubscription STATE_CHANGE ERROR no Message: Failed to subscribe to topic ({1}) for file system {0}.
Cause: N/A
auditc_vanished INFO_DELETE_ENTITY INFO no Message: Audit consumer for file system {0} has vanished.
Description: An audit consumer that was listed in the IBM Storage Scale
configuration was removed.
Cause: N/A
auditc_warn STATE_CHANGE WARNING no Message: Warning encountered in audit consumer for file system {0}.
Cause: N/A
auditp_auth_err STATE_CHANGE ERROR no Message: Error obtaining authentication credentials or configuration for
producer; error message: {2}.
Cause: N/A
User Action: Verify that the file audit log is properly configured. Disable
and enable the file audit log by using the mmmsgqueue and the mmaudit
commands.
Cause: N/A
auditp_auth_warn STATE_CHANGE WARNING no Message: Authentication credentials for Kafka could not be obtained. An
attempt to update credentials is performed later. Message: {2}.
Cause: N/A
auditp_create_err STATE_CHANGE ERROR no Message: Error encountered while creating a new (loading or configuring)
event producer; error message: {2}.
Cause: N/A
User Action: Verify that the correct gpfs.librdkafka is installed. For more
information, check /var/adm/ras/mmfs.log.latest.
auditp_found INFO_ADD_ENTITY INFO no Message: New event producer for file system {2} was configured.
Cause: N/A
auditp_log_err STATE_CHANGE ERROR no Message: Error opening or writing to event producer log file.
Cause: N/A
auditp_msg_send_err STATE_CHANGE WARNING no Message: Failed to send message to target sink for file system {2}; error
message: {3}.
Cause: N/A
User Action: Check connectivity to Kafka broker and topic and check
whether the broker can accept new messages. For more information,
check the /var/adm/ras/mmfs.log.latest and /var/adm/ras/
mmmsgqueue.log files.
auditp_msg_send_stop STATE_CHANGE ERROR no Message: Failed to send more than {2} messages to target sink. Producer is
now shut down. No more messages are sent.
Cause: N/A
User Action: To re-enable events, disable and then re-enable file audit
logging by running the mmaudit <device> disable/enable command, as
shown in the example after this entry. If file audit logging fails again, then
you might need to disable and re-enable the message queue. Run the
mmmsgqueue enable/disable command, and then enable file audit logging.
If file audit logging continues to fail, then run the mmmsgqueue config
--remove command. Then, enable the message queue and enable file audit
logging.
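For example, for a placeholder device fs1, the first step of this procedure might look like the following:
  mmaudit fs1 disable
  mmaudit fs1 enable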
auditp_msg_write_err STATE_CHANGE WARNING no Message: Failed to write message to Audit log for file system {2}; error
message: {3}.
Cause: N/A
User Action: Ensure that the Audit fileset is healthy. For more information,
check the /var/adm/ras/mmfs.log.latest and /var/adm/ras/
mmmsgqueue.log.
auditp_msg_write_stop STATE_CHANGE ERROR no Message: Failed to write more than {2} messages to Audit log. Producer is
now shut down. No more messages are sent.
Cause: N/A
auditp_msgq_unsupported STATE_CHANGE ERROR no Message: Message queue is no longer supported and no clustered watch
folder or file audit logging commands can be run until the message queue is
removed.
Cause: N/A
auditp_ok STATE_CHANGE INFO no Message: Event producer for file system {2} is OK.
Cause: N/A
auditp_vanished INFO_DELETE_ENTITY INFO no Message: An event producer for file system {2} was removed.
Cause: N/A
clear_mountpoint_tip TIP INFO no Message: File system {0} was unmounted or mounted at its default
mountpoint.
Description: Clear any previous tip for this file system about using a non-
default mountpoint.
Cause: N/A
desc_disk_quorum_fail STATE_CHANGE WARNING no Message: Sufficient healthy descriptor disks are not found for file system
{0} quorum.
User Action: Check the health state of disks, which are declared as
descriptor disks for the file system. An insufficient number of healthy
descriptor disks might lead to a data access loss. For more information, see
the 'Disk issues' section in the IBM Storage Scale: Problem Determination
Guide.
desc_disk_quorum_ok STATE_CHANGE INFO no Message: Sufficient healthy descriptor disks are found for file system {0}
quorum.
Cause: N/A
exported_fs_available STATE_CHANGE INFO no Message: The file system {0} used for exports is available.
Cause: N/A
exported_path_available STATE_CHANGE INFO no Message: All NFS or SMB exported paths with undeclared mount points are
available.
Description: All NFS or SMB exported paths are available, which may
include automounted folders.
Cause: N/A
exported_path_unavail STATE_CHANGE WARNING no Message: At least for one NFS or SMB export ({0}), no GPFS file system is
mounted to the exported path.
Description: At least for one NFS or SMB exported path the intended file
system is unclear or unmounted. Those exports cannot be used and can
lead to a failure of the service.
Cause: At least one NFS or SMB exported path does not point to a mounted
GPFS file system according to /proc/mounts. The intended file system is
unknown because the export path does not match the default mountpoint
of any GPFS file system due to the use of autofs or bind mounts.
User Action: Check the mount states for NFS and SMB exported file
systems. This message can be related to autofs or bind-mounted file
systems.
Description: A file system listed in the IBM Storage Scale configuration was
detected.
Cause: N/A
Description: A file system listed in the IBM Storage Scale configuration was
not detected.
User Action: To verify that all expected file systems are mounted, run the
mmlsmount all_local -L command.
fs_forced_unmount STATE_CHANGE ERROR no Message: The file system {0} was {1} (forced) to unmount.
User Action: Check error messages and error log for further details. For
more information, see the 'File system forced unmount' topic in the IBM
Storage Scale: Problem Determination Guide.
fs_maintenance_mode STATE_CHANGE INFO no Message: File system {id} is set to maintenance mode.
Cause: N/A
fs_preunmount_panic STATE_CHANGE ERROR no Message: The file system {0} is unmounted because of an SGPanic
situation.
User Action: For more information, check error messages and error log in
the mmfs.log file.
fs_remount_mount STATE_CHANGE_EXTE INFO no Message: The file system {0} was mounted {1}.
RNAL
Description: A file system was mounted.
Cause: N/A
fs_unmount_info INFO_EXTERNAL INFO no Message: The file system {0} was unmounted {1}.
fs_working_mode STATE_CHANGE INFO no Message: File system {id} is not in maintenance mode.
Cause: N/A
User Action: Check the error message and the mmfs.log.latest log for
details. For more information, see the Checking and repairing a file
system topic in the IBM Storage Scale: Administration Guide. If the file
system is severely damaged, then follow the steps mentioned in the
MMFS_FSSTRUCT section in the IBM Storage Scale: Problem Determination
Guide.
fsstruct_fixed STATE_CHANGE INFO no Message: A file system {id} structure error has been marked as fixed.
Cause: N/A
ill_exposed_fs STATE_CHANGE WARNING no Message: The file system {0} has a data exposure risk as there are file(s)
where all replicas are on suspended disks, which makes it vulnerable to
potential data loss when a disk fails.
User Action: Run the mmrestripefs command against the file system.
ill_replicated_fs STATE_CHANGE WARNING no Message: The file system {0} is not properly replicated.
User Action: Run the mmrestripefs command against the file system.
ill_unbalanced_fs TIP TIP no Message: The file system {0} is not properly balanced.
User Action: Run the mmrestripefs command against the file system.
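For example, for a placeholder file system fs1, the -r option restores replication and the -b option rebalances data across disks:
  mmrestripefs fs1 -r
  mmrestripefs fs1 -b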
inode_high_error STATE_CHANGE ERROR no Message: The inode usage of fileset {id[1]} in file system {id[0]} reached a
nearly exhausted level. {0}.
Cause: The inode usage in the fileset reached a nearly exhausted level.
inode_high_warn STATE_CHANGE WARNING no Message: The inode usage of fileset {id[1]} in file system {id[0]} has
reached a warning level. {0}.
Description: The inode usage in the fileset has reached warning level.
Cause: The inode usage in the fileset has reached warning level.
inode_no_data STATE_CHANGE INFO no Message: No inode usage data is used for fileset {id[1]} in file system
{id[0]}.
Cause: N/A
inode_normal STATE_CHANGE INFO no Message: The inode usage of fileset {id[1]} in file system {id[0]} reached a
normal level.
Cause: N/A
inode_removed STATE_CHANGE INFO no Message: No inode usage data is used for fileset {id[1]} in file system
{id[0]}.
Cause: N/A
local_exported_fs_unavail STATE_CHANGE ERROR no Message: The local file system {0} that is used for exports is not mounted.
Description: A local file system that is used for export is not mounted.
low_disk_space_info INFO INFO no Message: Low disk space. StoragePool {1} in file system {0} has reached
the threshold as configured in a migration policy.
User Action: For more information, check the warning message and the
mmfs.log.latest log. See also the Using thresholds to migrate data
between pools section in the IBM Storage Scale: Administration Guide.
low_disk_space_warn STATE_CHANGE WARNING no Message: Low disk space. File system {0} has reached its high occupancy
threshold. StoragePool={1}.
Description: Low disk space. A file system has reached its high occupancy
threshold.
Cause: Low disk space. A file system has reached its high occupancy
threshold.
User Action: For more information, check the warning message and the
mmfs.log.latest log. Clear this event by using the mmhealth event
resolve low_disk_space_warn <fsname> command.
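For example, with a placeholder file system name:
  mmhealth event resolve low_disk_space_warn fs1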
no_disk_space_clear STATE_CHANGE INFO no Message: A disk space warning has been marked as fixed. A disk space
issue was resolved.
Description: A file system low disk or inode space warning was declared or
detected as resolved.
Cause: N/A
no_disk_space_inode STATE_CHANGE ERROR no Message: Fileset {2} runs out of space. Filesystem={0}, StoragePool={1},
reason={3}.
Cause: A fileset does not have sufficient inode space. Triggered by the
'noDiskSpace' callback.
User Action: For more information, check the error message and the
mmfs.log.latest log. Clear this event by using the mmhealth event
resolve no_disk_space_inode <fsname> command.
no_disk_space_warn STATE_CHANGE ERROR no Message: File system {0} runs out of space. StoragePool={1}, FSet={2},
reason={3}.
Cause: A file system does not have sufficient disk space. Triggered by the
'noDiskSpace' callback.
User Action: For more information, check the error message and the
mmfs.log.latest log. Clear this event by using the mmhealth event
resolve no_disk_space_warn <fsname> command.
not_default_mountpoint TIP TIP no Message: The mountpoint for file system {0} differs from the declared
default value.
ok_exposed_fs STATE_CHANGE INFO no Message: The file system {0} has no data exposure risk.
Cause: N/A
ok_replicated_fs STATE_CHANGE INFO no Message: The file system {0} is properly replicated.
Cause: N/A
ok_unbalanced_fs STATE_CHANGE INFO no Message: The file system {0} is properly balanced.
Cause: N/A
pool-data_high_error STATE_CHANGE ERROR no Message: The pool {id[1]} of file system {id[0]} has reached a nearly
exhausted data level. {0}.
User Action: Add more capacity to the pool, move data to a different pool,
or delete data and/or snapshots.
pool-data_high_warn STATE_CHANGE WARNING no Message: The pool {id[1]} of file system {id[0]} has reached a warning level
for data. {0}.
User Action: Add more capacity to the pool, move data to a different pool,
or delete data and/or snapshots.
pool-data_no_data STATE_CHANGE INFO no Message: No usage data for pool {id[1]} in file system {id[0]}.
Cause: N/A
pool-data_normal STATE_CHANGE INFO no Message: The pool {id[1]} of file system {id[0]} has reached a normal data
level.
pool-data_removed STATE_CHANGE INFO no Message: No usage data for pool {id[1]} in file system {id[0]}.
Cause: N/A
pool-metadata_high_error STATE_CHANGE ERROR no Message: The pool {id[1]} of file system {id[0]} has reached a nearly
exhausted metadata level. {0}.
Cause: N/A
User Action: Add more capacity to the pool, move data to a different pool,
or delete data and/or snapshots.
pool-metadata_high_warn STATE_CHANGE WARNING no Message: The pool {id[1]} of file system {id[0]} has reached a warning level
for metadata. {0}.
User Action: Add more capacity to the pool, move data to a different pool,
or delete data and/or snapshots.
pool-metadata_no_data STATE_CHANGE INFO no Message: No usage data for pool {id[1]} in file system {id[0]}.
Cause: N/A
pool-metadata_normal STATE_CHANGE INFO no Message: The pool {id[1]} of file system {id[0]} has reached a normal
metadata level.
Cause: N/A
pool-metadata_removed STATE_CHANGE INFO no Message: No usage data for pool {id[1]} in file system {id[0]}.
Cause: N/A
pool_high_error STATE_CHANGE ERROR no Message: The pool {id[1]} of file system {id[0]} has reached a nearly
exhausted level. {0}.
User Action: Add more capacity to the pool, move data to a different pool,
or delete data and/or snapshots.
pool_high_warn STATE_CHANGE WARNING no Message: The pool {id[1]} of file system {id[0]} reached a warning level. {0}.
User Action: Add more capacity to the pool, move data to a different pool,
or delete data and/or snapshots.
pool_no_data INFO INFO no Message: The state of pool {id[1]} in file system {id[0]} is unknown.
pool_normal STATE_CHANGE INFO no Message: The pool {id[1]} of file system {id[0]} has reached a normal level.
Cause: N/A
remote_exported_fs_unavail STATE_CHANGE ERROR no Message: The remote file system {0} that is used for exports is not
mounted.
Description: A remote file system that is used for export is not mounted.
User Action: Check the remote mount states and exports of the remote
cluster. To verify that all expected file systems are mounted, run the
mmlsmount all_local -L command.
unmounted_fs_check STATE_CHANGE WARNING no Message: The file system {0} is probably needed, but not mounted.
User Action: To verify that all expected file systems are mounted, run the
mmlsmount all_local -L command.
unmounted_fs_ok STATE_CHANGE INFO no Message: The file system {0} is probably needed, but not automounted or
automount is prevented.
Cause: N/A
filesystem_mgr INFO_ADD_ENTITY INFO no Message: File system {0} is managed by this node.
Cause: N/A
filesystem_no_mgr INFO_DELETE_ENTITY INFO no Message: File system {0} is no longer managed by this node.
managed_by_this_node STATE_CHANGE INFO no Message: The file system {0} is managed by this node.
qos_check_done INFO INFO no Message: The QOS check cycle completed.
qos_check_ok STATE_CHANGE INFO no Message: File system {0} QOS check is OK.
qos_check_warn INFO WARNING no Message: The QOS check cycle did not succeed.
User Action: This might be a temporary issue. Check the configuration and
the mmqos command output.
qos_not_active STATE_CHANGE WARNING no Message: File system {0} QOS is enabled but throttling or monitoring is not
active.
qos_sensors_clear TIP INFO no Message: Clear any previous bad QoS sensor state.
User Action: Enable the perfmon GPFSQoS sensor and set the period
attribute of the GPFSQoS sensor to a value greater than 0. To do so, use
the mmperfmon config update GPFSQoS.period=N command, where N
is a number greater than 0. Alternatively, you can hide this event by using
the mmhealth event hide qos_sensor_inactive command.
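A minimal sketch of the user action above; the period of 10 seconds is an
assumed example value:
  # enable QoS performance data collection with a 10-second period
  mmperfmon config update GPFSQoS.period=10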
qos_sensors_not_configured TIP TIP no Message: The QoS perfmon sensor GPFSQoS is not configured.
Description: The QoS perfmon sensor is not listed in the output of the
mmperfmon config show command.
qos_sensors_not_needed TIP TIP no Message: QoS is not configured, but the performance sensor GPFSQoS
period is declared.
qos_state_mismatch TIP TIP no Message: File system {0} has an enablement mismatch in QOS state.
Description: Mismatch between the declared QOS state and the current
state. One state is enabled and the other state is not enabled.
Cause: There is a mismatch between the declared QOS state and the
current state.
qos_version_mismatch TIP TIP no Message: File system {0} has a version mismatch in QOS state.
Cause: There is a mismatch between the declared QOS version and the
current version.
Description: The GDS test program gdscheck did not return the expected
success message.
User Action: Check the GDS configuration and settings for the
gdscheckfile parameter. For more information, see the 'gpudirect'
section of the '/var/mmfs/mmsysmon/mmsysmonitor.conf' system health
monitor configuration file.
Description: The GDS test program failed or ran into a timeout.
User Action: Check the GDS configuration and settings for the
gdscheckfile parameter. For more information, see the 'gpudirect'
section of the '/var/mmfs/mmsysmon/mmsysmonitor.conf' file.
gds_prerequisite_bad STATE_CHANGE ERROR no Message: The GDS prerequisite check for enabled RDMA failed.
GPFS events
The following table lists the events that are created for the GPFS component.
Table 85. Events for the GPFS component
callhome_enabled TIP INFO no Message: Call home is installed, configured, and enabled.
Description: By enabling the call home functionality, you provide useful
information to the developers, which helps to improve the product.
Cause: Call home packages are installed. Call home is configured and
enabled.
callhome_not_enabled TIP TIP no Message: Call home is not installed, configured, or enabled.
Cause: Call home packages are not installed, there is no call home
configuration, there are no call home groups, or no call home group was
enabled.
callhome_not_monitored TIP INFO no Message: Call home status is not monitored on the current node.
Description: Call home status is not monitored on the current node, but it
was monitored while this node was the cluster manager.
Cause: Previously, this node was the cluster manager, and call home
monitoring was running on it.
callhome_without_schedule TIP TIP no Message: Call home is enabled, but neither a daily nor a weekly schedule is
configured.
Description: Call home is enabled, but neither a daily nor a weekly
schedule is configured. It is recommended to enable daily or weekly call
home schedules.
Cause: Call home is enabled, but neither a daily nor a weekly schedule is
configured.
User Action: Enable daily call home uploads by using the mmcallhome
schedule add --task DAILY command.
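For illustration, the schedules mentioned in the user action can be enabled
as follows; the WEEKLY variant is shown as an additional option:
  # enable a daily call home upload schedule
  mmcallhome schedule add --task DAILY
  # optionally, enable a weekly schedule as well
  mmcallhome schedule add --task WEEKLY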
ccr_auth_keys_disabled STATE_CHANGE INFO no Message: The security file that is used by GPFS CCR is not checked on this
node.
Description: The check for the security file used by GPFS CCR is disabled
on this node, since it is not a quorum node.
Cause: N/A
ccr_auth_keys_fail STATE_CHANGE ERROR FTDC upload Message: The security file that is used by GPFS CCR is corrupt.
Item={0},ErrMsg={1},Failed={2}.
Description: The security file used by GPFS CCR is corrupt. For more
information, see message.
User Action: Recover this degraded node from a still intact node by using
the mmsdrrestore -p <NODE> command, where NODE specifies an intact
node. For more information, see the mmsdrrestore command in the
Command Reference Guide.
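A minimal sketch of the recovery that the user action describes, assuming
'node2' is a still intact quorum node:
  # restore the configuration of this degraded node from the intact node node2
  mmsdrrestore -p node2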
ccr_auth_keys_ok STATE_CHANGE INFO no Message: The security file that is used by GPFS CCR is OK {0}.
Cause: N/A
ccr_client_init_disabled STATE_CHANGE INFO no Message: GPFS CCR client initialization is not checked on this node.
Description: The check for GPFS CCR client initialization is disabled on this
node, since it is not a quorum node.
Cause: N/A
ccr_client_init_fail STATE_CHANGE ERROR no Message: GPFS CCR client initialization has failed.
Item={0},ErrMsg={1},Failed={2}.
Description: The GPFS CCR client initialization has failed. For more
information, see message.
Cause: The item specified in the message is either not available or corrupt.
User Action: Recover this degraded node from a still intact node by using
the mmsdrrestore -p <NODE> command, where NODE specifies an intact
node. For more information, see the mmsdrrestore command in the
Command Reference Guide.
ccr_client_init_warn STATE_CHANGE WARNING no Message: GPFS CCR client initialization has failed.
Item={0},ErrMsg={1},Failed={2}.
Description: The GPFS CCR client initialization has failed. For more
information, see message.
Cause: The item specified in the message is either not available or corrupt.
User Action: Recover this degraded node from a still intact node by using
the mmsdrrestore -p <NODE> command, where NODE specifies an intact
node. For more information, see the mmsdrrestore command in the
Command Reference Guide.
ccr_comm_dir_disabled STATE_CHANGE INFO no Message: The files that are committed to the GPFS CCR are not checked on
this node.
Description: The check for the files that are committed to the GPFS CCR is
disabled on this node, since it is not a quorum node.
Cause: N/A
ccr_comm_dir_fail STATE_CHANGE ERROR FTDC upload Message: The files committed to the GPFS CCR are not complete or
corrupt. Item={0},ErrMsg={1},Failed={2}.
Description: The files committed to the GPFS CCR are not complete or
corrupt. For more information, see message.
User Action: Check the local disk space and remove unnecessary
files. Recover this degraded node from a still intact node by using the
mmsdrrestore -p <NODE> command, where NODE specifies an intact node.
For more information, see the mmsdrrestore command in the Command
Reference Guide.
ccr_comm_dir_ok STATE_CHANGE INFO no Message: The files committed to the GPFS CCR are complete and intact
{0}.
Description: The files committed to the GPFS CCR are complete and intact.
Cause: N/A
ccr_comm_dir_warn STATE_CHANGE WARNING no Message: The files that are committed to the GPFS CCR are not complete
or corrupt. Item={0},ErrMsg={1},Failed={2}.
Description: The files that are committed to the GPFS CCR are not
complete or corrupt. For more information, see message.
User Action: Check the local disk space and remove unnecessary
files. Recover this degraded node from a still intact node by using the
mmsdrrestore -p <NODE> command, where NODE specifies an intact node.
For more information, see the mmsdrrestore command in the Command
Reference Guide.
ccr_ip_lookup_disabled STATE_CHANGE INFO no Message: The IP address lookup for the GPFS CCR component is not
checked on this node.
Description: The check for the IP address lookup for the GPFS CCR
component is disabled on this node, since it is not a quorum node.
Cause: N/A
ccr_ip_lookup_ok STATE_CHANGE INFO no Message: The IP address lookup for the GPFS CCR component is OK {0}.
Description: The IP address lookup for the GPFS CCR component is OK.
Cause: N/A
ccr_ip_lookup_warn STATE_CHANGE WARNING no Message: The IP address lookup for the GPFS CCR component takes too
long. Item={0},ErrMsg={1},Failed={2}.
Description: The IP address lookup for the GPFS CCR component takes too
long, resulting in slow administration commands. For more information, see
message.
ccr_local_server_disabled STATE_CHANGE INFO no Message: The local GPFS CCR server is not checked on this node.
Description: The check for the local GPFS CCR server is disabled on this
node, since it is not a quorum node.
Cause: N/A
ccr_local_server_ok STATE_CHANGE INFO no Message: The local GPFS CCR server is reachable {0}.
Cause: N/A
ccr_local_server_warn STATE_CHANGE WARNING no Message: The local GPFS CCR server is not reachable.
Item={0},ErrMsg={1},Failed={2}.
Description: The local GPFS CCR server is not reachable. For more
information, see message.
Cause: Either the local network or firewall is configured incorrectly, or the
local GPFS daemon does not respond.
User Action: Check the network and firewall configuration with regard to
the GPFS communication port that is used (default: 1191). Restart GPFS on
this node.
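As a hedged illustration of the network check, a generic TCP probe can
confirm that the GPFS communication port is reachable; 'node1' is a
hypothetical node name and the nc tool is assumed to be installed:
  # test whether TCP port 1191 on node1 accepts connections
  nc -vz node1 1191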
ccr_paxos_12_disabled STATE_CHANGE INFO no Message: The stored GPFS CCR state is not checked on this node.
Description: The check for the stored GPFS CCR state is disabled on this
node, since it is not a quorum node.
Cause: N/A
ccr_paxos_12_fail STATE_CHANGE ERROR FTDC upload Message: The stored GPFS CCR state is corrupt.
Item={0},ErrMsg={1},Failed={2}.
Description: The stored GPFS CCR state is corrupt. For more information,
see message.
Cause: The CCR on quorum nodes has inconsistent states. Use the mmccr
check -e command to check the detailed status.
User Action: Recover this degraded node from a still intact node by using
the mmsdrrestore -p <NODE> command, where NODE specifies an intact
node. For more information, see the mmsdrrestore command in the
Command Reference Guide.
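For illustration, the detailed CCR status mentioned in the cause above can
be inspected before recovering the node:
  # show the detailed CCR check status on this node
  mmccr check -e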
ccr_paxos_12_ok STATE_CHANGE INFO no Message: The stored GPFS CCR state is OK {0}.
Cause: N/A
ccr_paxos_12_warn STATE_CHANGE WARNING no Message: The stored GPFS CCR state is corrupt.
Item={0},ErrMsg={1},Failed={2}.
Description: The stored GPFS CCR state is corrupt. For more information,
see message.
ccr_paxos_cached_disabled STATE_CHANGE INFO no Message: The stored GPFS CCR state is not checked on this node.
Description: The check for the stored GPFS CCR state is disabled on this
node, since it is not a quorum node.
Cause: N/A
ccr_paxos_cached_fail STATE_CHANGE ERROR no Message: The stored GPFS CCR state is corrupt.
Item={0},ErrMsg={1},Failed={2}.
Description: The stored GPFS CCR state is corrupt. For more information,
see message.
Cause: The stored GPFS CCR state file is either corrupt or empty.
User Action: Recover this degraded node from a still intact node by using
the mmsdrrestore -p <NODE> command, where NODE specifies an intact
node. For more information, see the mmsdrrestore command in the
Command Reference Guide.
ccr_paxos_cached_ok STATE_CHANGE INFO no Message: The stored GPFS CCR state is OK {0}.
Cause: N/A
ccr_quorum_nodes_disabled STATE_CHANGE INFO no Message: The quorum nodes reachability is not checked on this node.
Cause: N/A
ccr_quorum_nodes_fail STATE_CHANGE ERROR no Message: A majority of the quorum nodes are not reachable over the
management network. Item={0},ErrMsg={1},Failed={2}.
Description: A majority of the quorum nodes are not reachable over the
management network. GPFS declares quorum loss. For more information,
see message.
Cause: The quorum nodes cannot communicate with each other because of
a network or firewall misconfiguration.
User Action: Check the network or firewall configuration (default port 1191
must not be blocked) of the unreachable quorum nodes.
ccr_quorum_nodes_ok STATE_CHANGE INFO no Message: All quorum nodes are reachable {0}.
Cause: N/A
User Action: Check the network or firewall configuration (default port 1191
must not be blocked) of the unreachable quorum node.
ccr_tiebreaker_dsk_disabled STATE_CHANGE INFO no Message: The accessibility of the tiebreaker disks that are used by the
GPFS CCR is not checked on this node.
Description: The accessibility check for the tiebreaker disks that are used
by the GPFS CCR is disabled on this node, since it is not a quorum node.
Cause: N/A
ccr_tiebreaker_dsk_ok STATE_CHANGE INFO no Message: All tiebreaker disks that are used by the GPFS CCR are accessible
{0}.
Description: All tiebreaker disks that are used by the GPFS CCR are
accessible.
Cause: N/A
ccr_tiebreaker_dsk_warn STATE_CHANGE WARNING no Message: At least one tiebreaker disk is not accessible.
Item={0},ErrMsg={1},Failed={2}.
cluster_connections_bad STATE_CHANGE WARNING no Message: Connection to cluster node {0} has {1} bad connection(s).
(Maximum {2}).
Description: The cluster internal network to a node is in a bad state. Not all
possible connections work.
Cause: The cluster internal network to a node is in a bad state. Not all
possible connections work.
User Action: Check whether the cluster network is good. The event
can be manually cleared by using the mmhealth event resolve
cluster_connections_bad command.
cluster_connections_clear STATE_CHANGE INFO no Message: Cleared all cluster internal connection states.
Cause: N/A
cluster_connections_down STATE_CHANGE WARNING no Message: Connection to cluster node {0} has all {1} connection(s) down.
(Maximum {2}).
Cause: The cluster internal network to a node is in a bad state. All possible
connections are down.
User Action: Check whether the cluster network is good. The event
can be manually cleared by using the mmhealth event resolve
cluster_connections_down command.
cluster_connections_ok STATE_CHANGE INFO no Message: All connections are good for target ip {0}.
Cause: N/A
csm_resync_forced STATE_CHANGE_EXTERNAL INFO no Message: All events and state are transferred to the cluster manager.
Description: All events and state are transferred to the cluster manager.
csm_resync_needed STATE_CHANGE_EXTERNAL WARNING no Message: Forwarding of an event to the cluster manager failed multiple
times.
User Action: Check state and connection of the cluster manager node.
Then, run the mmhealth node show --resync command.
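A short sketch of the user action above, run on the affected node:
  # check the health state of this node, then resend all events and states
  mmhealth node show
  mmhealth node show --resync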
deadlock_detected STATE_CHANGE WARNING no Message: The cluster detected a file system deadlock in the IBM Storage
Scale file system.
Description: The cluster detected a deadlock in the IBM Storage Scale file
system.
disk_call_home INFO_EXTERNAL ERROR service ticket Message: Disk requires replacement: event:{0}, eventName:{1}, rgName:
{2}, daName:{3}, pdName:{4}, pdLocation:{5}, pdFru:{6}, rgErr:{7},
rgReason:{8}.
disk_call_home2 INFO_EXTERNAL ERROR service ticket Message: Disk requires replacement: event:{0}, eventName:{1}, rgName:
{2}, daName:{3}, pdName:{4}, pdLocation:{5}, pdFru:{6}, rgErr:{7},
rgReason:{8}.
ess_ptf_update_available TIP TIP no Message: For the currently installed IBM ESS packages, the PTF update {0}
PTF {1} is available.
Description: For the currently installed IBM ESS packages a PTF update is
available.
Cause: PTF updates are available for the currently installed 'gpfs.ess.tools'
package.
User Action: Visit IBM Fix Central to download and install the updates.
User Action: Use the mmhealth event list hidden command to see all
hidden events. Use the mmhealth event unhide command to show the
event again.
event_test_info INFO INFO no Message: Test info event that is received from GPFS daemon. Arg0:{0}
Arg1:{1}.
Cause: N/A
User Action: To raise this test event, run the mmfsadm test
raiseRASEvent 0 arg1txt arg2txt command. The event shows up in
the event log. For more information, see the mmhealth node eventlog
command.
event_test_ok STATE_CHANGE INFO no Message: Test OK event that is received from GPFS daemon for entity: {id}
Arg0:{0} Arg1:{1}.
Cause: N/A
event_test_statechange STATE_CHANGE WARNING no Message: Test State-Change event that is received from GPFS daemon for
entity: {id} Arg0:{0} Arg1:{1}.
User Action: To raise this test event, run the mmfsadm test
raiseRASEvent 1 id arg1txt arg2txt command. The event
changes the GPFS state to DEGRADED. For more information, see the
mmhealth node show command. Raise the 'event_test_ok' event to
change state back to HEALTHY.
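For illustration, the test cycle that the user action describes can be driven
as follows; 'myentity' and the argument texts are hypothetical values:
  # raise a test state-change event, which sets the component to DEGRADED
  mmfsadm test raiseRASEvent 1 myentity arg1txt arg2txt
  # inspect the resulting state; raising 'event_test_ok' returns it to HEALTHY
  mmhealth node show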
Description: An event was unhidden, which means that the event affects
its component's state when it is active. Furthermore, it is shown in the
event table of the mmhealth node show ComponentName command
without the '--verbose' flag.
User Action: If this is an active TIP event, then fix it or hide it by using the
mmhealth event hide command.
gpfs_cache_cfg_high TIP TIP no Message: The GPFS cache settings may be too high for the installed total
memory.
Cause: The configured cache settings are close to the total memory.
The settings for pagepool, maxStatCache, and maxFilesToCache, in total,
exceed the recommended value, which is 90% by default.
User Action: For more information on maxStatCache size, see the 'Cache
usage' section in the Administration Guide. Check whether there is enough
memory available.
gpfs_cache_cfg_ok TIP INFO no Message: The GPFS cache memory configuration is OK.
Description: The GPFS cache memory configuration is OK. The values for
maxFilesToCache, maxStatCache, and pagepool fit to the amount of total
memory and configured services.
gpfs_deadlock_detection_ok TIP INFO no Message: The GPFS deadlockDetectionThreshold is greater than zero.
gpfs_down STATE_CHANGE ERROR no Message: The IBM Storage Scale service process is not running on this
node. Normal operation cannot be done.
Description: The IBM Storage Scale service is not running. This can be an
expected state when the IBM Storage Scale service is shut down.
User Action: Check the state of the IBM Storage Scale file system daemon,
and check for the root cause in the /var/adm/ras/mmfs.log.latest
log.
gpfs_maxfilestocache_ok TIP INFO no Message: The GPFS maxFilesToCache is greater than 100,000.
gpfs_maxfilestocache_small TIP TIP no Message: The GPFS maxFilesToCache is smaller than or equal to 100,000.
gpfs_maxstatcache_high TIP TIP no Message: The GPFS maxStatCache is greater than 0 on a Linux system.
gpfs_maxstatcache_low TIP TIP no Message: The GPFS maxStatCache is smaller than the maxFilesToCache
setting.
User Action: For more information on the maxStatCache size, see the
'Cache usage' section in the Administration Guide. If the current
setting fits your needs, hide the event either by using the GUI or the
mmhealth event hide command. The maxStatCache can be changed
by using the mmchconfig command. Consider that the actively used
configuration is monitored. You can list the actively used configuration by
using the mmdiag --config command, which includes changes that are
not activated as yet.
gpfs_maxstatcache_ok TIP INFO no Message: The GPFS maxStatCache is set to default or at least to the
maxFilesToCache value.
gpfs_pagepool_ok TIP INFO no Message: The GPFS pagepool is greater than 1G.
Description: The GPFS pagepool is greater than 1G. Consider that the
actively used configuration is monitored. You can see the actively used
configuration by using the mmdiag --config command.
gpfs_pagepool_small TIP TIP no Message: The GPFS pagepool is less than or equal to 1G.
User Action: For more information on the pagepool size, see the 'Cache
usage' section in the Administration Guide. Although the pagepool should
be greater than 1G, there are situations in which the administrator decides
against a pagepool that is greater than 1G. In this case, or if the current
setting fits your needs, hide the event by using the GUI or the mmhealth
event hide command. The pagepool can be changed by using the
mmchconfig command. The 'gpfs_pagepool_small' event automatically
disappears as soon as a new pagepool value that is greater than 1G is
active. Use the -i flag of the mmchconfig command, or restart GPFS if
required. For more information, see the mmchconfig command in the
Command Reference Guide. Consider that the actively used configuration is
monitored. You can list the actively used configuration by using the mmdiag
--config command, which includes changes that are not activated as yet.
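A hedged example of the change that the user action describes; the 2G
value is an assumption chosen for illustration and must fit the available
memory:
  # raise the pagepool to 2G and activate the change immediately where possible
  mmchconfig pagepool=2G -i
  # confirm the actively used value
  mmdiag --config | grep pagepool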
gpfs_unresponsive STATE_CHANGE ERROR no Message: The IBM Storage Scale service process is unresponsive on this
node. Normal operation cannot be done.
User Action: Check the state of the IBM Storage Scale file system daemon,
and check for the root cause in the /var/adm/ras/mmfs.log.latest
log.
gpfs_up STATE_CHANGE INFO no Message: The IBM Storage Scale service process is running.
Cause: N/A
gpfs_warn INFO WARNING no Message: The IBM Storage Scale process monitoring returned unknown
result. This can be a temporary issue.
Description: The check of the IBM Storage Scale file system daemon
returned an unknown result. This can be a temporary issue, like a timeout
during the check procedure.
Cause: The IBM Storage Scale file system daemon state cannot be
determined due to a problem.
gpfsport_access_down STATE_CHANGE ERROR no Message: No access to IBM Storage Scale ip {0} port {1}. Check the firewall
settings.
Description: The access check of the local IBM Storage Scale file system
daemon port has failed.
User Action: Check whether the IBM Storage Scale file system daemon is
running and check the firewall for blocking rules on this port.
gpfsport_access_up STATE_CHANGE INFO no Message: Access to IBM Storage Scale ip {0} port {1} is OK.
Description: The TCP access check of the local IBM Storage Scale file
system daemon port was successful.
Cause: N/A
gpfsport_access_warn INFO WARNING no Message: IBM Storage Scale access check ip {0} port {1} failed. Check for a
valid IBM Storage Scale IP.
Description: The access check of the IBM Storage Scale file system
daemon port has returned an unknown result.
Cause: The IBM Storage Scale file system daemon port access cannot be
determined due to a problem.
User Action: Find potential issues for this kind of failure in the logs.
gpfsport_down STATE_CHANGE ERROR no Message: IBM Storage Scale port {0} is not active.
Description: The expected local IBM Storage Scale file system daemon
port was not detected.
Cause: The IBM Storage Scale file system daemon is not running.
User Action: Check whether the IBM Storage Scale service is running.
gpfsport_up STATE_CHANGE INFO no Message: IBM Storage Scale port {0} is active.
Description: The expected local IBM Storage Scale file system daemon
port was detected.
Cause: N/A
gpfsport_warn INFO WARNING no Message: IBM Storage Scale monitoring ip {0} port {1} has returned an
unknown result.
Description: The check of the IBM Storage Scale file system daemon port
has returned an unknown result.
Cause: The IBM Storage Scale file system daemon port cannot be
determined due to a problem.
User Action: Find potential issues for this kind of failure in the logs.
info_on_duplicate_events INFO INFO no Message: The event {0} {id} was repeated {1} times.
kernel_io_hang_detected STATE_CHANGE ERROR no Message: A kernel IO hang has been detected on disk {0} affecting file
system {1}.
Cause: N/A
User Action: Check the underlying storage system and reboot the node to
resolve the current hang condition.
kernel_io_hang_resolved STATE_CHANGE INFO no Message: A kernel IO hang on disk {id} has been resolved.
Cause: N/A
local_fs_filled STATE_CHANGE WARNING no Message: The local file system with the mount point {1} used for {0}
reached a warning level with less than 1000 MB, but more than 100 MB, of
free space.
Cause: The local file system reached a warning level of under 1000 MB.
User Action: Detect large files on the local file system by using the 'du -cks *
| sort -rn | head -11' command, and delete or move data to free space.
local_fs_full STATE_CHANGE ERROR no Message: The local file system with the mount point {1} used for {0}
reached a nearly exhausted level, which is less than 100 MB free space.
Cause: The local file system has reached a nearly exhausted level, which
is less than 100 MB of free space.
User Action: Detect large files on the local file system by using the 'du -cks *
| sort -rn | head -11' command, and delete or move data to free space.
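For illustration, the command from the user action, run inside the affected
mount point, lists the eleven largest entries so that deletion or relocation
candidates can be identified:
  # report the largest files and directories under the current directory, in KB
  du -cks * | sort -rn | head -11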
local_fs_normal STATE_CHANGE INFO no Message: The local file system with the mount point {1} used for {0}
reached a normal level with more than 1000 MB free space.
Cause: N/A
local_fs_path_not_found STATE_CHANGE INFO no Message: The configured dataStructureDump path {0} does not exist.
Monitoring is skipped.
Cause: N/A
local_fs_unknown INFO WARNING no Message: The fill level of local file systems is unknown because of an
unexpected output of the df command. Return Code: {0} Error: {1}.
Cause: Cannot determine the fill states of the local file systems, which may
be caused by a nonzero return code from the df command or an
unexpected output format.
User Action: Check whether the df command exists on the node, and
whether the df command has timing issues or runs into a timeout.
longwaiters_found STATE_CHANGE ERROR no Message: Detected IBM Storage Scale longwaiter threads.
Description: Longwaiter threads are found in the IBM Storage Scale file
system.
User Action: Check log files and the output of the mmdiag --waiters
command to identify the root cause. This can also be due to a temporary
issue.
longwaiters_warn INFO WARNING no Message: IBM Storage Scale longwaiters monitoring has returned an
unknown result.
Cause: The IBM Storage Scale file system longwaiters check cannot be
determined due to a problem.
User Action: Find potential issues for this kind of failure in the logs.
mmfsd_abort_clear STATE_CHANGE INFO no Message: Resolve event for IBM Storage Scale issue signal.
Cause: N/A
mmfsd_abort_warn STATE_CHANGE WARNING FTDC upload Message: IBM Storage Scale reported an issue {0}.
Description: The mmfsd daemon process may have terminated
abnormally.
Cause: IBM Storage Scale signaled an issue. The mmfsd daemon process
might have terminated abnormally.
monitor_started INFO INFO no Message: The IBM Storage Scale monitoring service has been started.
Description: The IBM Storage Scale monitoring service has been started
and is actively monitoring the system components.
Cause: N/A
User Action: Use the mmhealth command to query the monitoring status.
no_longwaiters_found STATE_CHANGE INFO no Message: No IBM Storage Scale longwaiters are found.
Description: No longwaiter threads are found in the IBM Storage Scale file
system.
Cause: N/A
node_call_home INFO_EXTERNAL ERROR service ticket Message: OPAL logs reported a problem: event:{0}, eventId:{1}, myNode:
{2}.
Description: OPAL logs reported a problem via callhomemon.sh, which
requires IBM support attention.
node_call_home2 INFO_EXTERNAL ERROR service ticket Message: OPAL logs reported a problem: event:{0}, eventId:{1}, myNode:
{2}.
Description: OPAL logs reported a problem via callhomemon.sh, which
requires IBM support attention.
nodeleave_info INFO_EXTERNAL INFO no Message: A CES node left the cluster: Node {0}.
Cause: A CES node left the cluster. The name of the node that is leaving the
cluster is provided.
nodestatechange_info INFO_EXTERNAL INFO no Message: A CES node state change: Node {0} {1} {2} flag.
Cause: A node state change was detected. Details are shown in the
message.
numactl_not_installed TIP TIP no Message: The numactl tool is not found, but needs to be installed.
User Action: Install the required numactl tool. For example, run the 'yum
install numactl' command on RHEL. If numactl is not available for your
operating system, disable the numaMemoryInterleave setting on this
node.
Cause: N/A
out_of_memory STATE_CHANGE WARNING no Message: Detected out-of-memory killer conditions in system log.
Cause: The dmesg command returned log entries, which are written by the
OOM killer.
User Action: Check the memory usage on the node. Identify the reason
for the out-of-memory condition and check the system log to find out
which processes have been killed by OOM killer. You might need to recover
these processes manually or reboot the system to get to a clean state. Run
the mmhealth event resolve out_of_memory command once you have
recovered the system to remove this warning event from the mmhealth
command.
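A minimal sketch of the recovery flow, assuming the grep pattern matches
the usual OOM killer log lines on your distribution:
  # find out which processes the OOM killer terminated
  dmesg | grep -i 'killed process'
  # after recovering the affected processes, clear the warning
  mmhealth event resolve out_of_memory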
passthrough_query_hang STATE_CHANGE ERROR no Message: An SCSI pass-through query request hang has been detected on
disk {0} affecting file system {1}. Reason: {2}.
Cause: N/A
User Action: Check the underlying storage system and reboot the node to
resolve the current hang condition.
quorum_down STATE_CHANGE ERROR no Message: The node is not able to reach enough quorum nodes/disks to
work properly.
Cause: The node is trying to form a quorum with the other available nodes.
The cluster service may not be running or the communication with other
nodes is faulty.
User Action: Check whether the cluster service is running and other
quorum nodes can be reached over the network. Check the local firewall
settings.
quorum_even_nodes_no_tiebreaker STATE_CHANGE TIP no Message: No tiebreaker disk is defined with an even number of quorum
nodes.
quorum_ok STATE_CHANGE INFO no Message: The quorum configuration corresponds to the best practices.
Cause: N/A
quorum_too_little_nodes TIP TIP no Message: An odd number of at least 3 quorum nodes is recommended.
quorum_two_tiebreaker_count STATE_CHANGE TIP no Message: Change number of tiebreaker disks to an odd number.
Description: Number of tiebreaker disks is two.
Cause: N/A
quorum_warn INFO WARNING no Message: The IBM Storage Scale quorum monitor cannot be executed. This
can be a timeout issue.
quorumloss INFO_EXTERNAL WARNING no Message: The cluster has detected a quorum loss.
Cause: The number of required quorum nodes does not match the
minimum requirements. This can be an expected situation.
User Action: Ensure the required cluster quorum nodes are up and
running.
Description: Reconnect failed, which may be due to a network error. Check
for a network error.
Cause: N/A
rpc_waiters STATE_CHANGE WARNING no Message: Pending RPC messages were found for the nodes: {0}.
User Action: If nodes do not respond to pending RPC messages, you might
need to expel the nodes by using the mmexpelnode -N <ip> command.
rpc_waiters_expel INFO WARNING no Message: A request to expel the node {id} was sent to the cluster node {1}
because of pending RPC messages.
User Action: Verify the logs in the expelled node to find the reason for
the pending RPC messages. For example, node resources, such as memory,
might be exhausted. Use the mmexpelnode -r -N <ip> command to
allow the node to join the cluster again.
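For illustration, the expelled node can be readmitted once the cause of the
pending RPC messages is fixed; '10.0.0.12' is a hypothetical node IP:
  # allow the previously expelled node to join the cluster again
  mmexpelnode -r -N 10.0.0.12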
scale_ptf_update_available TIP TIP no Message: For the currently installed IBM Storage Scale packages, the PTF
update {0} PTF {1} is available.
Description: For the currently installed IBM Storage Scale packages, a PTF
update is available.
Cause: PTF updates are available for the currently installed 'gpfs.base'
package.
User Action: Visit IBM Fix Central to download and install the updates.
scale_up_to_date STATE_CHANGE INFO no Message: The last software update check showed no available updates.
Cause: N/A
scale_updatecheck_disabled STATE_CHANGE INFO no Message: The IBM Storage Scale software update check feature is
disabled.
Cause: N/A
Description: The CES shared root file system's ACLs are different from the
default in CCR. If these ACLs prohibit read access of rpc.statd, then NFS
does not work correctly.
Cause: The CES framework detected that the ACLs of the CES shared root
file system are different from the default in CCR.
User Action: Verify that the user assigned to rpc.statd (such as, rpcuser)
has read access to the CES shared root file system.
Description: The CES shared root file system's ACLs are the default. These
ACLs give read access to rpc.statd when default GPFS user settings are
used.
Cause: N/A
Description: The CES shared root file system is bad or not available. This
file system is required to run the cluster because it stores cluster wide
information. This problem triggers a failover.
Cause: The CES framework detected that the CES shared root file system is
unavailable on the node.
User Action: Check whether the CES shared root file system and other
expected IBM Storage Scale file systems are mounted properly.
Description: The CES shared root file system is available. This file system is
required to run the cluster because it stores cluster wide information.
Cause: N/A
test_call_home INFO_EXTERNAL ERROR service ticket Message: A test call home ticket is created.
Description: A test call home ticket is created.
Cause: ESS tooling triggered a test call home to verify that tickets can be
created from this system.
total_memory_small TIP TIP no Message: The total memory is less than the recommended value.
Description: The total memory is less than the recommended value when
CES protocol services are enabled.
Cause: The total memory is less than the recommendation for the currently
enabled services, which is 128 GB if SMB is enabled, or 64 GB each for NFS
and Object.
waitfor_verbsport INFO_EXTERNAL INFO no Message: Waiting for verbs ports to become active.
Cause: N/A
waitfor_verbsport_done INFO_EXTERNAL INFO no Message: Waiting for verbs ports is done {0}.
Cause: N/A
waitfor_verbsport_failed INFO_EXTERNAL ERROR no Message: Failed to start up because some IB ports or Ethernet devices that
are listed in verbsPorts are inactive: {0}.
User Action: Check the IB ports and Ethernet devices that are listed in the
verbsPorts configuration. Increase verbsPortsWaitTimeout or enable the
verbsRdmaFailBackTCPIfNotAvailable configuration.
GUI events
The following table lists the events that are created for the GUI component.
Table 86. Events for the GUI component
bmc_connection_error STATE_CHANGE ERROR no Message: Unable to connect to BMC of POWER server {0} because an
error occurred when running the '/opt/ibm/ess/tools/bin/esshwinvmon.py
-t check -n {1}' command.
Description: The GUI checks the connection to the BMC of the POWER
server.
Cause: The GUI cannot query the BMC of the POWER server because of an
error that occurred in the 'esshwinvmon.py' script.
bmc_connection_failed STATE_CHANGE ERROR no Message: Unable to connect to BMC of POWER server {0}.
Description: The GUI checks the connection to the BMC of the POWER
server.
Cause: The GUI cannot connect to the BMC of the POWER server.
User Action: Check whether the BMC IPs and passwords are correctly
defined in the '/opt/ibm/ess/tools/conf/hosts.yml' configuration file on the
GUI node. Run the '/opt/ibm/ess/tools/bin/esshwinvmon.py -t check -n
[node_name]' command to check the connection to the BMC.
bmc_connection_ok STATE_CHANGE INFO no Message: The connection to the BMC of POWER server {0} is OK.
Description: The GUI checks the connection to the BMC of the POWER
server.
Cause: The GUI can communicate to the BMC of the POWER server
successfully.
bmc_connection_unconfigured STATE_CHANGE ERROR no Message: Unable to query health state of POWER server {0} from the BMC.
The '/opt/ibm/ess/tools/conf/hosts.yml' configuration file does not contain
a section for node {1}.
Description: The GUI checks the connection to the BMC of the POWER
server.
Cause: The GUI cannot connect to the BMC of the POWER server because
of a misconfiguration.
User Action: Add a section for the specified node to the '/opt/ibm/ess/
tools/conf/hosts.yml' configuration file on the GUI node. Run the
'/opt/ibm/ess/tools/bin/esshwinvmon.py -t check -n [node_name]'
command to check the connection to the BMC.
gui_cluster_down STATE_CHANGE ERROR no Message: The GUI detected that the cluster is down.
User Action: Check for the reason that caused the cluster to lose quorum.
gui_cluster_state_unknown STATE_CHANGE WARNING no Message: The GUI cannot determine the cluster state.
Cause: The GUI cannot determine whether enough quorum nodes are up
and running.
gui_cluster_up STATE_CHANGE INFO no Message: The GUI detected that the cluster is up and running.
gui_config_cluster_id_mismatch STATE_CHANGE ERROR no Message: The cluster ID of the current cluster '{0}', and the cluster ID in the
database do not match ('{1}'). It seems that the cluster was re-created.
Cause: N/A
User Action: Clear the old cluster information from the GUI database by
dropping the 'fscc' schema with the psql postgres postgres -c 'drop
schema fscc cascade' command. Then, restart the GUI by using the
systemctl restart gpfsgui command.
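A hedged sketch of the user action above, run on the GUI node; it assumes
the GUI database runs in the local PostgreSQL instance under the default
'postgres' user:
  # drop the stale GUI schema that holds the old cluster information
  psql postgres postgres -c 'drop schema fscc cascade'
  # restart the GUI so that it repopulates its database
  systemctl restart gpfsgui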
gui_config_cluster_id_ok STATE_CHANGE INFO no Message: The cluster ID of the current cluster '{0}' matches the cluster ID
in the database.
Cause: N/A
gui_config_command_audit_off_cluster STATE_CHANGE WARNING no Message: Command Audit is turned off at the cluster level.
Description: Command Audit is turned off at the cluster level. This
configuration leads to lags in the refresh of data displayed in the GUI.
gui_config_command_audit_off_nodes STATE_CHANGE WARNING no Message: Command Audit is turned off on the following nodes: {1}.
Description: Command Audit is turned off on some nodes. This
configuration leads to lags in the refresh of data that is displayed in the
GUI.
gui_config_command_audit_ok STATE_CHANGE INFO no Message: Command Audit is turned on at the cluster level.
Description: Command Audit is turned on at the cluster level. This way
the GUI refreshes the data that it displays automatically when IBM Storage
Scale commands are run by using the CLI on other nodes in the cluster.
Cause: N/A
gui_config_sudoers_error STATE_CHANGE ERROR no Message: There is a problem with the '/etc/sudoers' configuration. The
secure_path of the IBM Storage Scale management user 'scalemgmt' is not
correct. Current value: {0} / Expected value: {1}.
Cause: N/A
User Action: Events that are marked as read are now displayed as unread.
Mark all notices as read if they are no longer valid after the cluster change.
gui_database_cleared_downgrade INFO WARNING no Message: The GUI version read from the database ({0}) is later than the
GUI code version ({1}).
Cause: The GUI version that is stored in the database is greater than the
GUI code version.
User Action: Events that are marked as read are now displayed as unread.
Mark all notices as read if they are no longer valid after the GUI is moved to
an older version.
gui_database_dropped INFO_EXTERNAL WARNING no Message: The database version ({0}) does not match the PostgreSQL
version ({1}).
User Action: Events that are marked as read are now displayed as unread.
Mark all notices as read if they are no longer valid after the upgrade.
gui_db_ok STATE_CHANGE INFO no Message: The GUI reported correct connection to postgres database in the
cluster.
gui_db_warn STATE_CHANGE WARNING no Message: The GUI reported incorrect connection to postgres database.
User Action: Check whether the postgres container works properly in the GUI pod.
gui_down STATE_CHANGE ERROR no Message: The GUI service should be {0}, but it is {1}. If there are no
other GUI nodes up and running, then no snapshots are created and email
notifications are not sent anymore.
Cause: The GUI service is not running on this node, although it has the
'GUI_MGMT_SERVER_NODE' node class.
User Action: Restart the GUI service or change the node class for this
node.
gui_email_server_unreachable STATE_CHANGE ERROR no Message: The email server {0} is unreachable {1}.
Description: The specified email server does not respond to any messages.
gui_external_authentication_failed INFO ERROR no Message: The GUI cannot connect to the external LDAP or AD server: {0}.
Description: The GUI cannot connect to one or more of the specified LDAP
or AD servers.
User Action: Verify that the configured LDAP or AD servers are up and
running and reachable from the GUI node.
gui_login_attempt_failed INFO_EXTERNAL WARNING no Message: A login attempt failed for the user {0} from the source IP address
{1}.
gui_mount_allowed_on_gui_node STATE_CHANGE INFO no Message: Mount operation is allowed for all file systems on the GUI node.
Description: Mount operation is allowed for all file systems on the GUI
node.
Cause: Mount operation is allowed for all file systems on the GUI node.
gui_mount_prevented_on_gui_node STATE_CHANGE WARNING no Message: Mount operation is prevented for {1} file systems on the GUI
node {0}.
Cause: Mount operation is prevented for specific file systems on the GUI
node.
User Action: Run the fix procedure or go to the file system panel, and allow
the mount operation for the mentioned file systems on the GUI node.
gui_node_update_successful STATE_CHANGE INFO no Message: GUI node class got updated successfully.
gui_out_of_memory INFO ERROR no Message: The GUI reported an internal out-of-memory state. Restart the
GUI.
Cause: The Java virtual machine of the GUI reported an internal out-of-
memory state.
gui_pmcollector_connection_failed STATE_CHANGE ERROR no Message: The GUI cannot connect to the pmcollector that is running on {0}
using port {1}.
User Action: Check whether the pmcollector service is running and verify
the firewall or network settings. If the problem still persists, then check
whether the GUI node is specified for the 'colCandidates' attribute in the
mmperfmon config show command.
gui_pmcollector_connection_ok STATE_CHANGE INFO no Message: The GUI can connect to the pmcollector that is running on {0}
using port {1}.
gui_pmsensors_connection_failed STATE_CHANGE ERROR no Message: The performance monitoring sensor service 'pmsensors' on node
{0} is not sending any data.
Description: The GUI checks whether data can be retrieved from the
pmcollector service for this node.
gui_pmsensors_connection_ok STATE_CHANGE INFO no Message: The state of performance monitoring sensor service 'pmsensor'
on node {0} is OK.
Description: The GUI checks whether data can be retrieved from the
pmcollector service for this node.
gui_quorum_ok STATE_CHANGE INFO no Message: The GUI reported correct quorum in the cluster.
gui_quorum_warn STATE_CHANGE WARNING no Message: The GUI reported quorum loss in the cluster.
gui_reachable_node STATE_CHANGE INFO no Message: The GUI can reach the node {0}.
gui_refresh_task_failed STATE_CHANGE WARNING no Message: The following GUI refresh task(s) failed: {0}.
Description: One or more GUI refresh tasks failed, which means that data
in the GUI is outdated.
gui_refresh_task_successful STATE_CHANGE INFO no Message: All GUI refresh tasks are running fine.
Cause: N/A
gui_response_ok STATE_CHANGE INFO no Message: The GUI is responsive to the test query.
gui_response_warn STATE_CHANGE WARNING no Message: The GUI is unresponsive to the test query.
Cause: The GUI did not respond with the expected data for the test query
(debug platform).
gui_snap_create_failed_fs INFO ERROR no Message: A snapshot creation invoked by rule {1} failed on file system {0}.
Description: The snapshot was not created according to the specified rule.
gui_snap_create_failed_fset INFO ERROR no Message: A snapshot creation that is invoked by rule {1} failed on file
system {2}, fileset {0}.
Description: The snapshot was not created according to the specified rule.
gui_snap_delete_failed_fs INFO ERROR no Message: A snapshot deletion that is invoked by rule {1} failed on file
system {0}.
Description: The snapshot was not deleted according to the specified rule.
gui_snap_delete_failed_fset INFO ERROR no Message: A snapshot deletion that is invoked by rule {1} failed on file
system {2}, fileset {0}.
Description: The snapshot was not deleted according to the specified rule.
gui_snap_rule_ops_exceeded INFO WARNING no Message: The number of pending operations exceeds {1} operations for
rule {2}.
gui_snap_running INFO WARNING no Message: Operations for rule {1} are still running at the start of the next
management of rule {1}.
Description: Operations for a rule are still running at the start of the next
management of that rule.
gui_snap_time_limit_exceeded_fs INFO WARNING no Message: A snapshot operation exceeds {1} minutes for rule {2} on file
system {0}.
gui_snap_time_limit_exceeded_fset INFO WARNING no Message: A snapshot operation exceeds {1} minutes for rule {2} on file
system {3}, fileset {0}.
gui_snap_total_ops_exceeded INFO WARNING no Message: The total number of pending operations exceeds {1} operations.
gui_ssh_ok STATE_CHANGE INFO no Message: The GUI reported correct ssh connection in the cluster.
gui_ssh_warn STATE_CHANGE WARNING no Message: The GUI reported incorrect ssh connection.
gui_ssl_certificate_expired STATE_CHANGE ERROR no Message: The SSL certificate that is used by the GUI expired. Expiration
date was {0}.
gui_ssl_certificate_is_about_to_expire STATE_CHANGE WARNING no Message: The SSL certificate that is used by the GUI is about to expire.
Expiration date is {0}.
Cause: The SSL certificate that is used by the GUI is about to expire.
User Action: Go to the Service panel and select 'GUI'. On the 'Nodes' tab,
select an option to create a new certificate request, self-signed certificate,
or upload your own certificate.
gui_ssl_certificate_ok STATE_CHANGE INFO no Message: The SSL certificate that is used by the GUI is valid. Expiration
date is {0}.
Cause: N/A
gui_unreachable_node STATE_CHANGE ERROR no Message: The GUI cannot reach the node {0}.
User Action: Check your firewall or network setup, and whether the
specified node is up and running.
gui_up STATE_CHANGE INFO no Message: The status of the GUI service is {0} as expected.
gui_warn INFO INFO no Message: The GUI service returned an unknown result.
host_disk_filled STATE_CHANGE WARNING no Message: A local file system on node {0} reached a warning level {1}.
Description: The GUI checks the fill level of the local file systems.
host_disk_full STATE_CHANGE ERROR no Message: A local file system on node {0} reached a nearly exhausted level
{1}.
Description: The GUI checks the fill level of the local file systems.
host_disk_normal STATE_CHANGE INFO no Message: The local file systems on node {0} reached a normal level.
Description: The GUI checks the fill level of the local file systems.
host_disk_unknown STATE_CHANGE WARNING no Message: The fill level of local file systems on node {0} is unknown.
Description: The GUI checks the fill level of the local file systems.
sudo_admin_not_configured STATE_CHANGE ERROR no Message: Sudo wrappers are enabled on the cluster '{0}', but the GUI is not
configured to use Sudo wrappers.
Description: Sudo wrappers are enabled on the cluster, but the value for
GPFS_ADMIN in '/usr/lpp/mmfs/gui/conf/gpfsgui.properties' was either not
set or is still set to root. Set the value of 'GPFS_ADMIN' to the username
for which sudo wrappers were configured on the cluster.
Cause: N/A
User Action: Ensure that sudo wrappers were correctly configured for a
user that is available on the GUI node and all other nodes of the cluster.
Set this username as the value of the 'GPFS_ADMIN' option in the
'/usr/lpp/mmfs/gui/conf/gpfsgui.properties' file. After that, restart the GUI
by using the systemctl restart gpfsgui command.
sudo_admin_not_exist STATE_CHANGE ERROR no Message: Sudo wrappers are enabled on the cluster '{0}', but there
is a misconfiguration regarding the user '{1}' that was set as
'GPFS_ADMIN' in the GUI properties file.
Description: Sudo wrappers are enabled on the cluster, but the username
that was set as GPFS_ADMIN in the GUI properties file at '/usr/lpp/
mmfs/gui/conf/gpfsgui.properties' does not exist on the GUI node.
Cause: N/A
User Action: Ensure that sudo wrappers were correctly configured for
a user that is available on the GUI node and all other nodes of the
cluster. Set this username as the value of the 'GPFS_ADMIN' option in
the '/usr/lpp/mmfs/gui/conf/gpfsgui.properties' file. After that, restart the
GUI by using the systemctl restart gpfsgui command.
sudo_admin_set_but_disabled STATE_CHANGE WARNING no Message: Sudo wrappers are not enabled on the cluster '{0}', but
'GPFS_ADMIN' was set to a non-root user.
Description: Sudo wrappers are not enabled on the cluster, but the value
for 'GPFS_ADMIN' in the '/usr/lpp/mmfs/gui/conf/gpfsgui.properties' was
set to a non-root user. The value of 'GPFS_ADMIN' is set to 'root' when
sudo wrappers are not enabled on the cluster.
Cause: N/A
sudo_connect_error STATE_CHANGE ERROR no Message: Sudo wrappers are enabled on the cluster '{0}', but the GUI
cannot connect to other nodes with the username '{1}' that was defined as
'GPFS_ADMIN' in the GUI properties file.
Cause: N/A
User Action: Ensure that sudo wrappers were correctly configured for
a user that is available on the GUI node and all other nodes of the
cluster. Set this username as the value of the 'GPFS_ADMIN' option in
the '/usr/lpp/mmfs/gui/conf/gpfsgui.properties' file. After that, restart the
GUI by using the systemctl restart gpfsgui command.
sudo_ok STATE_CHANGE INFO no Message: Sudo wrappers were enabled on the cluster and the GUI
configuration for the cluster '{0}' is correct.
Description: No problems were found with the current GUI and cluster
configurations.
Cause: N/A
time_in_sync STATE_CHANGE INFO no Message: The time on node {0} is in sync with the cluster median.
Cause: The time on the specified node is in sync with the cluster median.
time_not_in_sync STATE_CHANGE ERROR no Message: The time on node {0} is not in sync with the cluster median.
Cause: The time on the specified node is not in sync with the cluster
median.
time_sync_unknown STATE_CHANGE WARNING no Message: The time on node {0} cannot be determined.
User Action: Check whether the node is reachable from the GUI.
Description: The GUI checks whether xCAT can manage the node.
User Action: Add the node to xCAT. Ensure that the hostname that is used
in xCAT matches the hostname that is known by the node itself.
xcat_nodelist_ok STATE_CHANGE INFO no Message: The node {0} is known to the xCAT.
Description: The GUI checks whether xCAT can manage the node.
xcat_nodelist_unknown STATE_CHANGE WARNING no Message: State of the node {0} in xCAT is unknown.
Description: The GUI checks whether xCAT can manage the node.
xcat_state_error STATE_CHANGE ERROR no Message: The xCAT on node {1} failed to operate properly on cluster {0}.
Cause: The node specified as xCAT host is reachable, but either xCAT is not
installed on the node or not operating properly.
User Action: Check the xCAT installation and try the xCAT commands
nodels, rinv, and rvitals to check for errors.
xcat_state_invalid_version STATE_CHANGE WARNING no Message: The xCAT service does not have the recommended version ({1}
actual/recommended).
xcat_state_no_connection STATE_CHANGE ERROR no Message: Unable to connect to xCAT node {1} on cluster {0}.
User Action: Check whether the IP address is correct and ensure that root
has key-based SSH set up to the xCAT node.
xcat_state_ok STATE_CHANGE INFO no Message: The availability of xCAT on cluster {0} is OK.
xcat_state_unconfigured STATE_CHANGE WARNING no Message: The xCAT host is not configured on cluster {0}.
hadoop_datanode_warn INFO WARNING no Message: Hadoop DataNode monitoring returned unknown results.
User Action: If this status persists after a few minutes, then restart the
DataNode service.
hadoop_namenode_warn INFO WARNING no Message: Hadoop NameNode monitoring returned unknown results.
Cause: N/A
User Action: If this status persists after a few minutes, then restart the
NameNode service.
hdfs_datanode_process_down STATE_CHANGE ERROR no Message: HDFS DataNode process for HDFS cluster {0} is down.
Description: The HDFS DataNode process is down.
User Action: Start the Hadoop DataNode process again by using the
'/usr/lpp/mmfs/hadoop/bin/hdfs --daemon start datanode' command.
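For illustration, the restart that the user action describes:
  # start the HDFS DataNode process on this node
  /usr/lpp/mmfs/hadoop/bin/hdfs --daemon start datanode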
hdfs_datanode_process_unknown STATE_CHANGE WARNING no Message: HDFS DataNode process for HDFS cluster {0} is unknown.
Description: The HDFS DataNode process is unknown.
User Action: Check the HDFS DataNode service. If needed, then restart
it by using the '/usr/lpp/mmfs/hadoop/bin/hdfs --daemon start datanode'
command.
hdfs_datanode_process_up STATE_CHANGE INFO no Message: HDFS DataNode process for HDFS cluster {0} is OK.
hdfs_namenode_active STATE_CHANGE INFO no Message: HDFS NameNode service state for HDFS cluster {0} is ACTIVE.
hdfs_namenode_config_missing STATE_CHANGE WARNING no Message: HDFS NameNode configuration for cluster {0} is missing.
Description: The HDFS NameNode configuration for the HDFS cluster is
missing on this node.
hdfs_namenode_error STATE_CHANGE ERROR no Message: HDFS NameNode health for HDFS cluster {0} is invalid.
User Action: Validate that the HDFS configuration is valid and try to start
the NameNode service manually.
hdfs_namenode_failed STATE_CHANGE ERROR no Message: HDFS NameNode health for HDFS cluster {0} failed.
hdfs_namenode_initializing STATE_CHANGE INFO no Message: HDFS NameNode service state for HDFS cluster {0} is
INITIALIZING.
hdfs_namenode_krb_auth_failed STATE_CHANGE WARNING no Message: HDFS NameNode check health state failed with kinit error for
cluster {0}.
hdfs_namenode_ok STATE_CHANGE INFO no Message: HDFS NameNode health for HDFS cluster {0} is OK.
hdfs_namenode_process_down STATE_CHANGE ERROR no Message: HDFS NameNode process for HDFS cluster {0} is down.
Description: The HDFS NameNode process is down.
User Action: Start the Hadoop NameNode process by using the mmces
service start hdfs command.
hdfs_namenode_process_unknown STATE_CHANGE WARNING no Message: HDFS NameNode process for HDFS cluster {0} is unknown.
Description: The HDFS NameNode process is unknown.
User Action: Check the HDFS NameNode service and, if needed, restart it
by using the mmces service start hdfs command.
hdfs_namenode_process_up STATE_CHANGE INFO no Message: HDFS NameNode process for HDFS cluster {0} is OK.
hdfs_namenode_standby STATE_CHANGE INFO no Message: HDFS NameNode service state for HDFS cluster {0} is in
STANDBY.
hdfs_namenode_stopping STATE_CHANGE INFO no Message: HDFS NameNode service state for HDFS cluster {0} is STOPPING.
hdfs_namenode_unauthorized STATE_CHANGE WARNING no Message: HDFS NameNode check health state failed for cluster {0}.
Description: Failed to query the health state because of missing or wrong
authentication token.
hdfs_namenode_unknown_state STATE_CHANGE WARNING no Message: HDFS NameNode service state for HDFS cluster {0} is
UNKNOWN.
hdfs_namenode_wrong_state STATE_CHANGE WARNING no Message: HDFS NameNode service state for HDFS cluster {0} is
unexpected {1}.
Keystone events
The following table lists the events that are created for the Keystone component.
Table 90. Events for the Keystone component
ks_failed STATE_CHANGE ERROR FTDC upload Message: The keystone (HTTPd) process should be {0}, but is {1}.
Description: The keystone (HTTPd) process is in an unexpected mode.
ks_ok STATE_CHANGE INFO no Message: The keystone (HTTPd) process as expected, state is {0}.
ks_restart INFO WARNING no Message: The {0} service failed. Trying to recover.
ks_url_warn INFO WARNING no Message: Keystone request {0} returned an unknown result.
ks_warn INFO WARNING no Message: The keystone (HTTPd) process monitoring returned an unknown
result.
ldap_reachable STATE_CHANGE INFO no Message: The external LDAP server {0} is up.
Cause: N/A
ldap_unreachable STATE_CHANGE ERROR no Message: The external LDAP server {0} is unresponsive.
User Action: Verify the network connection and check whether the LDAP
server is operational.
postgresql_failed STATE_CHANGE ERROR FTDC upload Message: The 'postgresql-obj' process should be {0}, but is {1}.
Description: The 'postgresql-obj' process is in an unexpected mode.
postgresql_ok STATE_CHANGE INFO no Message: The 'postgresql-obj' process as expected, state is {0}.
postgresql_warn INFO WARNING no Message: The 'postgresql-obj' process monitoring returned an unknown
result.
Local cache events
The following table lists the events that are created for the local cache (LROC) component.
Table 91. Events for the local cache component
lroc_buffer_desc_autotune_tip TIP TIP no Message: For optimal LROC tuning, based on average cached data
block sizes of {0} in previously observed workloads, a value of {1} for the
maxBufferDescs configuration on this node is recommended.
Cause: LROC disks are configured, but the LROC daemon is currently idle,
which can be a valid transitional state.
Cause: The result of checking the status of the LROC daemon was OK.
Cause: The result of checking the status of the LROC daemon reported that
it is down.
Cause: The result of checking the status of the LROC daemon reported that
it is in an unknown state.
Description: Local cache disk is defined, but is not configured with LROC.
Cause: The result of checking the configured local cache device was not
OK.
User Action: Check the physical status of the LROC device, and the LROC
configuration by using the mmlsnsd and mmdiag commands.
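A brief sketch of the suggested checks, assuming the standard IBM Storage Scale command paths:
  # List NSDs with extended device information, including local cache (LROC) devices
  mmlsnsd -X
  # Show LROC status and statistics on this node
  mmdiag --lroc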
lroc_disk_found INFO_ADD_ENTITY INFO no Message: The local cache disk {0} was found.
Cause: The result of examining the configured local cache device was OK.
Cause: The result of checking the configured local cache device was not
OK.
User Action: Check the local cache configuration by using the mmlsnsd
and mmdiag commands.
Cause: A local cache disk is not in use, which can be a valid situation.
lroc_sensors_clear STATE_CHANGE INFO no Message: Clear any previous bad GPFSLROC sensor state.
User Action: Enable the perfmon GPFSLROC sensor by setting the period
attribute of the GPFSLROC sensor to a value greater than 0 (the default
is 10). Use the mmperfmon config update GPFSLROC.period=N
command, where 'N' is a natural number greater than 0. Alternatively,
you can hide this event by using the mmhealth event hide
lroc_sensors_inactive command.
lroc_sensors_not_configured TIP TIP no Message: The GPFSLROC perfmon sensor is not configured.
lroc_sensors_not_needed TIP TIP no Message: LROC is not configured, but performance sensor GPFSLROC
period is declared.
User Action: Disable the perfmon GPFSLROC sensor by setting the period
attribute of the GPFSLROC sensor to 0. Use the mmperfmon
config update GPFSLROC.period=0 command. Alternatively,
you can hide this event by using the mmhealth event hide
lroc_sensors_not_needed command.
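The following sketch summarizes both User Actions, using only the commands quoted in the events above:
  # Enable the GPFSLROC sensor with the default period of 10 seconds
  mmperfmon config update GPFSLROC.period=10
  # Or disable the sensor when LROC is not configured on this node
  mmperfmon config update GPFSLROC.period=0
  # Alternatively, hide the TIP event instead of changing the sensor configuration
  mmhealth event hide lroc_sensors_not_needed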
lroc_set_buffer_desc_tip TIP TIP no Message: This node has LROC devices with a total capacity of {0}
GB. Optimal LROC performance requires setting the 'maxBufferDescs'
configuration option. The value of desired buffer descriptors for this node is
'{1}', based on assumed 4 MB data block size.
Description: Not enough buffer descriptors are available for optimal LROC
performance.
Network events
The following table lists the events that are created for the Network component.
Table 92. Events for the network component
bond_degraded STATE_CHANGE WARNING no Message: Some secondaries of the network bond {0} went down.
bond_down STATE_CHANGE ERROR no Message: All secondaries of the network bond {0} are down.
bond_nic_recognized STATE_CHANGE INFO no Message: Bond NIC {id} was recognized. Children {0}.
Description: The specified network bond NIC was correctly recognized for
usage by IBM Storage Scale.
Cause: The specified network bond NIC is reported in the mmfsadm dump
verbs command.
bond_up STATE_CHANGE INFO no Message: All secondaries of the network bond {0} are working as expected.
expected_file_missing INFO WARNING no Message: The expected configuration or program file {0} was not found.
User Action: Check for the existence of the file. If necessary, then install
required packages.
Cause: The user did not enable verbsRdma by using the mmchconfig
command.
ib_rdma_ext_port_speed_low TIP TIP no Message: InfiniBand RDMA NIC {id} uses a smaller extended port speed
than supported.
Description: The currently active extended link speed is less than the
supported value.
Cause: The currently active extended link speed is less than the supported
value.
User Action: Check the settings of the specified InfiniBand RDMA NIC
(ibportstate).
ib_rdma_ext_port_speed_ok TIP INFO no Message: InfiniBand RDMA NIC {id} uses maximum supported port speed.
Cause: N/A
ib_rdma_libs_found STATE_CHANGE INFO no Message: All checked library files can be found.
Cause: The library files are in the expected directories with expected
names.
Cause: Either the libraries are missing or their path names are wrongly set.
User Action: Check whether the 'librdmacm' and 'libibverbs' libraries are
installed. Also, check whether they can be found by the names that are
referenced in the mmfsadm test verbs config command.
Cause: Physical state of the specified InfiniBand RDMA NIC is not 'LinkUp'
according to ibstat.
User Action: Check the cabling of the specified InfiniBand RDMA NIC.
Description: The physical link of the specified InfiniBand RDMA NIC is up.
ib_rdma_nic_found INFO_ADD_ENTITY INFO no Message: InfiniBand RDMA NIC {id} was found.
ib_rdma_nic_recognized STATE_CHANGE INFO no Message: InfiniBand RDMA NIC {id} was recognized.
ib_rdma_nic_unrecognized STATE_CHANGE ERROR no Message: InfiniBand RDMA NIC {id} was not recognized.
Cause: The specified InfiniBand RDMA NIC is not reported in the mmfsadm
dump verbs command.
Cause: One of the previously monitored InfiniBand RDMA NICs is not listed
by ibstat anymore.
ib_rdma_port_speed_low STATE_CHANGE WARNING no Message: InfiniBand RDMA NIC {id} uses a smaller port speed than
enabled.
Description: The currently active link speed is less than the enabled
maximum link speed.
Cause: The currently active link speed is less than the enabled maximum
link speed.
User Action: Check the settings of the specified IB RDMA NIC (ibportstate).
ib_rdma_port_speed_ok STATE_CHANGE INFO no Message: InfiniBand RDMA NIC {id} uses maximum enabled port speed.
Cause: The currently active link speed is equal to the enabled maximum
link speed.
ib_rdma_port_speed_optimal TIP INFO no Message: InfiniBand RDMA NIC {id} uses maximum supported port speed.
ib_rdma_port_speed_suboptimal TIP TIP no Message: InfiniBand RDMA NIC {id} uses a smaller port speed than supported.
Description: The currently enabled link speed is less than the supported
maximum link speed.
Cause: The currently enabled link speed is less than the supported
maximum link speed.
User Action: Check the settings of the specified InfiniBand RDMA NIC
(ibportstate).
ib_rdma_port_width_low STATE_CHANGE WARNING no Message: InfiniBand RDMA NIC {id} uses a smaller port width than
enabled.
Description: The currently active link width is less than the enabled
maximum link width.
Cause: The currently active link width is less than the enabled maximum
link width.
User Action: Check the settings of the specified InfiniBand RDMA NIC
(ibportstate).
ib_rdma_port_width_ok STATE_CHANGE INFO no Message: InfiniBand RDMA NIC {id} uses maximum enabled port width.
Description: The currently active link width is equal to the enabled
maximum link width.
Cause: The currently active link width is equal to the enabled maximum
link width.
ib_rdma_port_width_optimal TIP INFO no Message: InfiniBand RDMA NIC {id} uses maximum supported port width.
ib_rdma_port_width_suboptimal TIP TIP no Message: InfiniBand RDMA NIC {id} uses a smaller port width than supported.
Description: The currently enabled link width is less than the supported
maximum link width.
Cause: The currently enabled link width is less than the supported
maximum link width.
User Action: Check the settings of the specified IB RDMA NIC (ibportstate).
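A hedged sketch of the suggested check; the LID and port number are placeholder example values that must be taken from the local fabric:
  # Show the local HCA ports with their active and enabled speed and width
  ibstat
  # Query the link parameters of the port with LID 4, port number 1 (example values)
  ibportstate 4 1 query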
ib_rdma_ports_ok STATE_CHANGE INFO no Message: verbsPorts is correctly set for InfiniBand RDMA.
ib_rdma_ports_undefined STATE_CHANGE ERROR no Message: No NICs and ports are set up for InfiniBand RDMA.
Cause: The user did not configure verbsPorts by using the mmchconfig
command.
User Action: Set up the NICs and ports to use with the verbsPorts setting
in the mmchconfig command.
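A minimal sketch of such a configuration, assuming a device named mlx5_0 and a node class named rdmaNodes; adjust both to the local environment:
  # Enable RDMA and define the HCA device and port that IBM Storage Scale uses
  mmchconfig verbsRdma=enable,verbsPorts="mlx5_0/1" -N rdmaNodes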
ib_rdma_ports_wrong STATE_CHANGE ERROR no Message: verbsPorts is incorrectly set for InfiniBand RDMA.
many_tx_errors STATE_CHANGE ERROR FTDC upload Message: NIC {0} had many TX errors since the last monitoring cycle.
Description: The network adapter had many TX errors since the last
monitoring cycle.
Cause: The '/proc/net/dev' file lists significantly more TX errors for this
adapter since the last monitoring cycle.
network_connectivity_down STATE_CHANGE ERROR no Message: NIC {0} cannot connect to the gateway.
User Action: Check the network configuration of the network adapter, path
to the gateway, and gateway itself.
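A simple connectivity sketch; the gateway address 192.0.2.1 is a documentation placeholder:
  # Identify the default gateway of this node
  ip route show default
  # Test whether the gateway is reachable
  ping -c 3 192.0.2.1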
network_connectivity_up STATE_CHANGE INFO no Message: NIC {0} can connect to the gateway.
Cause: A new NIC, which is relevant for the IBM Storage Scale monitoring,
is listed by ip a.
User Action: Find out why the IBM Storage Scale-relevant IPs were not
assigned to any NICs.
network_ips_partially_down STATE_CHANGE ERROR no Message: Some relevant IPs are not served by found NICs: {0}.
User Action: Find out why the specified IBM Storage Scale-relevant IPs
were not assigned to any NICs.
network_ips_up STATE_CHANGE INFO no Message: Relevant IPs are served by found NICs.
network_link_down STATE_CHANGE ERROR no Message: Physical link of the NIC {0} is down.
Cause: The 'LOWER_UP' flag is not set for this NIC in the output of ip a.
network_link_up STATE_CHANGE INFO no Message: Physical link of the NIC {0} is up.
Cause: The 'LOWER_UP' flag is set for this NIC in the output of ip a.
nic_firmware_not_available STATE_CHANGE WARNING no Message: The expected firmware level of adapter {id} is not available.
nic_firmware_ok STATE_CHANGE INFO no Message: The adapter {id} has the expected firmware level {0}.
nic_firmware_unexpected STATE_CHANGE WARNING no Message: The adapter {id} has firmware level {0} and not the expected
firmware level {1}.
no_tx_errors STATE_CHANGE INFO no Message: NIC {0} had no or a tiny number of TX errors.
rdma_roce_cma_tos TIP TIP no Message: NIC {id} The CMA type of service class is not set to the
recommended value.
Description: The CMA type of service class is not set to the recommended
value.
Cause: The CMA type of service class is not set to the recommended value.
User Action: Check the settings of the specified NIC by using the
cma_roce_tos command, and check the system health monitor
configuration in the mmsysmonitor.conf file.
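A hedged sketch of the query, assuming the Mellanox cma_roce_tos tool is installed and the device is named mlx5_0:
  # Show the current CMA type of service for the RDMA device
  cma_roce_tos -d mlx5_0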
rdma_roce_cma_tos_ok STATE_CHANGE INFO no Message: NIC {id} The CMA type of service class is set to the
recommended value.
Cause: The CMA type of service class is set to the recommended value.
rdma_roce_mtu_low TIP TIP no Message: NIC {id} The actual MTU size is less than the maximum MTU size.
Description: The actual MTU size is less than the maximum MTU size.
Cause: The actual MTU size is less than the maximum MTU size.
User Action: Check the MTU settings of the specified NIC by using the
'ibv_devinfo' command.
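A short sketch of the MTU comparison that this check performs:
  # Compare active_mtu against max_mtu for each port
  ibv_devinfo -v | grep -E 'hca_id|active_mtu|max_mtu'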
rdma_roce_mtu_ok STATE_CHANGE INFO no Message: NIC {id} The actual MTU size is OK.
Description: The actual MTU size is set to the maximum MTU size.
Cause: The actual MTU size is set to the maximum MTU size.
rdma_roce_pfc_prio_buffer_bad STATE_CHANGE WARNING no Message: NIC {id} The PFC buffer priority class is not set to the recommended value.
Description: The PFC buffer priority class is not set to the recommended
value, which might lead to a significant decrease in performance.
Cause: The PFC buffer priority class is not set to the recommended value.
User Action: Check the settings of the specified NIC by using the
mlnx_qos command, and check the system health monitor configuration
in the mmsysmonitor.conf file.
rdma_roce_pfc_prio_buffer_ok STATE_CHANGE INFO no Message: NIC {id} The PFC buffer priority class is set to the recommended value.
Description: The PFC buffer priority class is set to the recommended value.
Cause: The PFC buffer priority class is set to the recommended value.
rdma_roce_pfc_prio_enabled_bad STATE_CHANGE WARNING no Message: NIC {id} The enabled PFC priority class is not set to the recommended value.
Description: The enabled PFC priority class is not set to the recommended
value, which might lead to a significant decrease in performance.
Cause: The enabled PFC priority class is not set to the recommended
value.
User Action: Check the settings of the specified NIC (mlnx_qos) and the
system health monitor configuration file (mmsysmonitor.conf).
rdma_roce_pfc_prio_enabled_ok STATE_CHANGE INFO no Message: NIC {id} The enabled PFC priority class is set to the recommended value.
Cause: The enabled PFC priority class is set to the recommended value.
rdma_roce_qos_prio_trust STATE_CHANGE WARNING no Message: NIC {id} The RoCE QoS value for trust is not set to 'dscp'.
Description: The RoCE QoS setting for trust is not set to 'dscp', which might
lead to a significant decrease in performance.
Cause: The RoCE QoS setting for trust is not set to 'dscp'.
User Action: Check the settings of the specified RoCE NIC by using the
'mlnx_qos' command.
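A hedged sketch, assuming the RoCE interface is named enp1s0; replace the name with the local interface:
  # Show the current QoS configuration of the interface
  mlnx_qos -i enp1s0
  # Set the trust mode to dscp, as the event recommends
  mlnx_qos -i enp1s0 --trust dscp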
rdma_roce_qos_prio_trust_dscp STATE_CHANGE INFO no Message: NIC {id} The RoCE QoS setting for trust is set to 'dscp'.
Description: The RoCE QoS setting for trust is set to 'dscp'.
rdma_roce_tclass TIP TIP no Message: NIC {id} The traffic class is not set to the recommended value.
rdma_roce_tclass_ok STATE_CHANGE INFO no Message: NIC {id} The traffic class is set to the recommended value.
NFS events
The following table lists the events that are created for the NFS component.
Table 93. Events for the NFS component
dbus_error STATE_CHANGE WARNING no Message: The DBus availability check failed.
Cause: The DBus was detected as down, which might cause several issues
on the local node.
User Action: Stop the NFS service, restart the DBus, and start the NFS
service again.
dbus_error_pod STATE_CHANGE WARNING no Message: {id}: The DBus availability check failed.
Cause: The DBus was detected as down, which might cause several issues
on the local node.
User Action: Stop the NFS service, restart the DBus, and start the NFS
service again.
dir_statd_perm_ok STATE_CHANGE INFO no Message: The permissions of the local NFS statd directory are correct.
Description: The permissions of the local NFS statd directory are correct.
Cause: The permissions of the local NFS statd directory are correct.
dir_statd_perm_problem STATE_CHANGE WARNING no Message: The permissions of the local NFS statd directory might be
incorrect for operation. {0}={1} (reference={2}).
Cause: The permissions of the local NFS statd directory might be incorrect
for operation.
Cause: The user disabled the NFS service by using the mmces service
disable nfs command.
enable_nfs_service INFO_EXTERNAL INFO no Message: The CES NFS service was enabled.
Cause: The user enabled the NFS service by using the mmces service
enable nfs command.
User Action: Restart the NFS service when the root cause for this issue is
solved.
ganeshagrace INFO_EXTERNAL INFO no Message: The CES NFS service is set to a grace mode.
Description: The NFS server is set to a grace mode for a limited time,
which gives the time to previously connected clients to recover their file
locks.
knfs_available_warn STATE_CHANGE WARNING no Message: The kernel NFS service state is not masked or disabled.
Description: The kernel NFS service is available, but there is a risk that it
can be started, and can cause issues.
User Action: Check the NFS setup. The nfs.service should be deactivated
or masked to avoid conflicts with the IBM Storage Scale NFS server.
knfs_disabled_ok STATE_CHANGE INFO no Message: The kernel NFS service is disabled, but it should be masked.
Description: The kernel NFS service is disabled, but there is a risk that it
can be started.
User Action: Check the NFS setup. The nfs.service should be masked to
avoid conflicts with the IBM Storage Scale NFS server.
Description: The kernel NFS service is masked to avoid the service from
being accidentally started.
knfs_running_warn STATE_CHANGE WARNING no Message: The kernel NFS service state is active.
Description: The kernel NFS service is active and can cause conflicts with
the IBM Storage Scale NFS server.
User Action: Check the NFS setup. The nfs.service should be deactivated or
masked to avoid conflicts with the IBM Storage Scale NFS server.
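A sketch of the suggested cleanup on systemd-based distributions, where the kernel NFS unit is typically named nfs-server.service (an assumption; the unit name can vary):
  # Check whether the kernel NFS server is active
  systemctl status nfs-server.service
  # Mask the unit so that it cannot be started accidentally
  systemctl mask --now nfs-server.service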
mountd_rpcinfo_ok STATE_CHANGE INFO no Message: The NFS mountd service is listed by rpcinfo.
mountd_rpcinfo_unknown STATE_CHANGE WARNING no Message: The NFS mount service is not listed by rpcinfo.
Cause: The mountd service is not listed by rpcinfo, but expected to run.
Cause: The NFS server is either under high load or hung up, which restricts
the processing of the request.
User Action: Check the health state of the NFS server and restart, if
necessary.
Cause: N/A
Description: The NFS v4 NULL check failed. This check verifies whether
the NFS server reacts to NFS v4 requests. The NFS v4 protocol must be
enabled for this check. If this down state is detected, then further checks
are done to figure out whether the NFS server is still working. If the NFS
server seems not to be working, then a failover is triggered.
Cause: The NFS server is either under high load or hung up, which restricts
the processing of the request.
User Action: Check the health state of the NFS server and restart, if
necessary.
Cause: N/A
nfs_active_pod STATE_CHANGE INFO no Message: {id}: The NFS service is now active.
nfs_dbus_error STATE_CHANGE WARNING no Message: NFS check by using the DBus failed.
Cause: The NFS service is registered on DBus, but there was a problem
while accessing it.
User Action: Check the health state of the NFS service and restart the NFS
service. Check the log files for reported issues.
nfs_dbus_error_pod STATE_CHANGE WARNING no Message: {id}: NFS check by using the DBus failed.
Cause: The NFS service is registered on DBus, but there was a problem
while accessing it.
User Action: Check the health state of the NFS service and restart the NFS
service. Check the log files for reported issues.
nfs_dbus_failed STATE_CHANGE WARNING no Message: NFS check by using the DBus did not return the expected
message.
Cause: The NFS service is registered on DBus, but the check by using the
DBus did not return the expected result.
User Action: Stop the NFS service and start it again. Check the log
configuration of the NFS service.
nfs_dbus_failed_pod STATE_CHANGE WARNING no Message: {id}: NFS check by using the DBus did not return the expected
message.
Cause: The NFS service is registered on DBus, but the check by using the
DBus did not return the expected result.
User Action: Stop the NFS service and start it again. Check the log
configuration of the NFS service.
nfs_dbus_ok STATE_CHANGE INFO no Message: The NFS check by using the DBus is successful.
nfs_dbus_ok_pod STATE_CHANGE INFO no Message: {id}: The NFS check by using the DBus is successful.
nfs_exported_fs_chk STATE_CHANGE_EXTERNAL INFO no Message: The Cluster State Manager (CSM) cleared the 'nfs_exported_fs_down' event.
Description: Declared NFS exported file systems are either available again
on this node, or not available on any node.
nfs_exported_fs_down STATE_CHANGE_EXTERNAL ERROR no Message: One or more declared NFS exported file systems are not available on this node.
Description: One or more declared NFS exported file systems are not
available on this node. Other nodes might have those file systems available.
Cause: One or more declared NFS exported file systems are not available
on this node.
User Action: Check NFS export-related local and remote file system states.
nfs_exports_clear_state STATE_CHANGE INFO no Message: Clear local NFS export down state temporarily.
Description: Clear local NFS export down state temporarily, because an 'all
nodes have the same problem' message is received.
nfs_exports_down STATE_CHANGE WARNING no Message: One or more declared file systems for NFS exports are not
available.
Description: One or more declared file systems for NFS exports are
unavailable.
Cause: One or more declared file systems for NFS exports are unavailable.
nfs_exports_up STATE_CHANGE INFO no Message: All declared file systems for NFS exports are available.
Description: All declared file systems for NFS exports are available.
Cause: All declared file systems for NFS exports are available.
Description: The monitor detected that CES NFS is in a grace mode. During
this time, the NFS state is shown as degraded.
Description: The monitor detected that CES NFS is in a grace mode. During
this time, the NFS state is shown as degraded.
nfs_not_dbus STATE_CHANGE WARNING no Message: NFS service is unavailable as the DBus service.
Cause: The NFS service might be started while the DBus was down.
User Action: Stop the NFS service, restart the DBus, and start the NFS
service again.
nfs_not_dbus_pod STATE_CHANGE WARNING no Message: {id}: NFS service is unavailable as the DBus service.
Cause: The NFS service might be started while the DBus was down.
User Action: Stop the NFS service, restart the DBus, and start the NFS
service again.
nfs_openConnection INFO WARNING no Message: NFS has invalid open connection to CES IP {0}.
Cause: The NFS server has an open connection to a nonexistent CES IP.
nfs_rpcinfo_unknown STATE_CHANGE WARNING no Message: The NFS program is not listed by rpcinfo.
Cause: The NFS program is not listed by rpcinfo, but expected to run.
nfs_sensors_active TIP INFO no Message: The NFS perfmon sensor {0} is active.
Description: The NFS perfmon sensors are active. This event's monitor is
running only once an hour.
nfs_sensors_inactive TIP TIP no Message: The NFS perfmon sensor {0} is inactive.
Description: The NFS perfmon sensors are inactive. This event's monitor is
running only once an hour.
User Action: Set the period attribute of the NFS sensors to a value greater
than 0. For more information, use the mmperfmon config update
SensorName.period=N command where 'SensorName' is the name of
a specific NFS sensor and 'N' is a natural number greater than 0. Consider
that this TIP monitor is running only once per hour and it might take up to
one hour to detect the changes in the configuration.
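An illustrative sketch of the sensor update; 'NFSIO' is used here as an assumed sensor name and must be replaced by the sensor that the event reports:
  # Set the period of an NFS sensor to 10 seconds
  mmperfmon config update NFSIO.period=10
  # Verify the resulting sensor configuration
  mmperfmon config show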
nfs_sensors_not_configured TIP TIP no Message: The NFS perfmon sensor {0} is not configured.
Description: The NFS perfmon sensor does not exist in the mmperfmon
config show command.
Description: A check showed that the CES NFS service, which is supposed
to be running, is unresponsive.
User Action: Restart the CES NFS service when this state persists.
Description: A check showed that the CES NFS service, which is supposed
to be running, is unresponsive.
User Action: Restart the CES NFS service when this state persists.
User Action: Check the health state of the NFS server and restart, if
necessary. The process might get hung or be in a dysfunctional state.
Ensure that the kernel NFS server is not running.
nfsd_no_restart INFO WARNING no Message: NFSD process cannot be restarted. Reason: {0}.
Description: An expected NFS service process was not running and cannot
be restarted.
Cause: The NFS server process was not detected and cannot be restarted.
User Action: Check the health state of the NFS server and restart, if
necessary. Check the issues, which lead to the unexpected failure. Ensure
that the kernel NFS server is not running.
Cause: The NFS server process was not detected and restarted.
User Action: Check the health state of the NFS server and restart, if
necessary. Check the issues, which lead to the unexpected failure. Ensure
that the kernel NFS server is not running.
Cause: N/A
nfsd_warn INFO WARNING no Message: NFSD process monitoring returned an unknown result.
User Action: Check the health state of the NFS server and restart, if
necessary.
nfsserver_found_pod INFO_ADD_ENTITY INFO no Message: The NFS server {id} was found.
nfsserver_vanished_pod INFO_DELETE_ENTITY INFO no Message: The NFS Server {id} has vanished.
Cause: An NFS server is not in use for an IBM Storage Scale file system,
which can be a valid situation.
nlockmgr_rpcinfo_ok STATE_CHANGE INFO no Message: The NFS nlockmgr service is listed by rpcinfo.
nlockmgr_rpcinfo_unknown STATE_CHANGE WARNING no Message: The NFS nlockmgr service is not listed by rpcinfo.
Cause: The nlockmgr service is not listed by rpcinfo, but expected to run.
User Action: Check whether the portmapper service is running, and if any
services are conflicting with the portmapper service on this system.
User Action: Check whether the portmapper service is running, and if any
services are conflicting with the portmapper service on this system.
Cause: N/A
Cause: N/A
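A brief sketch of the rpcinfo checks that these events refer to:
  # List registered RPC services; nfs, mountd, and nlockmgr should appear
  rpcinfo -p localhost
  # Check that the portmapper (rpcbind) itself is running
  systemctl status rpcbind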
portmapper_warn INFO WARNING no Message: Portmapper port monitoring (111) returned an unknown result.
portmapper_warn_pod INFO WARNING no Message: {id}: Portmapper port monitoring (111) returned an unknown
result.
Cause: CES IP addresses are moved or added to the node, and activated.
rpc_mountd_inv_user STATE_CHANGE WARNING no Message: The mount port {0} does not belong to IBM Storage Scale NFS.
Cause: The mount daemon process does not belong to NFS Ganesha.
User Action: Check the NFS setup and ensure that the kernel NFS is not
activated.
rpc_mountd_ok STATE_CHANGE INFO no Message: The mountd service has the expected user.
rpc_nfs_inv_user STATE_CHANGE WARNING no Message: The NFS port {0} does not belong to IBM Storage Scale NFS.
User Action: Check the NFS setup and ensure that the kernel NFS is not
activated.
rpc_nfs_ok STATE_CHANGE INFO no Message: The NFS service has the expected user.
rpc_nlockmgr_inv_user STATE_CHANGE WARNING no Message: The nlockmgr port {0} does not belong to IBM Storage Scale NFS.
Cause: The lock manager process does not belong to NFS Ganesha.
User Action: Check the NFS setup and ensure that the kernel NFS is not
activated.
rpc_nlockmgr_ok STATE_CHANGE INFO no Message: The nlockmgr service has the expected user.
rpc_rpcinfo_warn INFO WARNING no Message: The rpcinfo check returned an unknown result.
rpcbind_down_pod STATE_CHANGE WARNING no Message: {id}: The rpcbind process is not running.
rpcbind_unresponsive STATE_CHANGE ERROR no Message: The rpcbind process is unresponsive. Attempt to restart.
Description: The rpcbind process does not work. A restart can help.
rpcbind_unresponsive_pod STATE_CHANGE ERROR no Message: {id}: The rpcbind process is unresponsive. Attempt to restart.
Description: The rpcbind process does not work. A restart can help.
Cause: N/A
Cause: N/A
rpcbind_warn STATE_CHANGE WARNING no Message: The rpcbind check failed with an issue.
rpcbind_warn_pod STATE_CHANGE WARNING no Message: {id}: The rpcbind check failed with an issue.
Cause: N/A
Cause: N/A
Cause: The NFS service was started by issuing the mmces service
start nfs command.
User Action: Stop and start the NFS service, which also attempts to start
the statd process.
statd_multiple STATE_CHANGE WARNING no Message: The rpc.statd process is running multiple times.
Cause: A statd process that is running multiple times indicates either an
issue with rpcbind or a manual start.
User Action: Stop and start the NFS service. This stops and restarts the
statd processes.
Cause: N/A
Cause: The statd process was not started by NFS startup. The command
line parameter mmstatdcallout is missing or has an unexpected owner.
User Action: Stop and start the NFS service. This stops and restarts the
statd processes.
Cause: The NFS service was stopped by using the mmces service stop
nfs command.
NVMe events
The following table lists the events that are created for the NVMe component.
Table 94. Events for the NVMe component
Description: An NVMe controller that was listed in the IBM Storage Scale
configuration was detected.
Cause: N/A
nvme_lbaformat_not_optimal STATE_CHANGE WARNING no Message: The NVMe device {0} does not show expected format.
User Action: Check the NVMe device format for metadata size (expect ms:
0) and relative performance (expect rp: 0).
nvme_lbaformat_ok STATE_CHANGE INFO no Message: The NVMe device {0} shows expected format.
Cause: N/A
nvme_linkstate_not_optimal STATE_CHANGE WARNING no Message: The NVMe device {0} reports a link state that does not match the
capabilities.
Description: The NVMe device does not have optimal link state.
nvme_linkstate_ok STATE_CHANGE INFO no Message: The NVMe device {0} reports a link state that matches the
capabilities.
Cause: N/A
nvme_needsservice STATE_CHANGE WARNING no Message: The NVMe controller {0} needs service.
Cause: N/A
Cause: N/A
nvme_operationalmode_warn STATE_CHANGE WARNING no Message: The NVMe controller {0} encountered either internal errors or
supercap health issues.
Cause: N/A
nvme_readonly_mode STATE_CHANGE WARNING no Message: NVMe controller {0} is moved to read-only mode.
Cause: The device is moved to read-only mode when the power source
does not allow backup, or when the flash spare block count reaches an
unsupported threshold.
nvme_sparespace_low STATE_CHANGE WARNING no Message: The NVMe controller {0} either indicates program-erase cycles
greater than 90% or supercap end of lifetime is less than or equal to 2
months.
Cause: N/A
nvme_state_inconsistent STATE_CHANGE WARNING no Message: The NVMe controller {0} reports inconsistent state information.
Cause: N/A
nvme_temperature_warn STATE_CHANGE WARNING no Message: NVMe controller {0} reports that the CPU, System, or
Supercap temperature is beyond the critical threshold of a component.
Cause: N/A
User Action: Check the system cooling, such as air blocked or fan failed.
Description: An NVMe controller, which was listed in the IBM Storage Scale
configuration, was not detected.
User Action: Run the 'nvme' command to verify that all expected NVMe
adapters exist.
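A hedged sketch of the suggested verification, assuming the nvme-cli package is installed and /dev/nvme0n1 as an example namespace:
  # List all NVMe controllers and namespaces that the operating system detects
  nvme list
  # Inspect the LBA formats of a namespace; the format events above expect ms:0 and rp:0
  nvme id-ns /dev/nvme0n1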
nvmeof_raw_disk_absent STATE_CHANGE WARNING no Message: The NVMeoF disk {id} is expected to be installed, but it is absent.
nvmeof_raw_disk_enabled STATE_CHANGE WARNING no Message: The NVMeoF disk {id} is installed, but not configured.
nvmeof_raw_disk_failed STATE_CHANGE WARNING no Message: The NVMeoF disk {id}, which is not exported to the GNR, reports
an unknown failure.
Description: An NVMeoF disk, which is not exported to the GNR, has failed.
User Action: Check whether the NVMeoF disk is correctly installed. For
more information, see the IBM Storage Scale: Problem Determination Guide
of the relevant system. Contact IBM support if you need more help.
nvmeof_raw_disk_found INFO_ADD_ENTITY INFO no Message: The NVMeoF disk {id}, which is not exported to the GNR, runs as
expected.
Cause: N/A
nvmeof_raw_disk_ok STATE_CHANGE INFO no Message: The NVMeoF disk {id}, which is not exported to the GNR, runs as
expected.
Cause: N/A
nvmeof_raw_disk_smart_failed STATE_CHANGE WARNING service ticket Message: The NVMeoF disk {id}, which is not exported to the GNR, should be replaced; otherwise, a malfunction can occur.
Description: The smart assessment of an NVMeoF disk, which is not
exported to the GNR, has failed.
Cause: An NVMeoF disk has a failed smart assessment. This disk is not
exported to the GNR.
User Action: Replace the disk. Contact IBM support if you need more help.
nvmeof_raw_disk_smart_ok STATE_CHANGE INFO no Message: The smart assessment of an NVMeoF disk {id}, which is not
exported to the GNR, returns a healthy report.
Cause: N/A
nvmeof_raw_disk_smart_unknown STATE_CHANGE WARNING service ticket Message: The system is likely updating the status of an NVMeoF disk {id}. The process should be transient.
Description: No smart information is received from an NVMeoF disk, which
is not exported to the GNR.
Cause: An NVMeoF disk does not report a smart assessment. This disk is
not exported to the GNR.
nvmeof_raw_disk_standby_offline STATE_CHANGE WARNING no Message: The NVMeoF disk {id} is set to an offline state by the user.
Description: An NVMeoF disk, which should not be exported to the GNR, is
set to an offline state.
Cause: The hardware monitoring system detects an NVMeoF disk that was
put to an offline state. This disk was not exported to the GNR before.
nvmeof_raw_disk_unavailable_offline STATE_CHANGE WARNING no Message: The NVMeoF disk {id} is set offline for an unknown reason.
Description: An NVMeoF disk, which should not be exported to the GNR, is
set offline without a known reason.
User Action: Check for possible problems, such as missing power. For more
information, see the IBM Storage Scale: Problem Determination Guide of the
relevant system. Contact IBM support if you need more help.
nvmeof_raw_disk_unknown STATE_CHANGE WARNING no Message: The system is likely updating the status of an NVMeoF disk {id}.
The process should be transient.
nvmeof_raw_disk_vanished INFO_DELETE_ENTITY INFO no Message: An NVMeoF disk, which is in raw mode and was previously
reported, is not detected anymore.
Cause: An NVMeoF disk, which was previously detected in the IBM Storage
Scale configuration, was not found.
User Action: Verify that all expected NVMeoF disks in the raw mode exist in
the IBM Storage Scale configuration.
NVMeoF events
The following table lists the events that are created for the NVMeoF component.
Table 95. Events for the NVMeoF component
nvmeof_devices_missing STATE_CHANGE WARNING no Message: The following devices are configured, but do not exist: {0}.
User Action: Add the missing files or remove the NVMeoF definitions.
nvmeof_devices_not_configured STATE_CHANGE WARNING no Message: The following devices exist, but they are not configured: {0}.
Description: No NVMeoF device is configured in CCR file
nvmeofft_config.json.
nvmeof_devices_ok STATE_CHANGE INFO no Message: No problems found for any NVMeoF devices.
Cause: N/A
nvmeof_module_missing STATE_CHANGE ERROR no Message: NVMeoF kernel modules are missing: {0}.
Cause: The lsmod command reported that at least one NVMeoF kernel
module is missing.
Cause: N/A
Cause: N/A
nvmeof_multipath_enabled STATE_CHANGE WARNING no Message: Native multipath is not disabled for NVMeoF, but disabled
multipath is required.
Cause: N/A
Cause: The rpm -qa command reports that an NVMeoF related package is
missing.
Cause: N/A
nvmeof_target_device_cachin STATE_CHANGE WARNING no Message: NVMeoF target device caching is not write-through for: {0}.
g_wrong
Description: Target device caching for NVMeoF is not write-through, which
might lead to data loss or corruption.
User Action: Set the target device caching for NVMeoF to write-through.
nvmeof_unknown_devices_configured STATE_CHANGE WARNING no Message: The following nonexistent devices are listed in CCR file nvmeofft_config.json: {0}.
User Action: Remove the detected files from CCR file nvmeofft_config.json.
Object events
The following table lists the events that are created for the Object component.
Important:
• The CES Swift Object protocol feature is not supported from IBM Storage Scale 5.1.9 onwards.
• IBM Storage Scale 5.1.8 is the last release that has the CES Swift Object protocol.
• IBM Storage Scale 5.1.9 will tolerate the update of a CES node from IBM Storage Scale 5.1.8.
– Tolerate means:
- The CES node will be updated to 5.1.9.
Table 96. Events for the Object component
account-auditor_failed STATE_CHANGE ERROR no Message: The account-auditor process should be {0}, but is {1}.
account-auditor_ok STATE_CHANGE INFO no Message: The state of the account-auditor process, as expected, is {0}.
account-auditor_warn INFO WARNING no Message: The account-auditor process monitoring returned an unknown
result.
account-reaper_failed STATE_CHANGE ERROR no Message: The account-reaper process should be {0}, but is {1}.
account-reaper_ok STATE_CHANGE INFO no Message: The state of account-reaper process, as expected, is {0}.
account-reaper_warn INFO WARNING no Message: The account-reaper process monitoring returned an unknown
result.
account-replicator_failed STATE_CHANGE ERROR no Message: The account-replicator process should be {0}, but is {1}.
account-replicator_ok STATE_CHANGE INFO no Message: The state of account-replicator process, as expected, is {0}.
account-replicator_warn INFO WARNING no Message: The account-replicator process monitoring returned an unknown
result.
account-server_failed STATE_CHANGE ERROR no Message: The account process should be {0}, but is {1}.
account-server_ok STATE_CHANGE INFO no Message: The state of account process, as expected, is {0}.
account-server_warn INFO WARNING no Message: The account process monitoring returned an unknown result.
account_access_down STATE_CHANGE ERROR no Message: No access to account service ip {0} and port {1}. Check the
firewall.
User Action: Check whether the account service is running and the firewall
rules.
account_access_up STATE_CHANGE INFO no Message: Access to account service ip {0} and port {1} is OK.
Description: The access check of the account service port was successful.
Cause: N/A
account_access_warn INFO WARNING no Message: Account service access check ip {0} and port {1} failed. Check for
validity.
User Action: Find potential issues for this kind of failure in the logs.
container-auditor_failed STATE_CHANGE ERROR no Message: The container-auditor process should be {0}, but is {1}.
container-auditor_ok STATE_CHANGE INFO no Message: The state of container-auditor process, as expected, is {0}.
container-auditor_warn INFO WARNING no Message: The container-auditor process monitoring returned an unknown
result.
container-replicator_failed STATE_CHANGE ERROR no Message: The container-replicator process should be {0}, but is {1}.
container-replicator_ok STATE_CHANGE INFO no Message: The state of container-replicator process, as expected, is {0}.
container-server_failed STATE_CHANGE ERROR no Message: The container process should be {0}, but is {1}.
container-server_ok STATE_CHANGE INFO no Message: The state of container process, as expected, is {0}.
container-server_warn INFO WARNING no Message: The container process monitoring returned an unknown result.
container-updater_failed STATE_CHANGE ERROR no Message: The container-updater process should be {0}, but is {1}.
container-updater_ok STATE_CHANGE INFO no Message: The state of container-updater process, as expected, is {0}.
container-updater_warn INFO WARNING no Message: The container-updater process monitoring returned an unknown
result.
container_access_down STATE_CHANGE ERROR no Message: No access to container service ip {0} and port {1}. Check the
firewall.
User Action: Check whether the file system daemon is running and the
firewall rules.
container_access_up STATE_CHANGE INFO no Message: Access to container service ip {0} and port {1} is OK.
Cause: N/A
container_access_warn INFO WARNING no Message: Container service access check ip {0} and port {1} failed. Check
for validity.
User Action: Find potential issues for this kind of failure in the logs.
Cause: A CES IP with a singleton or database flag, which is linked to it, was
either removed from or moved to this node.
Cause: A CES IP with a singleton or database flag, which is linked to it, was
either removed from or moved to this node.
Cause: A CES IP with a singleton or database flag, which is linked to it, was
either removed from or moved to this node.
Cause: A CES IP with a singleton or database flag, which is linked to it, was
either removed from or moved to this node.
ibmobjectizer_failed STATE_CHANGE ERROR no Message: The ibmobjectizer process should be {0}, but is {1}.
ibmobjectizer_ok STATE_CHANGE INFO no Message: The state of ibmobjectizer process, as expected, is {0}.
ibmobjectizer_warn INFO WARNING no Message: The ibmobjectizer process monitoring returned an unknown
result.
memcached_failed STATE_CHANGE ERROR no Message: The memcached process should be {0}, but is {1}.
memcached_ok STATE_CHANGE INFO no Message: The state of memcached process, as expected, is {0}.
memcached_warn INFO WARNING no Message: The memcached process monitoring returned an unknown
result.
obj_restart INFO WARNING no Message: The {0} service failed. Trying to recover.
object-expirer_failed STATE_CHANGE ERROR no Message: The object-expirer process should be {0}, but is {1}.
object-expirer_ok STATE_CHANGE INFO no Message: The state of object-expirer process, as expected, is {0}.
object-expirer_warn INFO WARNING no Message: The object-expirer process monitoring returned an unknown
result.
object-replicator_failed STATE_CHANGE ERROR no Message: The object-replicator process should be {0}, but is {1}.
object-replicator_ok STATE_CHANGE INFO no Message: The state of object-replicator process, as expected, is {0}.
object-replicator_warn INFO WARNING no Message: The object-replicator process monitoring returned an unknown
result.
object-server_failed STATE_CHANGE ERROR no Message: The object process should be {0}, but is {1}.
object-server_ok STATE_CHANGE INFO no Message: The state of object process, as expected, is {0}.
object-server_warn INFO WARNING no Message: The object process monitoring returned an unknown result.
object-updater_failed STATE_CHANGE ERROR no Message: The object-updater process should be {0}, but is {1}.
object-updater_ok STATE_CHANGE INFO no Message: The state of object-updater process, as expected, is {0}.
object-updater_warn INFO WARNING no Message: The object-updater process monitoring returned an unknown
result.
object_access_down STATE_CHANGE ERROR no Message: No access to object store ip {0} and port {1}. Check the firewall.
User Action: Check whether the object service is running and the firewall
rules.
object_access_up STATE_CHANGE INFO no Message: Access to object store ip {0} and port {1} is OK.
Description: The access check to the object service port was successful.
Cause: N/A
object_access_warn INFO WARNING no Message: Object store access check ip {0} and port {1} failed.
User Action: Find potential issues for this kind of failure in the logs.
object_quarantined INFO_EXTERNAL WARNING no Message: The object "{0}", container "{1}", account "{2}" is quarantined.
Path of quarantined object: "{3}".
object_sof_access_down STATE_CHANGE ERROR no Message: No access to unified object store ip {0} and port {1}. Check the
firewall.
Description: The access check to the unified object service port failed.
User Action: Check whether the unified object service is running and the
firewall rules.
object_sof_access_up STATE_CHANGE INFO no Message: Access to unified object store ip {0} and port {1} is OK.
Description: The access check to the unified object service port was
successful.
Cause: N/A
object_sof_access_warn INFO WARNING no Message: Unified object store access check ip {0} and port {1} failed. Check
for validity.
Description: The access check to the object unified access service port
returned an unknown result.
Cause: The unified object service port access cannot be determined due to
a problem.
User Action: Find potential issues for this kind of failure in the logs.
openstack-object-sof_failed STATE_CHANGE ERROR no Message: The object-sof process should be {0}, but is {1}.
openstack-object-sof_ok STATE_CHANGE INFO no Message: The state of object-sof process, as expected, is {0}.
openstack-object-sof_warn INFO INFO no Message: The object-sof process monitoring returned an unknown result.
openstack-swift-object-auditor_failed STATE_CHANGE ERROR no Message: The object-auditor process should be {0}, but is {1}.
Description: The object-auditor process is not in the expected state.
openstack-swift-object-auditor_ok STATE_CHANGE INFO no Message: The state of object-auditor process, as expected, is {0}.
Description: The object-auditor process is in the expected state.
openstack-swift-object-auditor_warn INFO INFO no Message: The object-auditor process monitoring returned an unknown result.
Cause: N/A
proxy-httpd-server_failed STATE_CHANGE ERROR no Message: The proxy process should be {0}, but is {1}.
proxy-httpd-server_ok STATE_CHANGE INFO no Message: The state of proxy process, as expected, is {0}.
proxy-httpd-server_warn INFO WARNING no Message: The proxy process monitoring returned an unknown result.
Cause: A status query for HTTPd, which runs the proxy server, returned an
unexpected error.
proxy-server_failed STATE_CHANGE ERROR FTDC upload Message: The proxy process should be {0}, but is {1}.
Description: The proxy-server process is not running.
proxy-server_ok STATE_CHANGE INFO no Message: The state of proxy process, as expected, is {0}.
proxy-server_warn INFO WARNING no Message: The proxy process monitoring returned an unknown result.
proxy_access_down STATE_CHANGE ERROR no Message: No access to proxy service ip {0} and port {1}. Check the firewall.
User Action: Check whether the proxy service is running and the firewall
rules.
proxy_access_up STATE_CHANGE INFO no Message: Access to proxy service ip {0}, port {1} is OK.
Description: The access check of the proxy service port was successful.
Cause: N/A
proxy_access_warn INFO WARNING no Message: Proxy service access check ip {0} and port {1} failed. Check for
validity.
User Action: Find potential issues for this kind of failure in the logs.
ring_checksum_failed STATE_CHANGE ERROR FTDC upload Message: Checksum of ring file {0} does not match the one in CCR.
Description: Files for object rings were modified unexpectedly.
Cause: Checksum of ring file did not match the stored value.
ring_checksum_warn INFO WARNING no Message: Issue while checking checksum of ring file is {0}.
User Action: Check whether the ring files exist and the md5sum command is executable.
Cause: The OBJECT service was started by using the mmces service
start obj command.
Cause: The OBJECT service was stopped by using the mmces service
stop obj command.
Performance events
The following table lists the events that are created for the Performance component.
Table 97. Events for the Performance component
pmcollector_down STATE_CHANGE ERROR no Message: The pmcollector service should be {0}, but is {1}.
pmcollector_port_down STATE_CHANGE ERROR no Message: Performance monitoring collector port {id} ({0}) is not
responding.
User Action: Check the pmcollector process and logs and verify that it runs
correctly.
pmcollector_port_up STATE_CHANGE INFO no Message: Performance monitoring collector port {0} is responding.
pmcollector_port_warn INFO INFO no Message: The pmcollector service port monitor returned an unknown
result.
User Action: Check the pmcollector process and logs and verify that it runs
correctly.
pmcollector_up STATE_CHANGE INFO no Message: The state of pmcollector service, as expected, is {0}.
pmcollector_warn INFO INFO no Message: The pmcollector service has returned an unknown result.
pmsensors_down STATE_CHANGE ERROR no Message: The pmsensors service should be {0}, but is {1}.
pmsensors_up STATE_CHANGE INFO no Message: The state of pmsensors service, as expected, is {0}.
pmsensors_warn INFO INFO no Message: The pmsensors service returned an unknown result.
Server RAID events
The following table lists the events that are created for the server RAID component.
Table 98. Events for the server RAID component
raid_adapter_clear STATE_CHANGE INFO no Message: No server RAID data was listed in the output of the test program.
Description: No server RAID data was listed in the output of the test
program /sbin/iprconfig.
Cause: No server RAID data was listed in the output of the test program.
raid_check_warn INFO WARNING no Message: The disk states of the mirrored root partition cannot be
determined.
Description: The server RAID test program failed or ran into a timeout.
Cause: The server RAID test program /sbin/iprconfig failed or ran into
a timeout.
raid_root_disk_bad STATE_CHANGE WARNING no Message: {id} Mirrored root partition disk failed.
raid_root_disk_ok STATE_CHANGE INFO no Message: {id} Mirrored root partition disk is OK.
Description: The disks of the mirrored (RAID 10) root partition are OK.
Cause: The disks of the mirrored (RAID 10) root partition are OK.
raid_sas_adapter_bad STATE_CHANGE WARNING no Message: IBM Power RAID adapter {0} is degraded, which impacts small
write performance.
User Action: Check the RAID adapter card. For more information, execute
the /sbin/iprconfig -c show-arrays command.
raid_sas_adapter_ok STATE_CHANGE INFO no Message: IBM Power RAID adapter {0} is OK.
SMB events
The following table lists the events that are created for the SMB component.
Table 99. Events for the SMB component
ctdb_down STATE_CHANGE ERROR FTDC upload Message: The CTDB process is not running.
Description: The CTDB process is not running.
Cause: N/A
Cause: N/A
Cause: N/A
Cause: N/A
Cause: N/A
Cause: N/A
Cause: The CTDB service successfully passed the version check on a node
and can join the running cluster. The node is given.
ctdb_version_mismatch STATE_CHANGE_EXTERNAL ERROR FTDC upload Message: Cannot start CTDB version {0} as {1} is already running in the cluster.
Description: CTDB cannot start on a node because it detected that a CTDB
cluster is running at a different version on other CES nodes. This prevents
the SMB service from becoming healthy.
ctdb_warn INFO WARNING no Message: The CTDB monitoring returned an unknown result.
Cause: N/A
smb_exported_fs_chk STATE_CHANGE_EXTERNAL INFO no Message: The Cluster State Manager (CSM) cleared the smb_exported_fs_down event.
Description: Declared SMB exported file systems are either available again
on this node or not available on any node.
smb_exported_fs_down STATE_CHANGE_EXTERNAL ERROR no Message: One or more declared SMB exported file systems are not available on this node.
Description: One or more declared SMB exported file systems are not
available on this node. Other nodes might have those file systems available.
Cause: One or more declared SMB exported file systems are not available
on this node.
User Action: Check the SMB export-related local and remote file system
states.
smb_exports_clear_state STATE_CHANGE INFO no Message: Clear local SMB export down state temporarily.
smb_exports_down STATE_CHANGE WARNING no Message: One or more declared file systems for SMB exports are not
available.
Description: One or more declared file systems for SMB exports are not
available.
Cause: One or more declared file systems for SMB exports are not
available.
smb_exports_up STATE_CHANGE INFO no Message: All declared file systems for SMB exports are available.
Description: All declared file systems for SMB exports are available.
Cause: All declared file systems for SMB exports are available.
smb_restart INFO WARNING no Message: The SMB service failed. Trying to recover.
smb_sensors_active TIP INFO no Message: The SMB perfmon sensor {0} is active.
Description: The SMB perfmon sensors are active. This event's monitor is
running only once an hour.
smb_sensors_inactive TIP TIP no Message: The following SMB perfmon sensor {0} is inactive.
Description: The SMB perfmon sensors are inactive. This event's monitor is
running only once an hour.
User Action: Set the period attribute of the SMB sensors to a value greater
than 0. Run the mmperfmon config update SensorName.period=N
command, where 'SensorName' is the name of one of the SMB sensors and
'N' is a natural number greater than 0. Consider that this TIP monitor runs
only once per hour, so it might take up to one hour to detect the changes in
the configuration.
smb_sensors_not_configured TIP TIP no Message: The SMB perfmon sensor {0} is not configured.
Description: The SMB perfmon sensor does not exist in the mmperfmon
config show command.
Cause: N/A
Cause: N/A
smbd_warn INFO WARNING no Message: The SMBD process monitoring returned an unknown result.
Cause: N/A
Cause: N/A
Cause: N/A
smbport_warn INFO WARNING no Message: The SMB port monitoring {0} returned an unknown result.
Cause: N/A
Cause: The SMB service was started by using the mmces service start
smb command.
Cause: The SMB service was stopped by using the mmces service stop
smb command.
site_degraded_replication STATE_CHANGE WARNING no Message: Replication issues are reported at site {id}.
User Action: Check the health of the site recovery group and take any
corrective action, such as issuing the mmrestripefs command.
Cause: N/A
site_fs_desc_fail STATE_CHANGE ERROR no Message: Site {id} has no descriptor disks for all defined file systems.
Description: All file systems at the site have failure groups with no
descriptor disks.
User Action: Check the health of the file system descriptor disks at the site
and ensure that they are working properly on all nodes.
site_fs_desc_ok STATE_CHANGE INFO no Message: Site {id} file system descriptor disk health is OK.
Cause: N/A
site_fs_desc_warn STATE_CHANGE WARNING no Message: Site {id} file system {0} has no descriptor disks in failure groups
{1}.
Description: One or more file systems have descriptor disks that are
missing in the failure groups.
User Action: Check the health of the file system descriptor disks at the site
and ensure that they are working properly on all nodes.
site_fs_down STATE_CHANGE ERROR no Message: File system {0} is down or unavailable at site {id}.
User Action: Check the health of the file system at the site and ensure that
it is properly mounted on all nodes.
site_fs_ok STATE_CHANGE INFO no Message: Site {id} file system health is OK.
Cause: N/A
site_fs_quorum_fail STATE_CHANGE ERROR no Message: Site {id} file system {0} does not have enough healthy descriptor
disks for quorum.
Cause: The file system at the site does not have enough healthy descriptor
disks for quorum.
User Action: Check the health state of disks, which are declared as
descriptor disks for the file system, to prevent potential data loss. For more
information, see the Disk issues section in the IBM Storage Scale: Problem
Determination Guide.
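A short sketch of the descriptor-disk check, with gpfs0 as an assumed file system name:
  # Show the disks of the file system; the 'desc' entries mark descriptor disks
  mmlsdisk gpfs0 -L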
site_fs_warn STATE_CHANGE WARNING no Message: Site {id} has {0} nodes that face file system issues with {1}.
Description: Many nodes face file system events at the site, which indicate
network, resource, or configuration issues.
User Action: Check the health of the file system at the site and ensure that
it is properly mounted on all nodes.
Cause: N/A
site_gpfs_warn STATE_CHANGE WARNING no Message: Site {id} has {0} nodes that are facing GPFS unavailable health
events.
Description: Many nodes are facing GPFS unavailable events at the site,
which might indicate network, resource, or configuration issues.
Cause: Many nodes have reported GPFS unavailable events at the site.
site_heartbeats_degraded STATE_CHANGE WARNING no Message: Site {id} has {0} nodes with missing heartbeat health events.
Description: Many nodes face missing heartbeat events at the site, which
might indicate network, resource, or configuration issues.
Cause: N/A
site_missing_heartbeats STATE_CHANGE ERROR no Message: Heartbeats are missing from site {id}.
Description: Heartbeats are missing from the site, which might indicate
network, resource, or configuration issues.
Cause: N/A
User Action: Check the health of the GPFS quorum state by using the
mmgetstate command and take corrective actions.
site_quorum_error STATE_CHANGE ERROR no Message: Site {id} is experiencing quorum issues with site {0}.
Description: Site nodes are unable to contact the quorum nodes at another
site.
User Action: Check the health of the GPFS quorum state by using the
mmgetstate command and take corrective actions.
Cause: N/A
Cause: N/A
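A minimal sketch of the quorum check that the User Actions above describe:
  # Show the GPFS daemon state on all nodes, including quorum information
  mmgetstate -aL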
site_vanished INFO_DELETE_ENTITY INFO no Message: Site {id} is no longer configured as a stretch cluster site node.
Cause: N/A
tct_account_active STATE_CHANGE INFO no Message: Cloud provider account, which is configured with Transparent
Cloud Tiering service, is active.
Cause: N/A
tct_account_bad_req STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud provider
because of a bad request error.
User Action: For more information, check the trace messages and error
logs.
tct_account_certinvalidpath STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud provider
because it was unable to find a valid certification path.
User Action: For more information, check the trace messages and error
logs.
tct_account_configerror STATE_CHANGE ERROR no Message: Transparent Cloud Tiering refuses to connect to the cloud
provider.
tct_account_configured STATE_CHANGE WARNING no Message: Cloud provider account is configured with Transparent Cloud
Tiering, but the service is down.
tct_account_connecterror STATE_CHANGE ERROR no Message: An error occurred while attempting to connect a socket to cloud
provider URL.
User Action: Check whether the cloud provider hostname and port
numbers are valid.
tct_account_containercreateerror STATE_CHANGE ERROR no Message: The cloud provider container creation failed.
Description: The cloud provider container creation failed.
User Action: For more information, check the trace messages and
error logs. Also, check for account creation issues. For more
information, see the 'Transparent Cloud Tiering issues' section in the
Problem Determination Guide.
tct_account_dbcorrupt STATE_CHANGE ERROR no Message: The database of Transparent Cloud Tiering service is corrupted.
User Action: For more information, check the trace messages and error
logs. Use the mmcloudgateway files rebuildDB command to repair
it.
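For example, a sketch of the repair command (the device argument is an assumption; see the mmcloudgateway command reference for the exact syntax):
mmcloudgateway files rebuildDB <Device>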
tct_account_direrror STATE_CHANGE ERROR no Message: Transparent Cloud Tiering failed because one of its internal
directories is not found.
User Action: For more information, check the trace messages and error
logs.
tct_account_invalidcredentials STATE_CHANGE ERROR no Message: The cloud provider account credentials are invalid.
Description: The Transparent Cloud Tiering service failed to connect to the
cloud provider because the authentication failed.
Cause: The reason can be an 'HTTP 404 Not Found' error.
tct_account_lkm_down STATE_CHANGE ERROR no Message: The local key manager, which is configured for Transparent Cloud
Tiering is not found or corrupted.
User Action: For more information, check the trace messages and error
logs.
tct_account_manyretries INFO WARNING no Message: Transparent Cloud Tiering service faced too many internal
retries.
User Action: For more information, check the trace messages and error
logs.
tct_account_network_down STATE_CHANGE ERROR no Message: The network connection to the Transparent Cloud Tiering node is
down.
User Action: For more information, check the trace messages and error
logs. Also, check whether the network connection is valid.
tct_account_noroute STATE_CHANGE ERROR no Message: The response from cloud provider is invalid.
tct_account_notconfigured STATE_CHANGE WARNING no Message: Transparent Cloud Tiering is not configured with the cloud
provider account.
Description: The Transparent Cloud Tiering is not configured with the cloud
provider account.
tct_account_preconderror STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud provider
because of a precondition failed error.
User Action: For more information, check the trace messages and error
logs.
tct_account_rkm_down STATE_CHANGE ERROR no Message: The remote key manager, which is configured for Transparent
Cloud Tiering, is inaccessible.
Cause: The Transparent Cloud Tiering fails to connect to the IBM Security
Key Lifecycle Manager.
User Action: For more information, check the trace messages and error
logs.
tct_account_servererror STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud provider
because the cloud provider service encounters an unavailability error.
User Action: For more information, check the trace messages and error
logs.
tct_account_sockettimeout STATE_CHANGE ERROR no Message: Timeout occurred on a socket while connecting to the cloud
provider.
User Action: For more information, check the trace messages and error
log. Also, check whether the network connection is valid.
tct_account_sslbadcert STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud provider
because of a bad SSL certificate.
User Action: For more information, check the trace messages and error
logs.
tct_account_sslcerterror STATE_CHANGE ERROR no Message: Transparent Cloud Tiering failed to connect to the cloud provider
because of an untrusted server certificate chain.
User Action: For more information, check the trace messages and error
logs.
tct_account_sslerror STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud provider
because of an error, which is found in the SSL subsystem.
User Action: For more information, check the trace messages and error
logs.
tct_account_sslhandshakeerror STATE_CHANGE ERROR no Message: The cloud account status failed due to an unknown SSL
handshake error.
Cause: TCT and cloud provider cannot negotiate the desired level of
security.
User Action: For more information, check the trace messages and error
logs.
tct_account_sslhandshakefailed STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud provider
because they cannot negotiate the desired level of security.
User Action: For more information, check the trace messages and error
logs.
tct_account_sslinvalidalgo STATE_CHANGE ERROR no Message: Transparent Cloud Tiering failed to connect to the cloud provider
because of invalid SSL algorithm parameters.
User Action: For more information, check the trace messages and error
logs.
tct_account_sslinvalidpadding STATE_CHANGE ERROR no Message: Transparent Cloud Tiering failed to connect to the cloud provider
because of invalid SSL padding.
User Action: For more information, check the trace messages and error
logs.
tct_account_sslkeyerror STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud provider
because of a bad SSL key or misconfiguration.
User Action: For more information, check the trace messages and error
logs.
tct_account_sslnocert STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud provider
because no certificate is available.
User Action: For more information, check the trace messages and error
logs.
tct_account_sslnottrustedcert STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud provider
because of an untrusted SSL server certificate.
User Action: For more information, check the trace messages and error
logs.
tct_account_sslpeererror STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud provider
because its identity cannot be verified.
User Action: For more information, check the trace messages and error
logs.
tct_account_sslprotocolerror STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud provider
because an error is found during the SSL protocol operation.
User Action: For more information, check the trace messages and error
logs.
tct_account_sslscoketclosed STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud provider
because a remote host closed the connection during a handshake.
User Action: For more information, check the trace messages and error
logs.
tct_account_sslunknowncert STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud provider
because of an unknown SSL certificate.
User Action: For more information, check the trace messages and error
logs.
tct_account_sslunrecognizedmsg STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud provider
because of an unrecognized SSL message.
User Action: For more information, check the trace messages and error
logs.
tct_account_timeskewerror STATE_CHANGE ERROR no Message: The time, which is observed on the Transparent Cloud Tiering
service node, is not in sync with the time on the target cloud provider.
tct_account_unknownerror STATE_CHANGE ERROR no Message: The cloud provider account is inaccessible due to an unknown
error.
User Action: For more information, check the trace messages and error
logs.
tct_account_unreachable STATE_CHANGE ERROR no Message: The cloud provider URL is unreachable.
User Action: For more information, check trace messages, error log, and
DNS settings.
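For example, name resolution of the cloud provider endpoint can be verified as follows (the hostname is a placeholder):
nslookup <cloud_provider_hostname>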
tct_container_alreadyexists STATE_CHANGE ERROR no Message: The cloud provider container creation failed as it already exists.
CSAP/Container pair set: {id}.
User Action: For more information, check the trace messages and error
log.
tct_container_creatererror STATE_CHANGE ERROR no Message: The cloud provider container creation failed. CSAP/Container pair
set: {id}.
User Action: For more information, check the trace messages and error
log.
tct_container_limitexceeded STATE_CHANGE ERROR no Message: The cloud provider container creation failed as it exceeded the
maximum limit. CSAP/Container pair set: {id}.
User Action: For more information, check the trace messages and error
log.
tct_container_notexists STATE_CHANGE ERROR no Message: The cloud provider container does not exist. CSAP/Container pair
set: {id}.
User Action: Check the cloud provider to verify whether the container
exists.
tct_csap_access_denied STATE_CHANGE ERROR no Message: Cloud storage access point failed due to an authorization error.
CSAP/Container pair set: {id}.
tct_csap_bad_req STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud storage
access point because of a bad request error. CSAP/Container pair set: {id}.
User Action: For more information, check the trace messages and error
log.
tct_csap_base_removed INFO_DELETE_ENTITY INFO no Message: CSAP {0} was deleted or converted to a CSAP or container pair.
tct_csap_certinvalidpath STATE_CHANGE ERROR no Message: Transparent Cloud Tiering failed to connect to the cloud storage access
point because it could not find a valid certification path. CSAP/Container
pair set: {id}.
User Action: For more information, check the trace messages and error
log.
tct_csap_configerror STATE_CHANGE ERROR no Message: Transparent Cloud Tiering refused to connect to the cloud
storage access point. CSAP/Container pair set: {id}.
tct_csap_connecterror STATE_CHANGE ERROR no Message: An error occurred while attempting to connect a socket to the
cloud storage access point URL. CSAP/Container pair set: {id}.
User Action: Check whether the cloud storage access point hostname and
port numbers are valid.
tct_csap_dbcorrupt STATE_CHANGE ERROR no Message: The database of Transparent Cloud Tiering service is corrupted.
CSAP/Container pair set: {id}.
tct_csap_forbidden STATE_CHANGE ERROR no Message: Cloud storage access point failed with an authorization error.
CSAP/Container pair set: {id}.
tct_csap_found INFO_ADD_ENTITY INFO no Message: CSAP or container pair {0} was found.
tct_csap_invalidcredentials STATE_CHANGE ERROR no Message: The cloud storage access point account {0} credentials are
invalid. CSAP/Container pair set: {id}.
Cause: Cloud storage access point account credentials are either changed
or expired.
tct_csap_invalidurl STATE_CHANGE ERROR no Message: Cloud storage access point URL is invalid. CSAP/Container pair
set: {id}.
tct_csap_lkm_down STATE_CHANGE ERROR no Message: The local key manager, which is configured for Transparent Cloud
Tiering, is not found or corrupted. CSAP/Container pair set: {id}.
User Action: For more information, check the trace messages and error
log.
tct_csap_malformedurl STATE_CHANGE ERROR no Message: Cloud storage access point URL is malformed. CSAP/Container
pair set: {id}.
tct_csap_noroute STATE_CHANGE ERROR no Message: The response from cloud storage access point is invalid. CSAP/
Container pair set: {id}.
Cause: The cloud storage access point URL returns a response code '-1'.
User Action: Check whether the cloud storage access point URL is
accessible.
tct_csap_online STATE_CHANGE INFO no Message: Cloud storage access point, which is configured with Transparent
Cloud Tiering service, is active. CSAP/Container pair set: {id}.
Cause: N/A
tct_csap_preconderror STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud storage
access point because of a precondition failed error. CSAP/Container pair
set: {id}.
Cause: Cloud storage access point URL returns an 'HTTP 412 Precondition
Failed' error message.
User Action: For more information, check the trace messages and error
log.
tct_csap_rkm_down STATE_CHANGE ERROR no Message: The remote key manager, which is configured for Transparent
Cloud Tiering, is inaccessible. CSAP/Container pair set: {id}.
Cause: The Transparent Cloud Tiering failed to connect to the IBM Security
Key Lifecycle Manager.
User Action: For more information, check the trace messages and error
log.
tct_csap_servererror STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud storage
access point because the cloud storage access point service encounters an
unavailability error. CSAP/Container pair set: {id}.
Cause: Cloud storage access point returned an 'HTTP 503 Server' error
message.
User Action: For more information, check the trace messages and error
log.
tct_csap_sockettimeout STATE_CHANGE ERROR no Message: Timeout occurred on a socket while connecting to the cloud
storage access point URL. CSAP/Container pair set: {id}.
User Action: For more information, check the trace messages and error
log. Also, check whether the network connection is valid.
tct_csap_sslbadcert STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud storage
access point because of a bad SSL certificate. CSAP/Container pair set: {id}.
User Action: For more information, check the trace messages and error
log.
tct_csap_sslcerterror STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud storage
access point because of an untrusted server certificate chain. CSAP/
Container pair set: {id}.
User Action: For more information, check the trace messages and error
log.
tct_csap_sslerror STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud storage
access point because of an error that is found in the SSL subsystem. CSAP/
Container pair set: {id}.
User Action: For more information, check the trace messages and error
log.
tct_csap_sslhandshakeerror STATE_CHANGE ERROR no Message: The cloud storage access point status failed due to an unknown
SSL handshake error. CSAP/Container pair set: {id}.
Cause: Transparent Cloud Tiering and cloud storage access point cannot
negotiate the desired level of security.
User Action: For more information, check the trace messages and error
log.
tct_csap_sslhandshakefailed STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud storage
access point because they cannot negotiate the desired level of security.
CSAP/Container pair set: {id}.
Cause: Transparent Cloud Tiering and cloud storage access point cannot
negotiate the desired level of security.
User Action: For more information, check the trace messages and error
log.
tct_csap_sslinvalidalgo STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud storage
access point because of invalid SSL algorithm parameters. CSAP/Container
pair set: {id}.
User Action: For more information, check the trace messages and error
log.
tct_csap_sslinvalidpadding STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud storage
access point because of invalid SSL padding. CSAP/Container pair set: {id}.
User Action: For more information, check the trace messages and error
log.
tct_csap_sslkeyerror STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud storage
access point because of a bad SSL key or misconfiguration. CSAP/Container
pair set: {id}.
User Action: For more information, check the trace messages and error
log.
tct_csap_sslnocert STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud storage
access point because no certificate is available. CSAP/Container pair set:
{id}.
User Action: For more information, check the trace messages and error
log.
tct_csap_sslnottrustedcert STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud storage
access point because of an untrusted server certificate. CSAP/Container
pair set: {id}.
Cause: The cloud storage access point server SSL certificate is untrusted.
User Action: For more information, check the trace messages and error
log.
tct_csap_sslpeererror STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud storage
access point because its identity cannot be verified. CSAP/Container pair
set: {id}.
User Action: For more information, check the trace messages and error
log.
tct_csap_sslprotocolerror STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud storage
access point because of an error that is found in the SSL protocol operation.
CSAP/Container pair set: {id}.
User Action: For more information, check the trace messages and error
log.
tct_csap_sslscoketclosed STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud storage
access point because the remote host closed the connection during a
handshake. CSAP/Container pair set: {id}.
User Action: For more information, check the trace messages and error
log.
tct_csap_sslunknowncert STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud storage
access point because of an unknown SSL certificate. CSAP/Container pair
set: {id}.
User Action: For more information, check the trace messages and error
log.
tct_csap_sslunrecognizedmsg STATE_CHANGE ERROR no Message: Transparent Cloud Tiering fails to connect to the cloud storage
access point because of an unrecognized SSL message. CSAP/Container
pair set: {id}.
User Action: For more information, check the trace messages and error
log.
tct_csap_timeskewerror STATE_CHANGE ERROR no Message: The time, which is observed on the Transparent Cloud Tiering
service node, is not in sync with the time on target cloud storage access
point. CSAP/Container pair set: {id}.
tct_csap_toomanyretries INFO WARNING no Message: Transparent Cloud Tiering service experienced too many internal
retries. CSAP/Container pair set: {id}.
User Action: For more information, check the trace messages and error
log.
tct_csap_unknownerror STATE_CHANGE ERROR no Message: The cloud storage access point account is inaccessible due to an
unknown error. CSAP/Container pair set: {id}.
User Action: For more information, check the trace messages and error
log.
tct_csap_unreachable STATE_CHANGE ERROR no Message: Cloud storage access point URL is unreachable. CSAP/Container
pair set: {id}.
User Action: For more information, check the trace messages, error logs,
and DNS settings.
tct_dir_corrupted STATE_CHANGE ERROR no Message: The directory of Transparent Cloud Tiering service is corrupted.
CSAP/Container pair set: {id}.
User Action: For more information, check the trace messages and error
log.
tct_fs_configured STATE_CHANGE INFO no Message: The Transparent Cloud Tiering is configured with the file system.
Cause: N/A
tct_fs_corrupted STATE_CHANGE ERROR no Message: The file system {0} of Transparent Cloud Tiering service is
corrupted. CSAP/Container pair set: {id}.
User Action: For more information, check the trace messages and error
log.
tct_fs_notconfigured STATE_CHANGE WARNING no Message: The Transparent Cloud Tiering is not configured with the file
system.
Description: The Transparent Cloud Tiering is not configured with the file
system.
Cause: The Transparent Cloud Tiering is installed, but the file system is not
configured or was deleted.
User Action: Run the mmcloudgateway filesystem create command to
configure the file system with Transparent Cloud Tiering.
tct_internal_direrror STATE_CHANGE ERROR no Message: Transparent Cloud Tiering failed because one of its internal
directories is not found. CSAP/Container pair set: {id}.
User Action: For more information, check the trace messages and error
log.
tct_km_error STATE_CHANGE ERROR no Message: The key manager, which is configured for Transparent Cloud
Tiering, is not found or corrupted. CSAP/Container pair set: {id}.
User Action: For more information, check the trace messages and error
log.
tct_network_interface_down STATE_CHANGE ERROR no Message: The network of Transparent Cloud Tiering node is down. CSAP/
Container pair set: {id}.
User Action: For more information, check the trace messages and error
log. Also, check whether the network connection is valid.
tct_only_ensure STATE_CHANGE INFO no Message: Transparent Cloud Tiering container is available on the cloud, but
this does not guarantee that migrate operations will work. Container pair
set: {id}.
User Action: For more information, check the trace messages and error
log.
tct_resourcefile_notfound STATE_CHANGE ERROR no Message: Transparent Cloud Tiering failed because resource address file is
not found. CSAP/Container pair set: {id}.
User Action: For more information, check the trace messages and error
log.
tct_rootdir_notfound STATE_CHANGE ERROR no Message: Transparent Cloud Tiering failed because its container pair root
directory is not found. Container pair set: {id}.
Cause: Transparent Cloud Tiering failed because its container pair root
directory is not found.
User Action: For more information, check the trace messages and error
log.
tct_service_down STATE_CHANGE ERROR no Message: The cloud gateway service is down.
Cause: The Transparent Cloud Tiering is not configured or its service is not
started.
User Action: Set up the Transparent Cloud Tiering and start its service.
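For example (the status subcommand is assumed to be available in your release):
mmcloudgateway service start
mmcloudgateway service status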
tct_service_restart INFO WARNING no Message: The cloud gateway service failed. Trying to recover.
Cause: N/A
tct_service_warn INFO WARNING no Message: The cloud gateway monitoring returned an unknown result.
Cause: N/A
Threshold events
The following table lists the events that are created for the threshold component.
Table 102. Events for the threshold component
activate_afm_inqueue_rule INFO INFO no Message: Detected AFM Gateway node {id}. Enabled AFM In Queue
Memory rule for threshold monitoring.
activate_smb_default_rules INFO INFO no Message: Detected new SMB exports. The SMBGlobalStats sensor on node
{id} is configured. Enabled SMBConnPerNode_Rule and SMBConnTotal_Rule
for threshold monitoring.
Cause: New SMB exports were detected, and the default rules that are
required were checked and enabled.
thresh_monitor_del_active INFO_DELETE_ENTITY INFO no Message: The threshold monitoring process is no longer running in ACTIVE
state on the local node.
thresh_monitor_lost_active INFO INFO no Message: The pmcollector on node {id} has lost the active role of the
threshold monitoring.
Cause: A pmcollector node lost the active role of the threshold monitoring.
thresh_monitor_set_active INFO_ADD_ENTITY INFO no Message: The threshold monitoring process is running in ACTIVE state on
the local node.
thresholds_error STATE_CHANGE ERROR no Message: The value of {1} for the component(s) {id} exceeded the
threshold error level {0} defined in {2}.
thresholds_no_data STATE_CHANGE INFO no Message: The value of {1} for the component(s) {id}, which is defined in {2},
returns no data.
thresholds_normal STATE_CHANGE INFO no Message: The value of {1} defined in {2} for component {id} reached a
normal level.
thresholds_removed STATE_CHANGE INFO no Message: The value of {1} for the component(s) {id}, which was defined in
{2}, was removed.
thresholds_warn STATE_CHANGE WARNING no Message: The value of {1} for the component(s) {id} exceeded the
threshold warning level {0} defined in {2}.
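For example, the threshold rules that generate these events can be reviewed as follows (output format varies by release):
mmhealth thresholds list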
Watchfolder events
The following table lists the events that are created for the Watchfolder component.
Table 103. Events for the Watchfolder component
watchc_service_failed STATE_CHANGE ERROR no Message: Watchfolder consumer {1} for file system {0} is not running.
Cause: N/A
watchc_service_ok STATE_CHANGE INFO no Message: Watchfolder consumer service for file system {0} is running.
Cause: N/A
watchc_warn STATE_CHANGE WARNING no Message: Warning is encountered in watchfolder consumer for file system
{0}.
Cause: N/A
watchconduit_err STATE_CHANGE ERROR no Message: {0} error: {1} is encountered in GPFS Watch Conduit for watch id
{id}.
Cause: N/A
watchconduit_found INFO_ADD_ENTITY INFO no Message: Watch conduit for watch id {id} was found.
Cause: N/A
watchconduit_ok STATE_CHANGE INFO no Message: GPFS Watch Conduit for watch id {id} is running.
Cause: N/A
watchconduit_replay_done STATE_CHANGE INFO no Message: Conduit has finished replaying {0} events for watch id {id}.
Cause: N/A
watchconduit_resume STATE_CHANGE INFO no Message: Conduit has finished producing {0} events to secondary sink for
watch id {id}.
Cause: N/A
watchconduit_suspended STATE_CHANGE INFO no Message: GPFS Watch Conduit for watch id {id} is suspended.
Cause: N/A
watchconduit_vanished INFO_DELETE_ENTITY INFO no Message: GPFS Watch Conduit for watch id {id} has vanished.
Description: GPFS Watch Conduit, which is listed in the IBM Storage Scale
configuration, has been removed.
Cause: N/A
watchconduit_warn STATE_CHANGE WARNING no Message: {0} warning: {1} is encountered for watch id {id}.
Cause: N/A
watchfolder_service_err STATE_CHANGE ERROR no Message: Error loading the librdkafka library for watchfolder producers.
Cause: N/A
watchfolderp_auth_err STATE_CHANGE ERROR no Message: Error obtaining authentication credentials for Kafka
authentication. Error message: {2}.
Cause: N/A
watchfolderp_auth_info TIP TIP no Message: Authentication information for Kafka is not present or outdated.
Request to update credentials has been started and new credentials are
used on next event. Message: {2}.
Cause: N/A
watchfolderp_auth_warn STATE_CHANGE WARNING no Message: Authentication credentials for Kafka could not be obtained.
Attempts to update the credentials are made later. Message: {2}.
Cause: N/A
watchfolderp_create_err STATE_CHANGE ERROR no Message: Error encountered while creating, loading, or configuring a new
event producer. Error message: {2}.
Cause: N/A
watchfolderp_found INFO_ADD_ENTITY INFO no Message: New event producer for {id} was configured.
Cause: N/A
watchfolderp_log_err STATE_CHANGE ERROR no Message: Error opening or writing to event producer log file.
Cause: N/A
watchfolderp_msg_send_err STATE_CHANGE ERROR no Message: Failed to send Kafka message for file system {2}. Error message:
{3}.
Cause: N/A
User Action: Check the connectivity to Kafka broker and topic, and
whether a broker can accept new messages for the given topic. For
more information, check '/var/adm/ras/mmfs.log.latest' and '/var/adm/ras/
mmmsgqueue.log'.
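For example, the logs that are referenced in the user action can be inspected as follows:
tail -n 100 /var/adm/ras/mmfs.log.latest
tail -n 100 /var/adm/ras/mmmsgqueue.log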
watchfolderp_msg_send_stop STATE_CHANGE ERROR no Message: Failed to send more than {2} Kafka messages. Producer is now
shut down and no more messages are sent.
Cause: N/A
watchfolderp_msgq_unsupported STATE_CHANGE ERROR no Message: Message queue is no longer supported and no clustered watch
folder or file audit logging commands can run until the message queue is
removed.
Cause: N/A
watchfolderp_ok STATE_CHANGE INFO no Message: Event producer for file system {2} is OK.
Cause: N/A
watchfolderp_vanished INFO_DELETE_ENTITY INFO no Message: An event producer for {id} has been removed.
Cause: N/A
TCT Account Status:

Unreachable
Description: The cloud provider access point URL is unreachable because of network issues or because the cloud provider access point URL is down.
User Action: Ensure that the cloud provider is online. Check whether the network is reachable between the cloud provider and Transparent cloud tiering. Also, check the DNS settings. For more information, check the trace messages and error log.

no_route_to_csp
Description: The response from the cloud storage access point is invalid.
User Action: Check whether the following conditions are met:
• DNS and firewall settings are configured.
• Network is reachable between Transparent cloud tiering and the cloud provider.

invalid_cloud_config
Description: Transparent cloud tiering refuses to connect to the CSAP.
User Action: Check whether the cloud object store is configured correctly. For a Swift cloud provider, check whether both the Keystone and Swift provider configurations are proper. Also, check whether Swift is reachable over Keystone.

credentials_invalid
Description: The Transparent cloud tiering service fails to connect to the CSAP because of failed authentication.
User Action: Check whether the access key and secret key are valid. Also, check whether the username and password are correct.

mcstore_node_network_down
Description: The network of the Transparent cloud tiering node is down.
User Action: Check whether the network interface on the Transparent cloud tiering node is proper and is able to communicate with public and private networks.

ssl_handshake_exception
Description: The CSAP fails due to an unknown SSL handshake error.
User Action: Check whether the following conditions are met:
• The cloud provider supports secured communication and is properly configured with a certificate chain.
• The provided cloud provider URL is secure (HTTPS).
• A secure connection to the cloud provider is established by running the openssl s_client -connect <cloud_provider_ipaddress>:<secured_port> command on a Transparent cloud tiering node.

SSL handshake sock closed exception
Description: Transparent cloud tiering fails to connect to the CSAP because the remote host closed the connection during the handshake.
User Action: Check whether the following conditions are met:
• A secure network connection is established and the secure port is reachable.
• A secured connection is established to the cloud provider by running the openssl s_client -connect <cloud_provider_ipaddress>:<secured_port> command on the Transparent cloud tiering node.

SSL handshake bad certificate exception
Description: Transparent cloud tiering fails to connect to the CSAP because the server certificate does not exist in the truststore.
User Action: Ensure that a self-signed or internal CA-signed certificate is properly added to the Transparent cloud tiering truststore. Use the --server-cert-path option to add a self-signed certificate.

SSL handshake failure exception
Description: Transparent cloud tiering fails to connect to the CSAP because it might not negotiate the required level of security.
User Action: Ensure that the cloud provider supports the TLSv1.2 protocol and TLSv1.2-enabled cipher suites.

SSL handshake unknown certificate exception
Description: Transparent cloud tiering fails to connect to the CSAP because of an unknown certificate.
User Action: Ensure that a proper self-signed or internal CA-signed certificate is added to the Transparent cloud tiering truststore. Use the --server-cert-path option to add a self-signed certificate.

SSL key exception
Description: Transparent cloud tiering fails to connect to the CSAP because of a bad SSL key or misconfiguration.
User Action: Check whether the following conditions are met:
• The SSL configuration on the cloud provider is proper.
• The Transparent cloud tiering truststore, /var/MCStore/.mcstore.jceks, is not corrupted. If /var/MCStore/.mcstore.jceks is corrupted, remove it and restart the server. This action replaces /var/MCStore/.mcstore.jceks from the CCR file.

SSL protocol exception
Description: Transparent cloud tiering failed to connect to the cloud provider because of an error in the operation of the SSL protocol.
User Action: For more information, check trace messages and error logs.

SSL exception
Description: Transparent cloud tiering failed to connect to the cloud provider because of an error in the SSL subsystem.
User Action: For more information, check trace messages and error logs.

SSL no certificate exception
Description: Transparent cloud tiering failed to connect to the cloud provider because a certificate was not available.
User Action: For more information, check trace messages and error logs.

SSL not trusted certificate exception
Description: Transparent cloud tiering failed to connect to the cloud provider because it might not locate a trusted server certificate.
User Action: For more information, check trace messages and error logs.

SSL invalid algorithm exception
Description: Transparent cloud tiering failed to connect to the cloud provider because of invalid or inappropriate SSL algorithm parameters.
User Action: For more information, check trace messages and error logs.

SSL invalid padding exception
Description: Transparent cloud tiering failed to connect to the cloud provider because of invalid SSL padding.
User Action: For more information, check trace messages and error logs.

SSL unrecognized message
Description: Transparent cloud tiering failed to connect to the cloud provider because of an unrecognized SSL message.
User Action: For more information, check trace messages and error logs.

Bad request
Description: Transparent cloud tiering failed to connect to the cloud provider because of a request error.
User Action: For more information, check trace messages and error logs.

Precondition failed
Description: Transparent cloud tiering failed to connect to the cloud provider because of a precondition failed error.
User Action: For more information, check trace messages and error logs.

Default exception
Description: The cloud provider account is not accessible due to an unknown error.
User Action: For more information, check trace messages and error logs.

Time skew
Description: The time that is observed on the Transparent cloud tiering service node is not in sync with the time on the target cloud provider.
User Action: Change the Transparent cloud tiering service node timestamp to be in sync with the NTP server and rerun the operation.

Server error
Description: Transparent cloud tiering failed to connect to the cloud provider because of a cloud provider server error (HTTP 503) or because the container size reached the maximum storage limit.
User Action: For more information, check trace messages and error logs.

Internal directory not found
Description: Transparent cloud tiering failed because one of its internal directories is not found.
User Action: For more information, check trace messages and error logs.

Database corrupted
Description: The database of the Transparent cloud tiering service is corrupted.
User Action: For more information, check trace messages and error logs. Run the mmcloudgateway files rebuildDB command to repair the database when any issues are found.

TCT File System Status:

Not configured
Description: Transparent cloud tiering is installed, but the file system is not configured or it was deleted.
User Action: Run the mmcloudgateway filesystem create command to configure the file system.

Configured
Description: The Transparent cloud tiering is configured with a file system.
User Action: N/A

TCT Server Status:

Stopped
Description: The Transparent cloud tiering service is stopped by a CLI command or stopped itself due to some error.
User Action: Run the mmcloudgateway service start command to start the cloud gateway service.

OFFLINE
Description: The remote key manager that is configured for Transparent cloud tiering is not accessible.
Cause: The SKLM server is not accessible. This condition is valid only when ISKLM is configured.

OFFLINE
Description: The local key manager that is configured for Transparent cloud tiering is either not found or corrupted.
Cause: The local .jks file is not found or corrupted. This condition is valid only when a local key manager is configured.

TCT Service:

STOPPED
Description: The cloud gateway service is down and might not be started.
Cause: The server stopped abruptly, for example, because of a JVM crash.

STARTED
Description: The cloud gateway service is up and running.
Cause: The TCT service is started and running.

Unknown
Description: The cloud gateway check returns an unknown result.
Cause: The TCT service status has an unknown value.
{"typeURI":"http://schemas.dmtf.org/cloud/audit/1.0/event","eventType":"activity","id":
"b4e9a5a9-0bf7-45ee-9e93-b6f825781328","eventTime":"2017-08-21T18:46:10.439 UTC","action":
"create/create_cloudaccount","outcome":"success","initiator":{"id":"b22ec254-d645-43c4-
a402-3e15757d8463",
"typeURI":"data/security/account/admin","name":"root","host":{"address":"192.0.2.0"}},"target":
{"id":"58347894-6a10-4218-a66d-357e4a3f4aaf","typeURI":"service/storage/object/account","name":
"tct.cloudstorageaccesspoint"},"observer":{"id":"target"},"attachments":[{"content":"account-
name=
swift-account, cloud-type=openstack-swift, username=admin, tenant=admin, src-keystore-
path=null,
src-alias-name=null, src-keystore-type=null","name":"swift-account","contentType":"text"}]}
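For example, a record like this one can be pretty-printed for easier reading, assuming it was saved to a file named audit_event.json (a hypothetical file name):
python3 -m json.tool audit_event.json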
Messages
This topic contains explanations for GPFS error messages.
Messages for IBM Storage Scale RAID in the ranges 6027-1850 – 6027-1899 and 6027-3000 –
6027-3099 are documented in IBM Storage Scale RAID: Administration.
[E] or [E:nnn]
If more than one substring within a message matches this pattern (for example, [A] or [A:nnn]), the
severity tag is the first such matching string.
When the severity tag includes a numeric code (nnn), this is an error code associated with the message. If
this were the only problem encountered by the command, the command return code would be nnn.
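For example, after a command fails with a message that carries a severity tag such as [E:nnn], the return code can be confirmed from the shell; echo $? prints the return code of the most recently run command (mmgetstate is used here only as an illustration):
mmgetstate
echo $?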
6027-000 Attention: A disk being removed reduces the number of failure groups to nFailureGroups, which is below the number required for replication: nReplicas.
Explanation:
Replication cannot protect data against disk failures when there are insufficient failure groups.
User response:
Add more disks in new failure groups to the file system or accept the risk of data loss.
6027-300 [N] mmfsd ready
Explanation:
The mmfsd server is up and running.
User response:
None. Informational message only.
6027-301 File fileName could not be run with err errno.
Explanation:
The named shell script could not be executed. This message is followed by the error string that is returned by the exec.
User response:
Check file existence and access permissions.
6027-302 [E] Could not execute script
Explanation:
The verifyGpfsReady=yes configuration attribute is set, but the /var/mmfs/etc/gpfsready script could not be executed.
User response:
Make sure /var/mmfs/etc/gpfsready exists and is executable, or disable the verifyGpfsReady option via mmchconfig verifyGpfsReady=no.
6027-303 [N] script killed by signal signal
Explanation:
The verifyGpfsReady=yes configuration attribute is set and /var/mmfs/etc/gpfsready script did not complete successfully.
User response:
Make sure /var/mmfs/etc/gpfsready completes and returns a zero exit status, or disable the verifyGpfsReady option via mmchconfig verifyGpfsReady=no.
6027-304 [W] script ended abnormally
Explanation:
The verifyGpfsReady=yes configuration attribute is set and /var/mmfs/etc/gpfsready script did not complete successfully.
User response:
Make sure /var/mmfs/etc/gpfsready completes and returns a zero exit status, or disable the verifyGpfsReady option via mmchconfig verifyGpfsReady=no.
6027-305 [N] script failed with exit code code
Explanation:
The verifyGpfsReady=yes configuration attribute is set and /var/mmfs/etc/gpfsready script did not complete successfully.
User response:
Make sure /var/mmfs/etc/gpfsready completes and returns a zero exit status, or disable the verifyGpfsReady option via mmchconfig verifyGpfsReady=no.
6027-306 [E] Could not initialize inter-node communication.
Explanation:
The GPFS daemon was unable to initialize the communications required to proceed.
User response:
User action depends on the return code shown in the accompanying message (/usr/include/errno.h). The communications failure that caused the failure must be corrected. One possibility is an rc value of 67, indicating that the required port is unavailable. This may mean that a previous version of the mmfs daemon is still running. Killing that daemon may resolve the problem.
6027-307 [E] All tries for command thread allocation failed for msgCommand minor commandMinorNumber
Explanation:
Accessibility features
The following list includes the major accessibility features in IBM Storage Scale:
• Keyboard-only operation
• Interfaces that are commonly used by screen readers
• Keys that are discernible by touch but do not activate just by touching them
• Industry-standard devices for ports and connectors
• The attachment of alternative input and output devices
IBM Documentation, and its related publications, are accessibility-enabled.
Keyboard navigation
This product uses standard Microsoft Windows navigation keys.
IBM Director of Licensing IBM Corporation North Castle Drive, MD-NC119 Armonk, NY 10504-1785 US
For license inquiries regarding double-byte character set (DBCS) information, contact the IBM Intellectual
Property Department in your country or send inquiries, in writing, to:
Intellectual Property Licensing Legal and Intellectual Property Law IBM Japan Ltd. 19-21, Nihonbashi-
Hakozakicho, Chuo-ku Tokyo 103-8510, Japan
INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS"
WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A
PARTICULAR PURPOSE. Some jurisdictions do not allow disclaimer of express or implied warranties in
certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically
made to the information herein; these changes will be incorporated in new editions of the publication.
IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this
publication at any time without notice.
Any references in this information to non-IBM websites are provided for convenience only and do not in
any manner serve as an endorsement of those websites. The materials at those websites are not part of
the materials for this IBM product and use of those websites is at your own risk.
IBM may use or distribute any of the information you provide in any way it believes appropriate without
incurring any obligation to you.
Licensees of this program who wish to have information about it for the purpose of enabling: (i) the
exchange of information between independently created programs and other programs (including this
one) and (ii) the mutual use of the information which has been exchanged, should contact:
IBM Director of Licensing IBM Corporation North Castle Drive, MD-NC119 Armonk, NY 10504-1785 US
Such information may be available, subject to appropriate terms and conditions, including in some cases,
payment of a fee.
The licensed program described in this document and all licensed material available for it are provided by
IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement or any
equivalent agreement between us.
The performance data discussed herein is presented as derived under specific operating conditions.
Actual results may vary.
Information concerning non-IBM products was obtained from the suppliers of those products, their
published announcements or other publicly available sources. IBM has not tested those products and
cannot confirm the accuracy of performance, compatibility, or any other claims related to non-IBM products.
Each copy or any portion of these sample programs or any derivative work must include
a copyright notice as follows:
If you are viewing this information softcopy, the photographs and color illustrations may not appear.
Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business
Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be
trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at
Copyright and trademark information at www.ibm.com/legal/copytrade.shtml.
Intel is a trademark of Intel Corporation or its subsidiaries in the United States and other countries.
Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or
its affiliates.
The registered trademark Linux is used pursuant to a sublicense from the Linux Foundation, the exclusive
licensee of Linus Torvalds, owner of the mark on a worldwide basis.
Microsoft and Windows are trademarks of Microsoft Corporation in the United States, other countries, or
both.
Red Hat, OpenShift®, and Ansible® are trademarks or registered trademarks of Red Hat, Inc. or its
subsidiaries in the United States and other countries.
UNIX is a registered trademark of the Open Group in the United States and other countries.
IBM Privacy Policy
At IBM we recognize the importance of protecting your personal information and are committed to
processing it responsibly and in compliance with applicable data protection laws in all countries in which
IBM operates.
Visit the IBM Privacy Policy for additional information on this topic at https://www.ibm.com/privacy/
details/us/en/.
Applicability
These terms and conditions are in addition to any terms of use for the IBM website.
Personal use
You can reproduce these publications for your personal, noncommercial use provided that all proprietary
notices are preserved. You cannot distribute, display, or make derivative work of these publications, or
any portion thereof, without the express consent of IBM.
Commercial use
You can reproduce, distribute, and display these publications solely within your enterprise provided
that all proprietary notices are preserved. You cannot make derivative works of these publications, or
reproduce, distribute, or display these publications or any portion thereof outside your enterprise, without
the express consent of IBM.
Rights
Except as expressly granted in this permission, no other permissions, licenses, or rights are granted,
either express or implied, to the Publications or any information, data, software or other intellectual
property contained therein.
IBM reserves the right to withdraw the permissions that are granted herein whenever, in its discretion, the
use of the publications is detrimental to its interest or as determined by IBM, the above instructions are
not being properly followed.
You cannot download, export, or reexport this information except in full compliance with all applicable
laws and regulations, including all United States export laws and regulations.
IBM MAKES NO GUARANTEE ABOUT THE CONTENT OF THESE PUBLICATIONS. THE PUBLICATIONS
ARE PROVIDED "AS-IS" AND WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED,
INCLUDING BUT NOT LIMITED TO IMPLIED WARRANTIES OF MERCHANTABILITY, NON-INFRINGEMENT,
AND FITNESS FOR A PARTICULAR PURPOSE.
Glossary
This glossary provides terms and definitions for IBM Storage Scale.
The following cross-references are used in this glossary:
• See refers you from a nonpreferred term to the preferred term or from an abbreviation to the spelled-
out form.
• See also refers you to a related or contrasting term.
For other terms and definitions, see the IBM Terminology website (www.ibm.com/software/globalization/
terminology) (opens in new window).
B
block utilization
The measurement of the percentage of used subblocks per allocated blocks.
C
cluster
A loosely coupled collection of independent systems (nodes) organized into a network for the purpose
of sharing resources and communicating with each other. See also GPFS cluster.
cluster configuration data
The configuration data that is stored on the cluster configuration servers.
Cluster Export Services (CES) nodes
A subset of nodes configured within a cluster to provide a solution for exporting GPFS file systems by
using the Network File System (NFS), Server Message Block (SMB), and Object protocols.
cluster manager
The node that monitors node status using disk leases, detects failures, drives recovery, and selects
file system managers. The cluster manager must be a quorum node. The selection of the cluster
manager node favors the quorum-manager node with the lowest node number among the nodes that
are operating at that particular time.
Note: The cluster manager role is not moved to another node when a node with a lower node number
becomes active.
clustered watch folder
Provides a scalable and fault-tolerant method for file system activity within an IBM Storage Scale
file system. A clustered watch folder can watch file system activity on a fileset, inode space, or an
entire file system. Events are streamed to an external Kafka sink cluster in an easy-to-parse JSON
format. For more information, see the mmwatch command in the IBM Storage Scale: Command and
Programming Reference Guide.
control data structures
Data structures needed to manage file data and metadata cached in memory. Control data structures
include hash tables and link pointers for finding cached data; lock states and tokens to implement
distributed locking; and various flags and sequence numbers to keep track of updates to the cached
data.
D
Data Management Application Program Interface (DMAPI)
The interface defined by the Open Group's XDSM standard as described in the publication
System Management: Data Storage Management (XDSM) API Common Application Environment (CAE)
Specification C429, The Open Group ISBN 1-85912-190-X.
E
ECKD
See extended count key data (ECKD).
ECKD device
See extended count key data device (ECKD device).
encryption key
A mathematical value that allows components to verify that they are in communication with the
expected server. Encryption keys are based on a public or private key pair that is created during the
installation process. See also file encryption key, master encryption key.
extended count key data (ECKD)
An extension of the count-key-data (CKD) architecture. It includes additional commands that can be
used to improve performance.
extended count key data device (ECKD device)
A disk storage device that has a data transfer rate faster than some processors can utilize and that is
connected to the processor through use of a speed matching buffer. A specialized channel program is
needed to communicate with such a device. See also fixed-block architecture disk device.
F
failback
Cluster recovery from failover following repair. See also failover.
failover
(1) The assumption of file system duties by another node when a node fails. (2) The process of
transferring all control of the ESS to a single cluster in the ESS when the other clusters in the ESS fail.
See also cluster. (3) The routing of all transactions to a second controller when the first controller fails.
See also cluster.
failure group
A collection of disks that share common access paths or adapter connections, and could all become
unavailable through a single hardware failure.
FEK
See file encryption key.
G
GPUDirect Storage
IBM Storage Scale's support for NVIDIA's GPUDirect Storage (GDS) enables a direct path between
GPU memory and storage. File system storage is directly connected to the GPU buffers to reduce
latency and load on CPU. Data is read directly from an NSD server's pagepool and it is sent to the GPU
buffer of the IBM Storage Scale clients by using RDMA.
global snapshot
A snapshot of an entire GPFS file system.
GPFS cluster
A cluster of nodes defined as being available for use by GPFS file systems.
GPFS portability layer
The interface module that each installation must build for its specific hardware platform and Linux
distribution.
GPFS recovery log
A file that contains a record of metadata activity and exists for each node of a cluster. In the event of
a node failure, the recovery log for the failed node is replayed, restoring the file system to a consistent
state and allowing other nodes to continue working.
I
ill-placed file
A file assigned to one storage pool but having some or all of its data in a different storage pool.
ill-replicated file
A file with contents that are not correctly replicated according to the desired setting for that file. This
situation occurs in the interval between a change in the file's replication settings or suspending one of
its disks, and the restripe of the file.
independent fileset
A fileset that has its own inode space.
indirect block
A block containing pointers to other blocks.
inode
The internal structure that describes the individual files in the file system. There is one inode for each
file.
inode space
A collection of inode number ranges reserved for an independent fileset, which enables more efficient
per-fileset functions.
ISKLM
IBM Security Key Lifecycle Manager. For GPFS encryption, the ISKLM is used as an RKM server to
store MEKs.
J
journaled file system (JFS)
A technology designed for high-throughput server environments, which are important for running
intranet and other high-performance e-business file servers.
junction
A special directory entry that connects a name in a directory of one fileset to the root directory of
another fileset.
K
kernel
The part of an operating system that contains programs for such tasks as input/output, management
and control of hardware, and the scheduling of user tasks.
M
master encryption key (MEK)
A key used to encrypt other keys. See also encryption key.
MEK
See master encryption key.
metadata
Data structures that contain information that is needed to access file data. Metadata includes inodes,
indirect blocks, and directories. Metadata is not accessible to user applications.
metanode
The one node per open file that is responsible for maintaining file metadata integrity. In most cases,
the node that has had the file open for the longest period of continuous time is the metanode.
N
namespace
Space reserved by a file system to contain the names of its objects.
Network File System (NFS)
A protocol, developed by Sun Microsystems, Incorporated, that allows any host in a network to gain
access to another host or netgroup and their file directories.
Network Shared Disk (NSD)
A component for cluster-wide disk naming and access.
NSD volume ID
A unique 16-digit hex number that is used to identify and access all NSDs.
node
An individual operating-system image within a cluster. Depending on the way in which the computer
system is partitioned, it may contain one or more nodes.
node descriptor
A definition that indicates how GPFS uses a node. Possible functions include: manager node, client
node, quorum node, and nonquorum node.
node number
A number that is generated and maintained by GPFS as the cluster is created, and as nodes are added
to or deleted from the cluster.
node quorum
The minimum number of nodes that must be running in order for the daemon to start.
node quorum with tiebreaker disks
A form of quorum that allows GPFS to run with as little as one quorum node available, as long as there
is access to a majority of the quorum disks.
non-quorum node
A node in a cluster that is not counted for the purposes of quorum determination.
Non-Volatile Memory Express (NVMe)
An interface specification that allows host software to communicate with non-volatile memory
storage media.
P
policy
A list of file-placement, service-class, and encryption rules that define characteristics and placement
of files. Several policies can be defined within the configuration, but only one policy set is active at one
time.
policy rule
A programming statement within a policy that defines a specific action to be performed.
pool
A group of resources with similar characteristics and attributes.
portability
The ability of a programming language to compile successfully on different operating systems without
requiring changes to the source code.
primary GPFS cluster configuration server
In a GPFS cluster, the node chosen to maintain the GPFS cluster configuration data.
private IP address
An IP address used to communicate on a private network.
public IP address
An IP address used to communicate on a public network.
Q
quorum node
A node in the cluster that is counted to determine whether a quorum exists.
quota
The amount of disk space and number of inodes assigned as upper limits for a specified user, group of
users, or fileset.
quota management
The allocation of disk blocks to the other nodes writing to the file system, and comparison of the
allocated space to quota limits at regular intervals.
R
Redundant Array of Independent Disks (RAID)
A collection of two or more disk physical drives that present to the host an image of one or more
logical disk drives. In the event of a single physical device failure, the data can be read or regenerated
from the other disk drives in the array due to data redundancy.
recovery
The process of restoring access to file system data when a failure has occurred. Recovery can involve
reconstructing data or providing alternative routing through a different server.
remote key management server (RKM server)
A server that is used to store master encryption keys.
replication
The process of maintaining a defined set of data in more than one location. Replication consists of
copying designated changes for one location (a source) to another (a target) and synchronizing the
data in both locations.
RKM server
See remote key management server.
rule
A list of conditions and actions that are triggered when certain conditions are met. Conditions include
attributes about an object (file name, type or extension, dates, owner, and groups), the requesting
client, and the container name associated with the object.
S
SAN-attached
Disks that are physically attached to all nodes in the cluster using Serial Storage Architecture (SSA)
connections or using Fibre Channel switches.
Scale Out Backup and Restore (SOBAR)
A specialized mechanism for data protection against disaster only for GPFS file systems that are
managed by IBM Storage Protect for Space Management.
secondary GPFS cluster configuration server
In a GPFS cluster, the node chosen to maintain the GPFS cluster configuration data in the event that
the primary GPFS cluster configuration server fails or becomes unavailable.
T
token management
A system for controlling file access in which each application performing a read or write operation
is granted some form of access to a specific block of file data. Token management provides data
consistency and controls conflicts. Token management has two components: the token management
server, and the token management function.
token management function
A component of token management that requests tokens from the token management server. The
token management function is located on each cluster node.
token management server
A component of token management that controls tokens relating to the operation of the file system.
The token management server is located at the file system manager node.
transparent cloud tiering (TCT)
A separately installable add-on feature of IBM Storage Scale that provides a native cloud storage tier.
It allows data center administrators to free up on-premises storage capacity by moving cooler data to
cloud storage, thereby reducing capital and operational expenditures.
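For illustration, individual files can be moved to and from the cloud tier with the mmcloudgateway command; the file name below is an assumption, and the invocation is a sketch rather than a verified procedure:

# Manually migrate an assumed cold file to the configured cloud tier.
mmcloudgateway files migrate /gpfs0/archive/olddata.tar
# Recall the file later if it is needed on premises again.
mmcloudgateway files recall /gpfs0/archive/olddata.tar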
twin-tailed
A disk connected to two nodes.
U
user storage pool
A storage pool containing the blocks of data that make up user files.
V
VFS
See virtual file system.
virtual file system (VFS)
A remote file system that has been mounted so that it is accessible to the local user.
virtual node (vnode)
The structure that contains information about a file system object in a virtual file system (VFS).
W
watch folder API
A programming interface that allows a custom C program to monitor inode spaces, filesets, or
directories for specific user activity-related events within IBM Storage Scale file systems. A sample
program named tswf is provided in the /usr/lpp/mmfs/samples/util directory on IBM Storage
Scale nodes and can be modified according to the user's needs.
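A minimal sketch for getting started with that sample; the working directory is an assumption, and build steps vary by release:

# Copy the shipped sample to a scratch directory before adapting it.
mkdir -p /tmp/wf-sample
cp /usr/lpp/mmfs/samples/util/tswf* /tmp/wf-sample/
# Review the source and adjust the watched fileset or directory
# before compiling it for your environment.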